Syllabus

| Week | Date | Topic | Reading/Content | Homework |
| --- | --- | --- | --- | --- |
| Week 1 | 1/12 | Introduction to LLM [slides] | | |
| | 1/14 | GPU Programming Basics 1 [slides] | Chap 2, 4 of Programming Massively Parallel Processors, 4th Ed | HW1 Released |
| | 1/16 | Recitation 1 | PSC Guidelines, Simple CUDA Demo | |
| Week 2 | 1/19 | No class | | |
| | 1/21 | GPU Programming Basics 2 [slides] | Chap 3 of Programming Massively Parallel Processors, 4th Ed | |
| Week 3 | 1/26 | GPU Acceleration [slides] | Chap 5, 6 of Programming Massively Parallel Processors, 4th Ed | |
| | 1/28 | Deep Learning Frameworks and Auto Differentiation [slides] [slides] | TensorFlow, Auto Diff survey, Differentiable Programming | HW1 Due |
| | 1/30 | Recitation 2 | More examples, MiniTorch | |
| Week 4 | 2/2 | TPU and Acceleration | | |
| | 2/4 | Deep Learning Compilation and JAX | | HW2 Due |
| Week 5 | 2/9 | Transformer [slides] | Attention Is All You Need | |
| | 2/11 | Pre-trained LLMs [slides] | LLaMA, GPT-3, Annotated Transformer | |
| Week 6 | 2/16 | Tokenization and Embedding [slides] | BPE, SentencePiece, VOLT | |
| | 2/18 | Generation and Speculative Decoding [slides] | | |
| | 2/20 | Recitation 3 | Annotated Transformer & Decoding | |
| Week 7 | 2/23 | Accelerating Transformer on GPU, Part 1 [slides] | LightSeq | |
| | 2/25 | Accelerating Transformer on GPU, Part 2 [slides] | LightSeq2 | HW3 Due |
| | 2/27 | Recitation 4 | LightSeq | Project Proposal Due |
| Week 8 | 3/2 | Spring break | | |
| Week 9 | 3/9 | Distributed Model Training [slides] | | |
| | 3/11 | Distributed Model Training II [slides] | DDP | |
| Week 11 | 3/16 | Distributed Model Training III [slides] | GPipe, Megatron-LM | HW4 Due |
| | 3/18 | Large Models with Mixture-of-Experts [slides] | GShard, Switch Transformer, DeepSpeed-MoE, DeepSeek-MoE | |
| | 3/20 | Recitation 5 | Distributed Training | |
| Week 12 | 3/23 | Memory Optimization in Distributed Training [slides] | ZeRO (DeepSpeed) | |
| | 3/25 | Model Quantization [slides] | | HW5 Due |
| Week 13 | 3/30 | Optimizing Attention for Modern Hardware (Tri Dao) [slides] | FlashAttention | |
| | 4/1 | Model Quantization II [slides] | GPTQ | Mid-term Report Due |
| Week 14 | 4/6 | LLM Serving with SGL [slides] | SGLang | |
| | 4/8 | Efficient LLM Inference with Paged Attention and vLLM (Woosuk Kwon) [slides] | vLLM | HW6 Due |
| Week 15 | 4/13 | Efficient Fine-tuning for Large Models [slides] | CIAT, LoRA, QLoRA | |
| | 4/15 | Efficient Reinforcement Learning System for LLMs (Yi Wu) | ReaLHF | |
| Week 16 | 4/20 | Serving with Disaggregated Prefill-Decoding (Vikram Sharma Mailthody) [slides] | DistServe | HW7 Due |
| | 4/22 | LLM Serving on Heterogeneous Hardware (Mingxing Zhang) | | |
| Week 17 | 4/27 | Final project presentation | | |
| | 4/28 | Final report due | | |
| Topic | Reading/Content |
| --- | --- |
| Better KV Cache for LLM Serving (Yuhan Liu) [slides] | CacheGen, CacheBlend |
| DistServe: Disaggregated Prefill-Decoding (Hao Zhang) [slides] | DistServe |
| App Stack and Model Serving [slides] | Triton, LightLLM |
| Triton for Kernel Optimization | JAX |
| Retrieval-augmented Language Models | RAG |
| Nearest Vector Search for Embeddings | HNSW |
| Multimodal LLMs | Flamingo |
| Efficient Streaming Language Models with Attention Sinks | Attention Sink |