| Week 1 | 8/24 | Introduction to LLM [slides] | | |
| | 8/26 | GPU Programming Basics 1 [slides] | Chap 2, 4 of Programming Massively Parallel Processors, 4th Ed | HW1 Released |
| | 8/28 | Recitation 1 | PSC Guidelines, Simple CUDA Demo | |
| Week 2 | 8/31 | GPU Programming Basics 2 [slides] | Chap 3 of Programming Massively Parallel Processors, 4th Ed | |
| | 9/2 | GPU Acceleration [slides] | Chap 5, 6 of Programming Massively Parallel Processors, 4th Ed | |
| Week 3 | 9/7 | no class | | |
| | 9/9 | Deep Learning Frameworks and Auto Differentiation [slides] | TensorFlow, Auto Diff survey, Differentiable Programming | HW1 Due, HW2 Released |
| | 9/11 | Recitation 2 | HW2, MiniTorch, More GPU | |
| Week 4 | 9/14 | Transformer [slides] | Attention Is All You Need | |
| | 9/16 | Pre-trained LLMs [slides] | LLaMA, GPT-3, Annotated Transformer | HW2 Due, HW3 Released |
| | 9/18 | Recitation 3 | Annotated Transformer | |
| Week 5 | 9/21 | Tokenization and Embedding [slides] | BPE, SentencePiece, VOLT | |
| | 9/23 | Generation and Speculative Decoding [slides] | | |
| | 9/25 | Recitation 4 | Decoding | |
| Week 6 | 9/28 | Accelerating Transformer on GPU Part 1 [slides] | LightSeq | |
| | 9/30 | Accelerating Transformer on GPU Part 2 [slides] | LightSeq2 | HW3 Due |
| | 10/2 | Recitation 5 | LightSeq | Project Team Due |
| Week 7 | 10/5 | Guest Lecture by Srinath Mandalapu (Google): TPU and JAX [slides] | | |
| | 10/7 | Guest Lecture by Srinath Mandalapu (Google): Pallas and Splash Attention [slides] | | |
| | 10/9 | | | Project Proposal Due |
| Week 8 | 10/12 | fall break | | |
| Week 9 | 10/19 | Distributed Model Training [slides] | | |
| | 10/21 | Distributed Model Training II [slides] | DDP | HW4 Due |
| Week 10 | 10/26 | Distributed Model Training III [slides] | GPipe, Megatron-LM | |
| | 10/28 | Large Models with Mixture-of-Experts [slides] | GShard, Switch Transformer, DeepSpeed-MoE, DeepSeekMoE | |
| | 10/30 | Recitation 6 | Distributed Training | |
| Week 11 | 11/2 | Memory Optimization in Distributed Training [slides] | ZeRO (DeepSpeed) | |
| | 11/4 | Model Quantization [slides] | | HW5 Due |
| Week 12 | 11/9 | Model Quantization II [slides] | GPTQ | |
| | 11/11 | Optimizing Attention for Modern Hardware (Tri Dao) [slides] | FlashAttention, FlashAttention2, FlashAttention3, FlashAttention4 | Mid-term Report Due |
| Week 13 | 11/16 | LLM Serving with SGL [slides] | Orca, SGLang | |
| | 11/18 | Efficient Fine-Tuning for Large Models [slides] | CIAT, LoRA, QLoRA | |
| Week 14 | 11/23 | Efficient LLM Inference with Paged Attention and vLLM (Woosuk Kwon) [slides] | vLLM | HW6 Due |
| | 11/25 | no class | | |
| Week 15 | 11/30 | Efficient Reinforcement Learning System for LLMs | ReaLHF | |
| | 12/2 | Serving with Disaggregated Prefill-Decoding [slides] | DistServe | HW7 Due |
| Week 16 | 12/11 | Final project presentation | | |
| | 12/7 | | | Final Report Due |
| | | Better KV Cache for LLM Serving (Junchen Jiang) [slides] | CacheGen, CacheBlend | |
| | | DistServe: Disaggregated Prefill-Decoding (Hao Zhang) [slides] | DistServe | |
| | | App Stack and Model Serving [slides] | Triton, LightLLM | |
| | | Triton for Kernel Optimization | JAX | |
| | | Retrieval-augmented Language Models | RAG | |
| | | Nearest Vector Search for Embeddings | HNSW | |
| | | Multimodal LLMs | Flamingo | |
| | | Efficient Streaming Language Models with Attention Sinks | Attention Sink | |
| | | LLM Serving on Heterogeneous Hardware [slides] | Mooncake, KTransformers | |