| Week | Date | Lecture | Readings | Announcements |
| --- | --- | --- | --- | --- |
| Week 1 | 1/12 | Introduction to LLMs [slides] | | |
| | 1/14 | GPU Programming Basics 1 [slides] | Chapters 2 and 4 of Programming Massively Parallel Processors, 4th Ed. | HW1 Released |
| | 1/16 | Recitation 1 | PSC Guidelines, Simple CUDA Demo | |
| Week 2 | 1/19 | No class | | |
| | 1/21 | GPU Programming Basics 2 [slides] | Chapter 3 of Programming Massively Parallel Processors, 4th Ed. | |
| Week 3 | 1/26 | GPU Acceleration [slides] | Chapters 5 and 6 of Programming Massively Parallel Processors, 4th Ed. | |
| | 1/28 | Deep Learning Frameworks and Auto Differentiation [slides] [slides] | TensorFlow, Auto Diff Survey, Differentiable Programming | HW1 Due |
| | 1/30 | Recitation 2 | More examples, MiniTorch | |
| Week 4 | 2/2 | TPU and Acceleration | | |
| | 2/4 | Deep Learning Compilation and JAX | | HW2 Due |
| Week 5 | 2/9 | Transformer [slides] | Attention Is All You Need | |
| | 2/11 | Pre-trained LLMs [slides] | LLaMA, GPT-3, Annotated Transformer | |
| Week 6 | 2/16 | Tokenization and Embedding [slides] | BPE, SentencePiece, VOLT | |
| | 2/18 | Generation and Speculative Decoding [slides] | | |
| | 2/20 | Recitation 3 | Annotated Transformer & Decoding | |
| Week 7 | 2/23 | Accelerating Transformer on GPU, Part 1 [slides] | LightSeq | |
| | 2/25 | Accelerating Transformer on GPU, Part 2 [slides] | LightSeq2 | HW3 Due |
| | 2/27 | Recitation 4 | LightSeq | Project Proposal Due |
| Week 8 | 3/2 | Spring break | | |
| Week 9 | 3/9 | Distributed Model Training [slides] | | |
| | 3/11 | Distributed Model Training II [slides] | DDP | |
| Week 10 | 3/16 | Distributed Model Training III [slides] | GPipe, Megatron-LM | HW4 Due |
| | 3/18 | Large Models with Mixture-of-Experts [slides] | GShard, Switch Transformer, DeepSpeed-MoE, DeepSeek-MoE | |
| | 3/20 | Recitation 5 | Distributed Training | |
| Week 11 | 3/23 | Memory Optimization in Distributed Training [slides] | ZeRO (DeepSpeed) | |
| | 3/25 | Model Quantization [slides] | | HW5 Due |
| Week 12 | 3/30 | Optimizing Attention for Modern Hardware (Tri Dao) [slides] | FlashAttention | |
| | 4/1 | Model Quantization II [slides] | GPTQ | Mid-term Report Due |
| Week 13 | 4/6 | LLM Serving with SGLang [slides] | SGLang | |
| | 4/8 | Efficient LLM Inference with PagedAttention and vLLM (Woosuk Kwon) [slides] | vLLM | HW6 Due |
| Week 14 | 4/13 | Efficient Fine-tuning for Large Models [slides] | CIAT, LoRA, QLoRA | |
| | 4/15 | Efficient Reinforcement Learning System for LLMs (Yi Wu) | ReaLHF | |
| Week 15 | 4/20 | Serving with Disaggregated Prefill-Decoding (Vikram Sharma Mailthody) [slides] | DistServe | HW7 Due |
| | 4/22 | LLM Serving on Heterogeneous Hardware (Mingxing Zhang) | | |
| Week 16 | 4/27 | Final project presentations | | |
| | 4/28 | | | Final Report Due |
| | | Better KV Cache for LLM Serving (Yuhan Liu) [slides] | CacheGen, CacheBlend | |
| | | DistServe: Disaggregated Prefill-Decoding (Hao Zhang) [slides] | DistServe | |
| | | App Stack and Model Serving [slides] | Triton, LightLLM | |
| | | Triton for Kernel Optimization | JAX | |
| | | Retrieval-augmented Language Models | RAG | |
| | | Nearest Vector Search for Embeddings | HNSW | |
| | | Multimodal LLMs | Flamingo | |
| | | Efficient Streaming Language Models with Attention Sinks | Attention Sink | |