Date | Topic | Readings | Notes
--- | --- | --- | ---
1/13 | Introduction to LLM [slides] | | |
| GPU Programming Basics 1 [slides] | Chap 2, 4 of Programming Massively Parallel Processors, 4th Ed | |
1/22 | GPU Programming Basics 2 [slides] | Chap 3 of Programming Massively Parallel Processors, 4th Ed | |
1/27 | Learning Algorithms and Auto Differentiation [slides] | Auto Diff survey, Differentiable Programming | |
| Deep Learning Framework Design [slides] | TensorFlow | |
2/3 | Transformer [slides] | Attention Is All You Need | |
| Pre-trained LLMs [slides] | LLaMA, GPT-3, Annotated Transformer | HW1 due |
2/10 | Tokenization [slides] | BPE, Sentence-Piece, VOLT | |
| LLM Decoding [slides] | Beam search | |
2/17 | GPU Acceleration [slides] | Chap 5, 6 of Programming Massively Parallel Processors, 4th Ed | |
| Accelerating Transformer on GPU Part 1 [slides] | LightSeq | |
2/24 | Accelerating Transformer on GPU Part 2 [slides] | LightSeq2 | HW2 due |
| Distributed Model Training [slides] | | Project proposal due |
3/3 | Spring break | | |
3/10 | Distributed Model Training II [slides] | DDP | |
| Distributed Model Training III [slides] | GPipe, Megatron-LM | |
3/17 | Model Quantization [slides] | | HW3 due |
| Model Quantization II [slides] | GPTQ | |
3/24 | Efficient Fine-tuning for Large Models [slides] | CIAT, LoRA, QLoRA | |
| Large Models with Mixture-of-Experts [slides] | GShard, Switch Transformer, DeepSpeed-MoE, DeepSeek-MoE | |
3/31 | Optimizing Attention for Modern Hardware (Tri Dao) [slides] | FlashAttention | |
| Communication Efficient Distributed Training [slides] | ZeRO (DeepSpeed) | HW4 due |
4/7 | LLM Serving with PagedAttention (Woosuk Kwon) [slides] | vLLM | |
| Better KV Cache for LLM Serving (Yuhan Liu) [slides] | | Mid-term report due |
4/14 | DistServe: Disaggregated Prefill-Decoding (Hao Zhang) [slides] | DistServe | HW5 due |
| LLM Serving with SGLang (Ying Sheng) | | |
4/21 | Scalable LLM RL Training | | |
4/23 | Final project presentation | | |
4/28 | | | Final report due |
| App Stack and Model Serving [slides] | Triton, LightLLM | |
| GPU just-in-time compilation | JAX | |
| Speculative Decoding | Speculative Decoding | |
| Retrieval-augmented Language Models | RAG | |
| Nearest Vector Search for Embeddings | HNSW | |
| Multimodal LLMs | Flamingo | |
| DeepSeek V3 and R1 | | |
| Efficient Streaming Language Models with Attention Sinks | Attention Sink | |
| Advanced Large Model Serving | Orca | |
| Dynamo | | |