Date | Topic | Readings | Deadlines |
--- | --- | --- | --- |
1/13 | Introduction to LLMs [slides] | | |
| GPU Programming Basics 1 [slides] | Chap. 2, 4 of Programming Massively Parallel Processors, 4th Ed. | |
1/22 | GPU Programming Basics 2 [slides] | Chap. 3 of Programming Massively Parallel Processors, 4th Ed. | |
1/27 | Learning Algorithms and Automatic Differentiation [slides] | Auto Diff Survey, Differentiable Programming | |
| Deep Learning Framework Design [slides] | TensorFlow | |
2/3 | The Transformer [slides] | Attention Is All You Need | |
| Pre-trained LLMs [slides] | LLaMA, GPT-3, The Annotated Transformer | HW1 due |
2/10 | Tokenization [slides] | BPE, SentencePiece, VOLT | |
| LLM Decoding [slides] | Beam search | |
2/17 | GPU Acceleration [slides] | Chap. 5, 6 of Programming Massively Parallel Processors, 4th Ed. | |
| Accelerating Transformers on GPU, Part 1 [slides] | LightSeq | |
2/24 | Accelerating Transformers on GPU, Part 2 [slides] | LightSeq2 | HW2 due |
| Distributed Model Training [slides] | | Project proposal due |
3/3 | Spring break | | |
3/10 | Distributed Model Training II [slides] | DDP | |
| Distributed Model Training III [slides] | GPipe, Megatron-LM | |
3/17 | Model Quantization and Compression | GPTQ | HW3 due |
| Efficient Fine-tuning for Large Models | LoRA, QLoRA | |
3/24 | Communication-Efficient Distributed Training | ZeRO (DeepSpeed) | |
| Advanced Large Model Serving | Orca | |
3/31 | PagedAttention | vLLM | HW4 due |
| GPU Just-in-Time Compilation | JAX | |
4/7 | Large Models with Mixture-of-Experts | DeepSpeed-MoE | Mid-term report due |
| Memory Optimization for LLMs | FlashAttention | |
4/14 | Long and Longer Context | RMT | |
| Efficient Streaming Language Models with Attention Sinks | Attention Sink | |
4/21 | Speculative Decoding | Speculative Decoding | |
| Retrieval-augmented Language Models | RAG | |
4/25 | Final project presentation | | |
4/26 | | | Final report due |
| App Stack and Model Serving [slides] | Triton, LightLLM | |
| Nearest-Neighbor Vector Search for Embeddings | HNSW | |
| Multimodal LLMs | Flamingo | |
| DeepSeek V3 and R1 | | |
| RL Training for LLMs | | |