Date | Topic | Readings | Deadlines |
--- | --- | --- | --- |
1/13 | Introduction to LLMs [slides] | | HW1 out |
| GPU Programming Basics [slides] | Chap 2,3 of Programming Massively Parallel Processors, 3rd Ed | |
1/22 | Learning Algorithms and Auto Differentiation [slides] | Auto Diff survey | |
1/27 | Deep Learning Frameworks Design Principles [slides] | TensorFlow | |
| Transformer [slides] | Attention is all you need | |
2/3 | Pre-trained LLMs [slides] | LLaMA, GPT3, Annotated Transformer | HW1 due / HW2 out |
| Tokenization and Decoding [slides] | BPE, SentencePiece, Beam search | |
2/10 | GPU Acceleration [slides] | Chap 4,5 of Programming Massively Parallel Processors, 3rd Ed | |
| Accelerating Transformer on GPU Part 1 [slides] | LightSeq | |
2/17 | Accelerating Transformer on GPU Part 2 [slides] | LightSeq2 | |
| Distributed Model Training [slides] | DDP | HW2 due / HW3 out |
2/24 | Distributed Model Training II [slides] | GPipe, Megatron-LM | |
| App Stack and Model Serving [slides] | Triton, LightLLM | Project proposal due |
3/3 | Spring break | | |
3/10 | Model Quantization and Compression | GPTQ | |
| Efficient Fine-tuning for Large Models | LoRA, QLoRA | |
3/17 | Communication Efficient Distributed Training | ZeRO (DeepSpeed) | HW3 due / HW4 out |
| Advanced Large Model Serving | Orca | |
3/24 | PagedAttention | vLLM | |
| GPU Just-in-Time Compilation | JAX | |
3/31 | Large Models with Mixture-of-Experts | DeepSpeed-MoE | HW4 due |
| Memory Optimization for LLMs | FlashAttention | |
4/7 | Long and Longer Context | RMT | Mid-term report due |
| Efficient Streaming Language Models with Attention Sinks | Attention Sink | |
4/14 | Speculative Decoding | Speculative Decoding | |
| Retrieval-augmented Language Models | RAG | |
4/21 | Nearest Vector Search for Embeddings | HNSW | |
| Multimodal LLMs | Flamingo | |
4/28 | Final project presentation | | |
4/29 | | | Final report due |