Date | Topic | Readings | Deadlines
--- | --- | --- | ---
1/13 | Introduction to LLMs [slides] | | |
| GPU Programming Basics 1 [slides] | Chap 2,4 of Programming Massively Parallel Processors, 4th Ed | |
1/22 | GPU Programming Basics 2 [slides] | Chap 3 of Programming Massively Parallel Processors, 4th Ed | |
1/27 | Learning Algorithms and Automatic Differentiation [slides] | Auto Diff survey | |
| Deep Learning Frameworks Design [slides] | TensorFlow | |
2/3 | Transformer [slides] | Attention is all you need | HW1 due |
| Pre-trained LLMs [slides] | LLaMA, GPT3, Annotated Transformer | |
2/10 | Tokenization and Decoding [slides] | BPE, SentencePiece, Beam search | |
| GPU Acceleration [slides] | Chap 5,6 of Programming Massively Parallel Processors, 4th Ed | |
| Accelerating Transformer on GPU Part 1 [slides] | LightSeq | |
2/17 | Accelerating Transformer on GPU Part 2 [slides] | LightSeq2 | |
| Distributed Model Training [slides] | DDP | HW2 due |
2/24 | Distributed Model Training II [slides] | GPipe, Megatron-LM | |
| App Stack and Model Serving [slides] | Triton, LightLLM | Project proposal due |
3/3 | Spring break | | |
3/10 | Model Quantization and Compression | GPTQ | |
| Efficient Fine-tuning for Large Models | LoRA, QLoRA | |
3/17 | Communication Efficient Distributed Training | ZeRO (DeepSpeed) | HW3 due |
| Advanced Large Model Serving | Orca | |
3/24 | PagedAttention | vLLM | |
| GPU Just-in-Time Compilation | JAX | |
3/31 | Large Models with Mixture-of-Experts | DeepSpeed-MoE | HW4 due |
| Memory Optimization for LLMs | FlashAttention | |
4/7 | Long and Longer Context | RMT | Mid-term report due |
| Efficient Streaming Language Models with Attention Sinks | Attention Sink | |
4/14 | Speculative Decoding | Speculative Decoding | |
| Retrieval-augmented Language Models | RAG | |
4/21 | Nearest Vector Search for Embeddings | HNSW | |
| Multimodal LLMs | Flamingo | |
4/28 | Final project presentation | | |
4/29 | | | Final report due |