Date | Topic | Readings | Deliverables |
1/17 | Introduction to LLM [slides] | | HW1 out |
1/22 | GPU Programming Basics [slides] | Chapters 2-3 of Programming Massively Parallel Processors, 3rd ed. | |
| Learning Algorithms and Auto Differentiation [slides] | Auto Diff survey | |
1/29 | Deep Learning Frameworks Design Principles [slides] | TensorFlow | |
| Transformer [slides] | Attention is all you need | |
2/5 | Pre-trained LLMs [slides] | LLaMA, GPT-3, Annotated Transformer | HW1 due / HW2 out |
| Tokenization and Decoding [slides] | BPE, SentencePiece, Beam search | |
2/12 | GPU Acceleration [slides] | Chapters 4-5 of Programming Massively Parallel Processors, 3rd ed. | |
| Accelerating Transformer on GPU Part 1 [slides] | LightSeq | |
2/19 | Accelerating Transformer on GPU Part 2 [slides] | LightSeq2 | |
| Distributed Model Training [slides] | DDP | HW2 due / HW3 out |
2/26 | Distributed Model Training II [slides] | GPipe, Megatron-LM | |
| App Stack and Model Serving [slides] | Triton, LightLLM | Project proposal due |
3/4 | Spring break | | |
3/11 | Model Quantization and Compression [slides] | GPTQ | |
| Efficient Fine-tuning for Large Models [slides] | LoRA, QLoRA | |
3/18 | Communication Efficient Distributed Training [slides] | ZeRO (DeepSpeed) | HW3 due / HW4 out |
| Advanced Large Model Serving [slides] | Orca | |
3/25 | PagedAttention [slides] | vLLM | |
| GPU Just-in-Time Compilation [slides] | JAX | |
4/1 | Large Models with Mixture-of-Experts [slides] | DeepSpeed-MoE | HW4 due |
| Memory Optimization for LLMs [slides] | FlashAttention | |
4/8 | Long and Longer Context [slides] | RMT | Mid-term report due |
| Efficient Streaming Language Models with Attention Sinks [slides] | Attention Sink | |
4/15 | Speculative Decoding [slides] | Speculative Decoding | |
| Retrieval-augmented Language Models [slides] | RAG | |
4/22 | Nearest Vector Search for Embeddings [slides] | HNSW | |
| Multimodal LLMs [slides] | Flamingo | |
4/29 | Final project presentations | | |
4/30 | | | Final report due |