Syllabus

| Week | Date | Topic | Reading/Content | Homework |
| --- | --- | --- | --- | --- |
| Week 1 | 1/12 | Introduction to LLMs [slides] | | |
| | 1/14 | GPU Programming Basics 1 [slides] | Chap. 2 and 4 of Programming Massively Parallel Processors, 4th Ed. | HW1 Released |
| | 1/16 | Recitation 1 | PSC Guidelines, Simple CUDA Demo | |
| Week 2 | 1/19 | No class | | |
| | 1/21 | GPU Programming Basics 2 [slides] | Chap. 3 of Programming Massively Parallel Processors, 4th Ed. | |
| Week 3 | 1/26 | GPU Acceleration [slides] | Chap. 5 and 6 of Programming Massively Parallel Processors, 4th Ed. | |
| | 1/28 | Deep Learning Frameworks and Auto Differentiation [slides] | TensorFlow, Auto Diff survey, Differentiable Programming | HW1 Due, HW2 Released |
| | 1/30 | Recitation 2 | HW2, MiniTorch, More GPU | |
| Week 4 | 2/2 | Transformer [slides] | Attention Is All You Need | |
| | 2/4 | Pre-trained LLMs [slides] | LLaMA, GPT-3, Annotated Transformer | HW2 Due, HW3 Released |
| | 2/6 | Recitation 3 | Annotated Transformer | |
| Week 5 | 2/9 | Tokenization and Embedding [slides] | BPE, SentencePiece, VOLT | |
| | 2/11 | Generation and Speculative Decoding [slides] | | |
| | 2/13 | Recitation 4 | Decoding | |
| Week 6 | 2/16 | Accelerating Transformer on GPU, Part 1 [slides] | LightSeq | |
| | 2/18 | Accelerating Transformer on GPU, Part 2 [slides] | LightSeq2 | HW3 Due |
| | 2/20 | Recitation 5 | LightSeq | Project Team Due |
| Week 7 | 2/23 | TPU and Acceleration | | |
| | 2/25 | Deep Learning Compilation and JAX | | |
| | 2/27 | | | Project Proposal Due |
| Week 8 | 3/2 | Spring break | | |
| Week 9 | 3/9 | Distributed Model Training [slides] | | |
| | 3/11 | Distributed Model Training II [slides] | DDP | HW4 Due |
| Week 10 | 3/16 | Distributed Model Training III [slides] | GPipe, Megatron-LM | |
| | 3/18 | Large Models with Mixture-of-Experts [slides] | GShard, Switch Transformer, DeepSpeed-MoE, DeepSeek-MoE | |
| | 3/20 | Recitation 6 | Distributed Training | |
| Week 11 | 3/23 | Memory Optimization in Distributed Training [slides] | ZeRO (DeepSpeed) | |
| | 3/25 | Model Quantization [slides] | | HW5 Due |
| Week 12 | 3/30 | Optimizing Attention for Modern Hardware (Tri Dao) [slides] | FlashAttention | |
| | 4/1 | Model Quantization II [slides] | GPTQ | Mid-term Report Due |
| Week 13 | 4/6 | LLM Serving with SGLang [slides] | SGLang | |
| | 4/8 | Efficient LLM Inference with Paged Attention and vLLM (Woosuk Kwon) [slides] | vLLM | HW6 Due |
| Week 14 | 4/13 | Efficient Fine-tuning for Large Models [slides] | CIAT, LoRA, QLoRA | |
| | 4/15 | Efficient Reinforcement Learning Systems for LLMs (Yi Wu) | ReaLHF | |
| Week 15 | 4/20 | Serving with Disaggregated Prefill-Decoding (Vikram Sharma Mailthody) [slides] | DistServe | HW7 Due |
| | 4/22 | LLM Serving on Heterogeneous Hardware (Mingxing Zhang) | | |
| Week 16 | 4/27 | Final project presentation | | |
| | 4/28 | | | Final report due |
Additional topics (no scheduled date):

| Topic | Reading/Content |
| --- | --- |
| Better KV Cache for LLM Serving (Yuhan Liu) [slides] | CacheGen, CacheBlend |
| DistServe: Disaggregated Prefill-Decoding (Hao Zhang) [slides] | DistServe |
| App Stack and Model Serving [slides] | Triton, LightLLM |
| Triton for Kernel Optimization | JAX |
| Retrieval-augmented Language Models | RAG |
| Nearest Vector Search for Embeddings | HNSW |
| Multimodal LLMs | Flamingo |
| Efficient Streaming Language Models with Attention Sinks | Attention Sink |