Syllabus

| Week | Date | Topic | Reading/Content | Homework |
|---|---|---|---|---|
| Week 1 | 8/24 | Introduction to LLM [slides] | | |
| | 8/26 | GPU Programming Basics 1 [slides] | Chap 2, 4 of Programming Massively Parallel Processors, 4th Ed | HW1 Released |
| | 8/28 | Recitation 1 | PSC Guidelines, Simple CUDA Demo | |
| Week 2 | 8/31 | GPU Programming Basics 2 [slides] | Chap 3 of Programming Massively Parallel Processors, 4th Ed | |
| | 9/2 | GPU Acceleration [slides] | Chap 5, 6 of Programming Massively Parallel Processors, 4th Ed | |
| Week 3 | 9/7 | No class | | |
| | 9/9 | Deep Learning Frameworks and Auto Differentiation [slides] | TensorFlow, Auto Diff survey, Differentiable Programming | HW1 Due, HW2 Released |
| | 9/11 | Recitation 2 | HW2, MiniTorch, More GPU | |
| Week 4 | 9/14 | Transformer [slides] | Attention Is All You Need | |
| | 9/16 | Pre-trained LLMs [slides] | LLaMA, GPT-3, Annotated Transformer | HW2 Due, HW3 Released |
| | 9/18 | Recitation 3 | Annotated Transformer | |
| Week 5 | 9/21 | Tokenization and Embedding [slides] | BPE, SentencePiece, VOLT | |
| | 9/23 | Generation and Speculative Decoding [slides] | | |
| | 9/25 | Recitation 4 | Decoding | |
| Week 6 | 9/28 | Accelerating Transformer on GPU Part 1 [slides] | LightSeq | |
| | 9/30 | Accelerating Transformer on GPU Part 2 [slides] | LightSeq2 | HW3 Due |
| | 10/2 | Recitation 5 | LightSeq | Project Team Due |
| Week 7 | 10/5 | Guest Lecture by Srinath Mandalapu (Google): TPU and JAX [slides] | | |
| | 10/7 | Guest Lecture by Srinath Mandalapu (Google): Pallas and Splash Attention [slides] | | |
| | 10/9 | | | Project Proposal Due |
| Week 8 | 10/12 | Fall break | | |
| Week 9 | 10/19 | Distributed Model Training [slides] | | |
| | 10/21 | Distributed Model Training II [slides] | DDP | HW4 Due |
| Week 10 | 10/26 | Distributed Model Training III [slides] | GPipe, Megatron-LM | |
| | 10/28 | Large Models with Mixture-of-Experts [slides] | GShard, Switch Transformer, DeepSpeed-MoE, DeepSeekMoE | |
| | 10/30 | Recitation 6 | Distributed Training | |
| Week 11 | 11/2 | Memory Optimization in Distributed Training [slides] | ZeRO (DeepSpeed) | |
| | 11/4 | Model Quantization [slides] | | HW5 Due |
| Week 12 | 11/9 | Model Quantization II [slides] | GPTQ | |
| | 11/11 | Optimizing Attention for Modern Hardware (Tri Dao) [slides] | FlashAttention, FlashAttention2, FlashAttention3, FlashAttention4 | Mid-term Report Due |
| Week 13 | 11/16 | LLM Serving with SGL [slides] | ORCA, SGLang | |
| | 11/18 | Efficient Fine-tuning for Large Models [slides] | CIAT, LoRA, QLoRA | |
| Week 14 | 11/23 | Efficient LLM Inference with Paged Attention and vLLM (Woosuk Kwon) [slides] | vLLM | HW6 Due |
| | 11/25 | No class | | |
| Week 15 | 11/30 | Efficient Reinforcement Learning System for LLMs | ReaLHF | |
| | 12/2 | Serving with Disaggregated Prefill-Decoding [slides] | DistServe | HW7 Due |
| Week 16 | 12/11 | Final project presentation | | |
| | 12/7 | | | Final report due |
| | | Better KV Cache for LLM Serving (Junchen Jiang) [slides] | CacheGen, CacheBlend | |
| | | DistServe: Disaggregated Prefill-Decoding (Hao Zhang) [slides] | DistServe | |
| | | App Stack and Model Serving [slides] | Triton, LightLLM | |
| | | Triton for Kernel Optimization | JAX | |
| | | Retrieval-augmented Language Models | RAG | |
| | | Nearest Vector Search for Embeddings | HNSW | |
| | | Multimodal LLMs | Flamingo | |
| | | Efficient Streaming Language Models with Attention Sinks | Attention Sink | |
| | | LLM Serving on Heterogeneous Hardware [slides] | Mooncake, KTransformers | |