Syllabus

| Week | Date | Topic | Reading/Content | Homework |
| --- | --- | --- | --- | --- |
| Week 1 | 1/12 | Introduction to LLMs [slides] | | |
| | 1/14 | GPU Programming Basics 1 [slides] | Chap. 2 and 4 of Programming Massively Parallel Processors, 4th Ed. | HW1 Released |
| | 1/16 | Recitation 1 | PSC Guidelines, Simple CUDA Demo | |
| Week 2 | 1/19 | No class | | |
| | 1/21 | GPU Programming Basics 2 [slides] | Chap. 3 of Programming Massively Parallel Processors, 4th Ed. | |
| Week 3 | 1/26 | GPU Acceleration [slides] | Chap. 5 and 6 of Programming Massively Parallel Processors, 4th Ed. | |
| | 1/28 | Deep Learning Frameworks and Auto Differentiation [slides] | TensorFlow, Auto Diff survey, Differentiable Programming | HW1 Due, HW2 Released |
| | 1/30 | Recitation 2 | HW2, MiniTorch, More GPU | |
| Week 4 | 2/2 | Transformer [slides] | Attention Is All You Need | |
| | 2/4 | Pre-trained LLMs [slides] | LLaMA, GPT-3, Annotated Transformer | HW2 Due, HW3 Released |
| | 2/6 | Recitation 3 | Annotated Transformer | |
| Week 5 | 2/9 | Tokenization and Embedding [slides] | BPE, SentencePiece, VOLT | |
| | 2/11 | Generation and Speculative Decoding [slides] | | |
| | 2/13 | Recitation 4 | Decoding | |
| Week 6 | 2/16 | Accelerating Transformer on GPU, Part 1 [slides] | LightSeq | |
| | 2/18 | Accelerating Transformer on GPU, Part 2 [slides] | LightSeq2 | HW3 Due |
| | 2/20 | Recitation 5 | LightSeq | Project Team Due |
| Week 7 | 2/23 | TPU and Acceleration | | |
| | 2/25 | Deep Learning Compilation and JAX | | |
| | 2/27 | | | Project Proposal Due |
| Week 8 | 3/2 | Spring break | | |
| Week 9 | 3/9 | Distributed Model Training [slides] | | |
| | 3/11 | Distributed Model Training II [slides] | DDP | HW4 Due |
| Week 10 | 3/16 | Distributed Model Training III [slides] | GPipe, Megatron-LM | |
| | 3/18 | Large Models with Mixture-of-Experts [slides] | GShard, Switch Transformer, DeepSpeed-MoE, DeepSeek-MoE | |
| | 3/20 | Recitation 6 | Distributed Training | |
| Week 11 | 3/23 | Memory Optimization in Distributed Training [slides] | ZeRO (DeepSpeed) | |
| | 3/25 | Model Quantization [slides] | | HW5 Due |
| Week 12 | 3/30 | Optimizing Attention for Modern Hardware (Tri Dao) [slides] | FlashAttention | |
| | 4/1 | Model Quantization II [slides] | GPTQ | Mid-term Report Due |
| Week 13 | 4/6 | LLM Serving with SGLang [slides] | SGLang | |
| | 4/8 | Efficient LLM Inference with Paged Attention and vLLM (Woosuk Kwon) [slides] | vLLM | HW6 Due |
| Week 14 | 4/13 | Efficient Fine-tuning for Large Models [slides] | CIAT, LoRA, QLoRA | |
| | 4/15 | Efficient Reinforcement Learning Systems for LLMs (Yi Wu) | ReaLHF | |
| Week 15 | 4/20 | Serving with Disaggregated Prefill-Decoding (Vikram Sharma Mailthody) [slides] | DistServe | HW7 Due |
| | 4/22 | LLM Serving on Heterogeneous Hardware (Mingxing Zhang) | | |
| Week 16 | 4/27 | Final project presentation | | |
| | 4/28 | | | Final report due |
Additional topics (no scheduled date):

| Topic | Reading/Content |
| --- | --- |
| Better KV Cache for LLM Serving (Yuhan Liu) [slides] | CacheGen, CacheBlend |
| DistServe: Disaggregated Prefill-Decoding (Hao Zhang) [slides] | DistServe |
| App Stack and Model Serving [slides] | Triton, LightLLM |
| Triton for Kernel Optimization | JAX |
| Retrieval-augmented Language Models | RAG |
| Nearest Vector Search for Embeddings | HNSW |
| Multimodal LLMs | Flamingo |
| Efficient Streaming Language Models with Attention Sinks | Attention Sink |