Syllabus

Week	Dates	Topic	Reading/Content	Homework
Week 1	8/24	Introduction to LLM[slides]
	8/26	GPU Programming Basics 1[slides]	Chap 2,4 of Programming Massively Parallel Processors, 4th Ed	HW1 Released
	8/28	Recitation 1	PSC Guidelines, Simple CUDA Demo
Week 2	8/31	GPU Programming Basics 2[slides]	Chap 3 of Programming Massively Parallel Processors, 4th Ed
	9/2	GPU Acceleration[slides]	Chap 5,6 of Programming Massively Parallel Processors, 4th Ed
Week 3	9/7	no class
	9/9	Deep Learning Frameworks and Auto Differentiation [slides]	TensorFlow, Auto Diff survey, Differentiable Programming	HW1 Due, HW2 Released
	9/11	Recitation 2	HW2, MiniTorch, More GPU
Week 4	9/14	Transformer[slides]	Attention is all you need
	9/16	Pre-trained LLMs[slides]	LLaMA, GPT-3, Annotated Transformer	HW2 Due, HW3 Released
	9/18	Recitation 3	Annotated Transformer
Week 5	9/21	Tokenization and Embedding [slides]	BPE, SentencePiece, VOLT
	9/23	Generation and Speculative Decoding [slides]
	9/25	Recitation 4	Decoding
Week 6	9/28	Accelerating Transformer on GPU Part 1[slides]	LightSeq
	9/30	Accelerating Transformer on GPU Part 2[slides]	LightSeq2	HW3 Due
	10/2	Recitation 5	LightSeq	Project Team Due
Week 7	10/5	Guest Lecture by Srinath Mandalapu (Google): TPU and JAX [slides]
	10/7	Guest Lecture by Srinath Mandalapu (Google): Pallas and Splash Attention [slides]
	10/9			Project Proposal Due
Week 8	10/12	fall break
Week 9	10/19	Distributed Model Training[slides]
	10/21	Distributed Model Training II[slides]	DDP	HW4 Due
Week 10	10/26	Distributed Model Training III[slides]	GPipe, Megatron-LM
	10/28	Large models with Mixture-of-Experts [slides]	GShard, Switch Transformer, DeepSpeed-MoE, DeepSeek-MoE
	10/30	Recitation 6	Distributed Training
Week 11	11/2	Memory Optimization in Distributed Training [slides]	ZeRO (DeepSpeed)
	11/4	Model Quantization [slides]		HW5 Due
Week 12	11/9	Model Quantization II [slides]	GPTQ
	11/11	Optimizing Attention for Modern Hardware (Tri Dao) [slides]	FlashAttention, FlashAttention2, FlashAttention3, FlashAttention4	Mid-term Report Due
Week 13	11/16	LLM Serving with SGL [slides]	ORCA, SGLang
	11/18	Efficient Fine-tuning for Large Models [slides]	CIAT, LoRA, QLoRA
Week 14	11/23	Efficient LLM Inference with Paged Attention and vLLM (Woosuk Kwon) [slides]	vLLM	HW6 Due
	11/25	no class
Week 15	11/30	Efficient Reinforcement Learning System for LLMs	ReaLHF
	12/2	Serving with Disaggregated Prefill-Decoding [slides]	DistServe	HW7 Due
Week 16	12/11	Final project presentation
	12/7			Final report due
		Better KV Cache for LLM Serving (Junchen Jiang) [slides]	CacheGen, CacheBlend
		DistServe: Disaggregated Prefill-Decoding (Hao Zhang) [slides]	DistServe
		App Stack and Model Serving[slides]	Triton, LightLLM
		Triton for Kernel Optimization	JAX
		Retrieval-augmented Language Models	RAG
		Nearest Vector Search for Embeddings	HNSW
		Multimodal LLMs	Flamingo
		Efficient Streaming Language Models with Attention Sinks	Attention Sink
		LLM Serving on Heterogeneous Hardware [slides]	Mooncake, KTransformers