1. intro: Let’s reproduce GPT-2 124M
2. exploring the GPT-2 124M OpenAI checkpoint
3. SECTION 1: implementing the GPT-2 nn.Module
4. loading the huggingface/GPT-2 parameters (code sketch below)
5. implementing the forward pass to get logits
6. sampling init, prefix tokens, tokenization
7. sampling loop (code sketch below)
8. sample, auto-detect the device (code sketch below)
9. let’s train: data batches B,T → logits B,T,C
10. cross entropy loss (code sketch below)
11. optimization loop: overfit a single batch
12. data loader lite (code sketch below)
13. parameter sharing wte and lm_head
14. model initialization: std 0.02, residual init (code sketch below)
15. SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
16. Tensor Cores, timing the code, TF32 precision, 333ms
17. float16, gradient scalers, bfloat16, 300ms
18. torch.compile, Python overhead, kernel fusion, 130ms (code sketch below)
19. flash attention, 96ms (code sketch below)
20. nice/ugly numbers. vocab size 50257 → 50304, 93ms
21. SECTION 3: hyperparameters, AdamW, gradient clipping
22. learning rate scheduler: warmup + cosine decay (code sketch below)
23. batch size schedule, weight decay, FusedAdamW, 90ms (code sketch below)
24. gradient accumulation (code sketch below)
25. distributed data parallel DDP (code sketch below)
26. datasets used in GPT-2, GPT-3, FineWeb EDU
27. validation data split, validation loss, sampling revive
28. evaluation: HellaSwag, starting the run
29. SECTION 4: results in the morning! GPT-2, GPT-3 repro
30. shoutout to llm.c, equivalent but faster code in raw C/CUDA
31. summary, phew, build-nanogpt github repo
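The chapters above name concrete techniques, so a few minimal, self-contained PyTorch sketches follow. They use placeholder models and data and illustrate the ideas rather than reproduce the video's actual code.

Chapters 2 and 4 inspect and load the OpenAI GPT-2 124M checkpoint via Hugging Face. A minimal sketch, assuming the transformers package is installed and the weights can be downloaded:

from transformers import GPT2LMHeadModel

model_hf = GPT2LMHeadModel.from_pretrained("gpt2")   # "gpt2" is the 124M checkpoint
sd_hf = model_hf.state_dict()
for k, v in list(sd_hf.items())[:5]:
    print(k, tuple(v.shape))                          # e.g. transformer.wte.weight (50257, 768)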
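Chapter 7's sampling loop repeatedly takes the logits at the last position and samples from the top-k probabilities. A sketch with k=50; the tiny random model is only a stand-in for the GPT nn.Module so the snippet runs on its own:

import torch
import torch.nn.functional as F

torch.manual_seed(42)
vocab_size = 50257
# tiny random stand-in that maps token ids (B, T) to logits (B, T, vocab_size)
model = torch.nn.Sequential(torch.nn.Embedding(vocab_size, 64),
                            torch.nn.Linear(64, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 8))         # (B=1, T=8) prefix tokens
for _ in range(16):                                   # generate 16 new tokens
    with torch.no_grad():
        logits = model(tokens)                        # (B, T, vocab_size)
    probs = F.softmax(logits[:, -1, :], dim=-1)       # distribution at the last position
    topk_probs, topk_idx = torch.topk(probs, 50, dim=-1)
    ix = torch.multinomial(topk_probs, 1)             # sample within the top 50
    next_tok = torch.gather(topk_idx, -1, ix)         # map back to vocabulary ids
    tokens = torch.cat((tokens, next_tok), dim=1)     # append and continue
print(tokens.shape)                                   # torch.Size([1, 24])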
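Chapter 8 auto-detects the device: prefer CUDA, fall back to Apple's MPS backend, otherwise use the CPU. Roughly:

import torch

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
print(f"using device: {device}")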
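Chapters 9-10: a batch of token ids of shape (B, T) produces logits of shape (B, T, C), and the loss is cross entropy against the targets shifted by one position. A shape-only sketch with random data standing in for both the model and the dataset:

import torch
import torch.nn.functional as F

B, T, C = 4, 32, 50257                       # batch, sequence length, vocab size
buf = torch.randint(0, C, (B * T + 1,))      # stand-in for a chunk of tokenized text
x = buf[:-1].view(B, T)                      # inputs  (B, T)
y = buf[1:].view(B, T)                       # targets (B, T), shifted by one token

logits = torch.randn(B, T, C)                # stand-in for model(x)
# cross_entropy wants (N, C) and (N,), so flatten the batch and time dimensions
loss = F.cross_entropy(logits.view(-1, C), y.view(-1))
print(loss.item())                           # ≈ ln(50257) ≈ 10.8 for random logits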
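Chapter 12's "data loader lite" keeps the tokenized data in one long tensor and slices out consecutive (B, T) batches, wrapping around at the end. A rough reconstruction, with random tokens in place of the real dataset:

import torch

class DataLoaderLite:
    def __init__(self, tokens, B, T):
        self.tokens = tokens                 # one long 1-D tensor of token ids
        self.B, self.T = B, T
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)              # inputs
        y = buf[1:].view(B, T)               # targets, shifted by one
        self.pos += B * T
        if self.pos + B * T + 1 > len(self.tokens):
            self.pos = 0                     # wrap around and start over
        return x, y

loader = DataLoaderLite(torch.randint(0, 50257, (10_000,)), B=4, T=32)
x, y = loader.next_batch()
print(x.shape, y.shape)                      # torch.Size([4, 32]) twice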
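Chapters 13-14: the token embedding (wte) and the output head (lm_head) share one weight matrix, linear layers are initialized with std 0.02, and residual-stream projections are scaled down by 1/sqrt(2·n_layer). A skeleton sketch; the SCALE_INIT attribute is just a marker used here, and the tiny module layout is illustrative, not the full GPT:

import torch.nn as nn

n_layer, n_embd, vocab_size = 12, 768, 50257     # GPT-2 124M-like sizes

class TinyGPTSkeleton(nn.Module):
    def __init__(self):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.c_proj = nn.Linear(n_embd, n_embd)  # stands in for a residual projection
        self.c_proj.SCALE_INIT = True            # mark it for the scaled init below
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight    # weight sharing: wte <-> lm_head
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            if getattr(module, "SCALE_INIT", False):
                std *= (2 * n_layer) ** -0.5     # shrink residual-stream projections
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

m = TinyGPTSkeleton()
print(m.lm_head.weight is m.wte.weight)          # True: one shared tensor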
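Chapters 15-18 compress into: allow TF32 matmuls, run the forward pass under bfloat16 autocast, and wrap the model in torch.compile. A toy sketch assuming PyTorch 2.x; it prefers a CUDA GPU but falls back to CPU:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_float32_matmul_precision("high")       # allow TF32 on Tensor Cores

model = torch.nn.Sequential(torch.nn.Linear(768, 3072),
                            torch.nn.GELU(),
                            torch.nn.Linear(3072, 768)).to(device)
model = torch.compile(model)                     # kernel fusion, less Python overhead
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(16, 768, device=device)
opt.zero_grad()
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()                # forward in mixed precision
loss.backward()                                  # fp32 grads; no GradScaler needed for bf16
opt.step()
print(loss.item())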
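Chapter 19 swaps the manual attention math for F.scaled_dot_product_attention, which dispatches to FlashAttention-style fused kernels when they are available. A shape-only sketch with arbitrary small dimensions:

import torch
import torch.nn.functional as F

B, n_head, T, head_dim = 2, 12, 64, 64
q = torch.randn(B, n_head, T, head_dim)
k = torch.randn(B, n_head, T, head_dim)
v = torch.randn(B, n_head, T, head_dim)

# fused, memory-efficient attention with a causal mask
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)                                   # torch.Size([2, 12, 64, 64])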
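Chapter 22's schedule is a linear warmup followed by a cosine decay down to a minimum learning rate. A sketch with illustrative numbers, not the exact values of the run:

import math

max_lr, min_lr = 6e-4, 6e-5
warmup_steps, max_steps = 10, 50

def get_lr(step):
    if step < warmup_steps:                      # 1) linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:                         # 2) after decay, hold the floor
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))   # 3) cosine from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

print([round(get_lr(s), 6) for s in (0, 5, 10, 30, 50)])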
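Chapters 21 and 23 cover the optimizer details: AdamW with betas (0.9, 0.95), weight decay applied only to 2-D parameters (matrices, not biases or norm gains), the fused AdamW kernel when on CUDA, and clipping the global gradient norm at 1.0. A sketch on a throwaway model:

import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.LayerNorm(64))
params = [p for p in model.parameters() if p.requires_grad]
decay = [p for p in params if p.dim() >= 2]      # weight matrices: decayed
no_decay = [p for p in params if p.dim() < 2]    # biases, norm gains: not decayed
groups = [{"params": decay, "weight_decay": 0.1},
          {"params": no_decay, "weight_decay": 0.0}]
opt = torch.optim.AdamW(groups, lr=6e-4, betas=(0.9, 0.95), eps=1e-8,
                        fused=torch.cuda.is_available())

loss = model(torch.randn(8, 64)).pow(2).mean()
loss.backward()
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip global grad norm
opt.step()
print(f"grad norm before clipping: {norm:.3f}")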
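Chapter 24's gradient accumulation simulates a large batch by summing gradients over several micro-batches, dividing the loss by the number of accumulation steps so the result matches one big averaged batch. A sketch with a throwaway model:

import torch
import torch.nn.functional as F

model = torch.nn.Linear(32, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
grad_accum_steps = 4

opt.zero_grad()
for micro_step in range(grad_accum_steps):
    x = torch.randn(8, 32)                       # one micro-batch
    y = torch.randn(8, 1)
    loss = F.mse_loss(model(x), y)
    loss = loss / grad_accum_steps               # average, don't sum, across micro-batches
    loss.backward()                              # gradients accumulate in .grad
opt.step()                                       # one optimizer step per "big" batch
print("stepped after", grad_accum_steps, "micro-batches")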
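Chapter 25 wraps the model in DistributedDataParallel so gradients are averaged across processes after backward. A compressed sketch meant to be launched with torchrun (e.g. torchrun --standalone --nproc_per_node=N script.py), which sets RANK, LOCAL_RANK and WORLD_SIZE; it uses NCCL on GPUs and gloo on CPU:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)
ddp_rank = int(os.environ["RANK"])
ddp_local_rank = int(os.environ["LOCAL_RANK"])
device = f"cuda:{ddp_local_rank}" if torch.cuda.is_available() else "cpu"
if device.startswith("cuda"):
    torch.cuda.set_device(device)

model = torch.nn.Linear(32, 1).to(device)
model = DDP(model, device_ids=[ddp_local_rank] if device.startswith("cuda") else None)

loss = model(torch.randn(8, 32, device=device)).pow(2).mean()
loss.backward()                                  # DDP averages gradients across ranks here
if ddp_rank == 0:
    print("loss on rank 0:", loss.item())
dist.destroy_process_group()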
Description:
Embark on a comprehensive 4-hour journey to reproduce GPT-2 (124M) from scratch in this in-depth video tutorial. Explore the entire process, from building the GPT-2 network to optimizing its training for maximum efficiency. Follow along as the instructor sets up the training run according to GPT-2 and GPT-3 paper specifications, initiates the process, and analyzes the results. Gain insights into model architecture, parameter loading, forward pass implementation, sampling techniques, and data handling. Dive into advanced topics such as mixed precision training, GPU optimization, gradient accumulation, and distributed data parallel processing. Learn about hyperparameter tuning, learning rate scheduling, and evaluation methods. By the end, you'll have a thorough understanding of building and training a GPT-2 model, with practical knowledge applicable to larger language models.

Reproducing GPT-2 (124M) from Scratch - Implementation and Optimization

Andrej Karpathy