1. intro: Let’s reproduce GPT-2 124M
2. exploring the GPT-2 124M OpenAI checkpoint
3. SECTION 1: implementing the GPT-2 nn.Module
4. loading the huggingface/GPT-2 parameters (code sketch below)
5. implementing the forward pass to get logits
6. sampling init, prefix tokens, tokenization
7. sampling loop (code sketch below)
8. sample, auto-detect the device (code sketch below)
9. let’s train: data batches B,T → logits B,T,C
10. cross entropy loss (code sketch below)
11. optimization loop: overfit a single batch
12. data loader lite (code sketch below)
13. parameter sharing wte and lm_head
14. model initialization: std 0.02, residual init (code sketch below)
15. SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
16. Tensor Cores, timing the code, TF32 precision, 333ms
17. float16, gradient scalers, bfloat16, 300ms
18. torch.compile, Python overhead, kernel fusion, 130ms (code sketch below)
19. flash attention, 96ms (code sketch below)
20. nice/ugly numbers. vocab size 50257 → 50304, 93ms
21. SECTION 3: hyperparameters, AdamW, gradient clipping
22. learning rate scheduler: warmup + cosine decay (code sketch below)
23. batch size schedule, weight decay, FusedAdamW, 90ms (code sketch below)
24. gradient accumulation (code sketch below)
25. distributed data parallel DDP (code sketch below)
26. datasets used in GPT-2, GPT-3, FineWeb EDU
27. validation data split, validation loss, sampling revive
28. evaluation: HellaSwag, starting the run
29. SECTION 4: results in the morning! GPT-2, GPT-3 repro
30. shoutout to llm.c, equivalent but faster code in raw C/CUDA
31. summary, phew, build-nanogpt github repo
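The chapters above name concrete techniques, so a few minimal, self-contained PyTorch sketches follow. They use placeholder models and data and illustrate the ideas rather than reproduce the video's actual code.

Chapters 2 and 4 inspect and load the OpenAI GPT-2 124M checkpoint via Hugging Face. A minimal sketch, assuming the transformers package is installed and the weights can be downloaded:

from transformers import GPT2LMHeadModel

model_hf = GPT2LMHeadModel.from_pretrained("gpt2")   # "gpt2" is the 124M checkpoint
sd_hf = model_hf.state_dict()
for k, v in list(sd_hf.items())[:5]:
    print(k, tuple(v.shape))                          # e.g. transformer.wte.weight (50257, 768)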
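Chapter 7's sampling loop repeatedly takes the logits at the last position and samples from the top-k probabilities. A sketch with k=50; the tiny random model is only a stand-in for the GPT nn.Module so the snippet runs on its own:

import torch
import torch.nn.functional as F

torch.manual_seed(42)
vocab_size = 50257
# tiny random stand-in that maps token ids (B, T) to logits (B, T, vocab_size)
model = torch.nn.Sequential(torch.nn.Embedding(vocab_size, 64),
                            torch.nn.Linear(64, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 8))         # (B=1, T=8) prefix tokens
for _ in range(16):                                   # generate 16 new tokens
    with torch.no_grad():
        logits = model(tokens)                        # (B, T, vocab_size)
    probs = F.softmax(logits[:, -1, :], dim=-1)       # distribution at the last position
    topk_probs, topk_idx = torch.topk(probs, 50, dim=-1)
    ix = torch.multinomial(topk_probs, 1)             # sample within the top 50
    next_tok = torch.gather(topk_idx, -1, ix)         # map back to vocabulary ids
    tokens = torch.cat((tokens, next_tok), dim=1)     # append and continue
print(tokens.shape)                                   # torch.Size([1, 24])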
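Chapter 8 auto-detects the device: prefer CUDA, fall back to Apple's MPS backend, otherwise use the CPU. Roughly:

import torch

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
print(f"using device: {device}")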
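Chapters 9-10: a batch of token ids of shape (B, T) produces logits of shape (B, T, C), and the loss is cross entropy against the targets shifted by one position. A shape-only sketch with random data standing in for both the model and the dataset:

import torch
import torch.nn.functional as F

B, T, C = 4, 32, 50257                       # batch, sequence length, vocab size
buf = torch.randint(0, C, (B * T + 1,))      # stand-in for a chunk of tokenized text
x = buf[:-1].view(B, T)                      # inputs  (B, T)
y = buf[1:].view(B, T)                       # targets (B, T), shifted by one token

logits = torch.randn(B, T, C)                # stand-in for model(x)
# cross_entropy wants (N, C) and (N,), so flatten the batch and time dimensions
loss = F.cross_entropy(logits.view(-1, C), y.view(-1))
print(loss.item())                           # ≈ ln(50257) ≈ 10.8 for random logits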
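Chapter 12's "data loader lite" keeps the tokenized data in one long tensor and slices out consecutive (B, T) batches, wrapping around at the end. A rough reconstruction, with random tokens in place of the real dataset:

import torch

class DataLoaderLite:
    def __init__(self, tokens, B, T):
        self.tokens = tokens                 # one long 1-D tensor of token ids
        self.B, self.T = B, T
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)              # inputs
        y = buf[1:].view(B, T)               # targets, shifted by one
        self.pos += B * T
        if self.pos + B * T + 1 > len(self.tokens):
            self.pos = 0                     # wrap around and start over
        return x, y

loader = DataLoaderLite(torch.randint(0, 50257, (10_000,)), B=4, T=32)
x, y = loader.next_batch()
print(x.shape, y.shape)                      # torch.Size([4, 32]) twice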
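Chapters 13-14: the token embedding (wte) and the output head (lm_head) share one weight matrix, linear layers are initialized with std 0.02, and residual-stream projections are scaled down by 1/sqrt(2·n_layer). A skeleton sketch; the SCALE_INIT attribute is just a marker used here, and the tiny module layout is illustrative, not the full GPT:

import torch.nn as nn

n_layer, n_embd, vocab_size = 12, 768, 50257     # GPT-2 124M-like sizes

class TinyGPTSkeleton(nn.Module):
    def __init__(self):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.c_proj = nn.Linear(n_embd, n_embd)  # stands in for a residual projection
        self.c_proj.SCALE_INIT = True            # mark it for the scaled init below
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight    # weight sharing: wte <-> lm_head
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            if getattr(module, "SCALE_INIT", False):
                std *= (2 * n_layer) ** -0.5     # shrink residual-stream projections
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

m = TinyGPTSkeleton()
print(m.lm_head.weight is m.wte.weight)          # True: one shared tensor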
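Chapters 15-18 compress into: allow TF32 matmuls, run the forward pass under bfloat16 autocast, and wrap the model in torch.compile. A toy sketch assuming PyTorch 2.x; it prefers a CUDA GPU but falls back to CPU:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_float32_matmul_precision("high")       # allow TF32 on Tensor Cores

model = torch.nn.Sequential(torch.nn.Linear(768, 3072),
                            torch.nn.GELU(),
                            torch.nn.Linear(3072, 768)).to(device)
model = torch.compile(model)                     # kernel fusion, less Python overhead
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(16, 768, device=device)
opt.zero_grad()
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()                # forward in mixed precision
loss.backward()                                  # fp32 grads; no GradScaler needed for bf16
opt.step()
print(loss.item())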
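Chapter 19 swaps the manual attention math for F.scaled_dot_product_attention, which dispatches to FlashAttention-style fused kernels when they are available. A shape-only sketch with arbitrary small dimensions:

import torch
import torch.nn.functional as F

B, n_head, T, head_dim = 2, 12, 64, 64
q = torch.randn(B, n_head, T, head_dim)
k = torch.randn(B, n_head, T, head_dim)
v = torch.randn(B, n_head, T, head_dim)

# fused, memory-efficient attention with a causal mask
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)                                   # torch.Size([2, 12, 64, 64])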
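Chapter 22's schedule is a linear warmup followed by a cosine decay down to a minimum learning rate. A sketch with illustrative numbers, not the exact values of the run:

import math

max_lr, min_lr = 6e-4, 6e-5
warmup_steps, max_steps = 10, 50

def get_lr(step):
    if step < warmup_steps:                      # 1) linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:                         # 2) after decay, hold the floor
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))   # 3) cosine from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

print([round(get_lr(s), 6) for s in (0, 5, 10, 30, 50)])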
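Chapters 21 and 23 cover the optimizer details: AdamW with betas (0.9, 0.95), weight decay applied only to 2-D parameters (matrices, not biases or norm gains), the fused AdamW kernel when on CUDA, and clipping the global gradient norm at 1.0. A sketch on a throwaway model:

import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.LayerNorm(64))
params = [p for p in model.parameters() if p.requires_grad]
decay = [p for p in params if p.dim() >= 2]      # weight matrices: decayed
no_decay = [p for p in params if p.dim() < 2]    # biases, norm gains: not decayed
groups = [{"params": decay, "weight_decay": 0.1},
          {"params": no_decay, "weight_decay": 0.0}]
opt = torch.optim.AdamW(groups, lr=6e-4, betas=(0.9, 0.95), eps=1e-8,
                        fused=torch.cuda.is_available())

loss = model(torch.randn(8, 64)).pow(2).mean()
loss.backward()
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip global grad norm
opt.step()
print(f"grad norm before clipping: {norm:.3f}")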
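Chapter 24's gradient accumulation simulates a large batch by summing gradients over several micro-batches, dividing the loss by the number of accumulation steps so the result matches one big averaged batch. A sketch with a throwaway model:

import torch
import torch.nn.functional as F

model = torch.nn.Linear(32, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
grad_accum_steps = 4

opt.zero_grad()
for micro_step in range(grad_accum_steps):
    x = torch.randn(8, 32)                       # one micro-batch
    y = torch.randn(8, 1)
    loss = F.mse_loss(model(x), y)
    loss = loss / grad_accum_steps               # average, don't sum, across micro-batches
    loss.backward()                              # gradients accumulate in .grad
opt.step()                                       # one optimizer step per "big" batch
print("stepped after", grad_accum_steps, "micro-batches")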
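Chapter 25 wraps the model in DistributedDataParallel so gradients are averaged across processes after backward. A compressed sketch meant to be launched with torchrun (e.g. torchrun --standalone --nproc_per_node=N script.py), which sets RANK, LOCAL_RANK and WORLD_SIZE; it uses NCCL on GPUs and gloo on CPU:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)
ddp_rank = int(os.environ["RANK"])
ddp_local_rank = int(os.environ["LOCAL_RANK"])
device = f"cuda:{ddp_local_rank}" if torch.cuda.is_available() else "cpu"
if device.startswith("cuda"):
    torch.cuda.set_device(device)

model = torch.nn.Linear(32, 1).to(device)
model = DDP(model, device_ids=[ddp_local_rank] if device.startswith("cuda") else None)

loss = model(torch.randn(8, 32, device=device)).pow(2).mean()
loss.backward()                                  # DDP averages gradients across ranks here
if ddp_rank == 0:
    print("loss on rank 0:", loss.item())
dist.destroy_process_group()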
Description:
Embark on a comprehensive 4-hour journey to reproduce GPT-2 (124M) from scratch in this in-depth video tutorial. Explore the entire process, from building the GPT-2 network to optimizing its training for maximum efficiency. Follow along as the instructor sets up the training run according to GPT-2 and GPT-3 paper specifications, initiates the process, and analyzes the results. Gain insights into model architecture, parameter loading, forward pass implementation, sampling techniques, and data handling. Dive into advanced topics such as mixed precision training, GPU optimization, gradient accumulation, and distributed data parallel processing. Learn about hyperparameter tuning, learning rate scheduling, and evaluation methods. By the end, you'll have a thorough understanding of building and training a GPT-2 model, with practical knowledge applicable to larger language models.

Reproducing GPT-2 (124M) from Scratch - Implementation and Optimization

Andrej Karpathy