simplest baseline: bigram language model, loss, generation
training the bigram model
port our code to a script
version 1: averaging past context with for loops, the weakest form of aggregation
the trick in self-attention: matrix multiply as weighted aggregation
version 2: using matrix multiply
version 3: adding softmax
minor code cleanup
positional encoding
THE CRUX OF THE VIDEO: version 4: self-attention
note 1: attention as communication
note 2: attention has no notion of space, operates over sets
note 3: there is no communication across batch dimension
note 4: encoder blocks vs. decoder blocks
note 5: attention vs. self-attention vs. cross-attention
note 6: "scaled" self-attention. why divide by sqrt(head_size)
inserting a single self-attention block to our network
multi-headed self-attention
feedforward layers of transformer block
residual connections
layernorm and its relationship to our previous batchnorm
scaling up the model! creating a few variables. adding dropout
encoder vs. decoder vs. both (?) Transformers
super quick walkthrough of nanoGPT, batched multi-headed self-attention
back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
conclusions
Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :
Oops I should be using the head_size for the normalization, not C
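Taken together, the matrix-multiply trick, the masked softmax, and the two corrections above come down to a few lines of tensor code. The sketch below is not the video's exact code: it assumes PyTorch, and the batch size, context length, channel count, and head_size are illustrative placeholders.

import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32      # batch, time (context length), channels (illustrative values)
head_size = 16          # illustrative head size
x = torch.randn(B, T, C)

# versions 2 and 3: a masked, row-normalized (T, T) matrix multiplied into x
# averages each token's past context in a single batched matmul
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))  # tokens from the *future* cannot communicate
wei = F.softmax(wei, dim=-1)                     # each row sums to 1: uniform average over the past
xbow = wei @ x                                   # (T, T) @ (B, T, C) -> (B, T, C)

# version 4: a single self-attention head; queries and keys set the weights,
# values are what gets aggregated
key = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)
k, q, v = key(x), query(x), value(x)             # each (B, T, head_size)
att = q @ k.transpose(-2, -1) * head_size**-0.5  # "scaled" attention: divide by sqrt(head_size), not C
att = att.masked_fill(tril == 0, float('-inf'))  # decoder-style causal mask
att = F.softmax(att, dim=-1)
out = att @ v                                    # (B, T, head_size)
print(xbow.shape, out.shape)

The sqrt(head_size) scaling keeps the pre-softmax scores near unit variance, so the softmax does not saturate toward one-hot weights.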
Description:
Dive into a comprehensive tutorial on building a Generatively Pretrained Transformer (GPT) from scratch, following the "Attention is All You Need" paper and OpenAI's GPT-2/GPT-3 models. Explore the connections to ChatGPT and watch GitHub Copilot assist in writing GPT code. Begin with an introduction to ChatGPT, Transformers, nanoGPT, and Shakespeare, then progress through data exploration, tokenization, and implementing a baseline bigram language model. Delve into the core concepts of self-attention, including matrix multiplication for weighted aggregation, positional encoding, and multi-headed attention. Build the Transformer architecture step-by-step, incorporating feedforward layers, residual connections, and layer normalization. Conclude with insights on encoder vs. decoder Transformers, a walkthrough of nanoGPT, and discussions on pretraining, fine-tuning, and RLHF in the context of ChatGPT and GPT-3.
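For reference, here is a minimal sketch of the Transformer block the description refers to: multi-headed masked self-attention followed by a feedforward layer, wrapped in residual connections with layer normalization and dropout. It assumes PyTorch; the class names (Head, Block) and the hyperparameters are illustrative choices rather than the video's or nanoGPT's exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of masked (decoder-style) self-attention."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q = self.key(x), self.query(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5             # scale by sqrt(head_size)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ self.value(x)                                    # (B, T, head_size)

class Block(nn.Module):
    """Transformer block: multi-headed self-attention (communication)
    followed by a feedforward MLP (computation), each inside a residual
    connection with pre-layernorm."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList([Head(n_embd, head_size, block_size, dropout) for _ in range(n_head)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)

    def forward(self, x):
        sa = torch.cat([h(self.ln1(x)) for h in self.heads], dim=-1)  # multi-headed attention
        x = x + self.proj(sa)                                         # residual connection
        x = x + self.ffwd(self.ln2(x))                                # residual connection
        return x

block = Block(n_embd=64, n_head=4, block_size=32)
print(block(torch.randn(2, 32, 64)).shape)  # torch.Size([2, 32, 64])

The pre-norm formulation (layer norm applied before each sub-layer, inside the residual branch) is the GPT-2-style variant and keeps the residual pathway itself free of normalization.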
Let's Build GPT - From Scratch, in Code, Spelled Out