simplest baseline: bigram language model, loss, generation
training the bigram model
port our code to a script
version 1: averaging past context with for loops, the weakest form of aggregation
the trick in self-attention: matrix multiply as weighted aggregation
version 2: using matrix multiply
version 3: adding softmax
minor code cleanup
positional encoding
THE CRUX OF THE VIDEO: version 4: self-attention
note 1: attention as communication
note 2: attention has no notion of space, operates over sets
note 3: there is no communication across batch dimension
note 4: encoder blocks vs. decoder blocks
note 5: attention vs. self-attention vs. cross-attention
note 6: "scaled" self-attention. why divide by sqrt(head_size)
inserting a single self-attention block to our network
multi-headed self-attention
feedforward layers of transformer block
residual connections
layernorm and its relationship to our previous batchnorm
scaling up the model! creating a few variables. adding dropout
encoder vs. decoder vs. both (?) Transformers
super quick walkthrough of nanoGPT, batched multi-headed self-attention
back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
conclusions
Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :
Oops I should be using the head_size for the normalization, not C
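Taken together, the matrix-multiply trick, the masked softmax, and the two corrections above come down to a few lines of tensor code. The sketch below is not the video's exact code: it assumes PyTorch, and the batch size, context length, channel count, and head_size are illustrative placeholders.

import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32      # batch, time (context length), channels (illustrative values)
head_size = 16          # illustrative head size
x = torch.randn(B, T, C)

# versions 2 and 3: a masked, row-normalized (T, T) matrix multiplied into x
# averages each token's past context in a single batched matmul
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))  # tokens from the *future* cannot communicate
wei = F.softmax(wei, dim=-1)                     # each row sums to 1: uniform average over the past
xbow = wei @ x                                   # (T, T) @ (B, T, C) -> (B, T, C)

# version 4: a single self-attention head; queries and keys set the weights,
# values are what gets aggregated
key = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)
k, q, v = key(x), query(x), value(x)             # each (B, T, head_size)
att = q @ k.transpose(-2, -1) * head_size**-0.5  # "scaled" attention: divide by sqrt(head_size), not C
att = att.masked_fill(tril == 0, float('-inf'))  # decoder-style causal mask
att = F.softmax(att, dim=-1)
out = att @ v                                    # (B, T, head_size)
print(xbow.shape, out.shape)

The sqrt(head_size) scaling keeps the pre-softmax scores near unit variance, so the softmax does not saturate toward one-hot weights.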
Description:
Dive into a comprehensive tutorial on building a Generatively Pretrained Transformer (GPT) from scratch, following the "Attention is All You Need" paper and OpenAI's GPT-2/GPT-3 models. Explore the connections to ChatGPT and watch GitHub Copilot assist in writing GPT code. Begin with an introduction to ChatGPT, Transformers, nanoGPT, and Shakespeare, then progress through data exploration, tokenization, and implementing a baseline bigram language model. Delve into the core concepts of self-attention, including matrix multiplication for weighted aggregation, positional encoding, and multi-headed attention. Build the Transformer architecture step-by-step, incorporating feedforward layers, residual connections, and layer normalization. Conclude with insights on encoder vs. decoder Transformers, a walkthrough of nanoGPT, and discussions on pretraining, fine-tuning, and RLHF in the context of ChatGPT and GPT-3.
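For reference, here is a minimal sketch of the Transformer block the description refers to: multi-headed masked self-attention followed by a feedforward layer, wrapped in residual connections with layer normalization and dropout. It assumes PyTorch; the class names (Head, Block) and the hyperparameters are illustrative choices rather than the video's or nanoGPT's exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of masked (decoder-style) self-attention."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q = self.key(x), self.query(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5             # scale by sqrt(head_size)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ self.value(x)                                    # (B, T, head_size)

class Block(nn.Module):
    """Transformer block: multi-headed self-attention (communication)
    followed by a feedforward MLP (computation), each inside a residual
    connection with pre-layernorm."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList([Head(n_embd, head_size, block_size, dropout) for _ in range(n_head)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)

    def forward(self, x):
        sa = torch.cat([h(self.ln1(x)) for h in self.heads], dim=-1)  # multi-headed attention
        x = x + self.proj(sa)                                         # residual connection
        x = x + self.ffwd(self.ln2(x))                                # residual connection
        return x

block = Block(n_embd=64, n_head=4, block_size=32)
print(block(torch.randn(2, 32, 64)).shape)  # torch.Size([2, 32, 64])

The pre-norm formulation (layer norm applied before each sub-layer, inside the residual branch) is the GPT-2-style variant and keeps the residual pathway itself free of normalization.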
Let's Build GPT - From Scratch, in Code, Spelled Out