3blue1brown

First chapter for transformers (draft)

Hey folks,

This is a draft for the first of three chapters I'll add to the miniseries on deep learning, about transformers. I imagine the title for this first part to be something like "But what is a GPT?", where the goal is to offer a starting point for someone with minimal background knowledge, but who is curious to see the details.

As always, let me know if there are things you'd change. I need to re-record parts of the narration anyway, partly due to audio quirks, and partly because of a few sections I'd like to describe differently. There are some blank spots in the animations that I'll fill with previews for the following chapter on the attention blocks, which I'll be animating over the next week or two.

Grant

Comments

In terms of always looking backwards, that makes sense in terms of it not seeing the answer that it would predict in the future. The training data does contain the next word, so is it correct that each next word is in effect part of the training set answers? It might help to explain that the training works regardless of the language syntax, e.g. the Query, Key, Value technique doesn't depend on the adjective preceding the noun. For example, adverbs in English usually follow the verb and adjectives in French often follow the noun. Only looking backward still works. There isn't any ingrained knowledge of syntax in the model; it learns without needing to know about syntax, and it doesn't really learn syntax specifically, it just learns to predict the next word. Is this correct?

Guy
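
On the question of whether each next word acts as a training answer, here is a minimal sketch of how one sentence can be turned into several (context, target) pairs. It assumes whole words stand in for tokens, and the helper name `next_token_examples` is made up for illustration. Nothing in the construction refers to grammar: the French example, where the adjective follows the noun, is handled exactly like any other sequence.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs: each prefix is paired with the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# French word order: the adjective "noir" follows the noun "chat".
pairs = next_token_examples(["le", "chat", "noir", "dort"])
for context, target in pairs:
    # The context only ever contains words that come before the target.
    print(context, "->", target)
```

Each pair uses only earlier words as input, which is the "only looking backward" constraint expressed as data.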

It is an excellent description. One question I had on part 2: the deeper dive on attention is on the Wq, Wk vectors. Where do these vectors come from? They are sets of tunable parameters, but what tunes them? I may have lost track. I follow where the Embedding, Query and Key vectors come from, but not the corresponding weight vectors. Are the weight vectors generated by the neural networks after submitting the Embedding, Query and Key vectors? Just wildly guessing.

Guy

Great explanation! I'll probably have my students see it next semester when it's public :)

Benjamín Valdés

I had a question from the first chapter. If the model only has an initial vocabulary of around 50k words (GPT-3), then how would it know every word in the world? For example, how does a model know about Snape, which is not a very common word? To be honest, its presence in the 50k seems very unlikely considering how many common words are out there. You may say that it's trained on the text of Harry Potter, but then HP doesn't just have 50k unique words, right? Presumably there are more than 50k words even in an Oxford dictionary, and obviously Snape isn't there. It's unclear what the significance of that 50k initial set is and how it is selected. Hope I asked a clear question...

Sai Nishanth
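
One way to see how a fixed vocabulary can still cover rare words is that its entries are tokens (pieces of words) rather than whole words. The sketch below is a simplified stand-in, not the real procedure: GPT models use byte-pair encoding, whereas this toy uses greedy longest-match against a hand-made vocabulary. `TOY_VOCAB` and `tokenize` are hypothetical names invented for the example.

```python
def tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces


# Made-up vocabulary: a few multi-letter pieces plus single letters as a fallback.
TOY_VOCAB = {"Sn", "ape", "wiz", "ard"} | set("abcdefghijklmnopqrstuvwxyzS")

print(tokenize("Snape", TOY_VOCAB))
print(tokenize("wizard", TOY_VOCAB))
```

Because single characters are in the vocabulary, any word spelled with them can be represented, even if the word itself was never given its own entry.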

This is a really nice interpretation! Another way I look at it is as a transformation along a one-dimensional number line. Initially the logits are spread out across it in both positive and negative directions, and the likelihood of each possible term is already encoded in its value. When we apply the exponentiation (e^x1, for example), it just "shifts" all the points to the positive half, and since it's not a linear function, the distances between points also change. But the relative likelihood information among the shifted points is maintained. After that it's just a basic normalization step to get the probability distribution.
I also wondered how the word is actually chosen, which I thought would be cool to share as well (from Gemini):
Greedy Decoding: This is a simpler approach where the model picks the word with the highest probability from the distribution as the next word in the sequence. This method is efficient but can sometimes lead to repetitive or suboptimal outputs.
Sampling: This method involves randomly selecting a word from the probability distribution. Each word's chance of being chosen is proportional to its predicted probability. This approach can introduce more variation and potentially lead to more creative or natural-sounding outputs. However, it can also be less predictable than greedy decoding.

Tharun Raj
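
The steps described in the comment above (exponentiate, normalize, then pick a word) can be sketched in a few lines. This is only an illustration: the word list and logit values are invented, and the function names are hypothetical. Subtracting the largest logit before exponentiating is a common numerical-stability trick and does not change the resulting probabilities.

```python
import math
import random


def softmax(logits):
    """Map arbitrary real numbers to positive values that sum to 1."""
    shift = max(logits)  # stability trick: leaves the result unchanged
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(words, probs):
    """Greedy decoding: always take the most probable word."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return words[best]


def sample_pick(words, probs, rng):
    """Sampling: draw a word with chance proportional to its probability."""
    return rng.choices(words, weights=probs, k=1)[0]


words = ["cat", "dog", "fish"]   # made-up candidates
logits = [2.0, -1.0, 0.5]        # made-up scores, positive and negative
probs = softmax(logits)

print(greedy_pick(words, probs))
print(sample_pick(words, probs, random.Random(0)))
```

Greedy decoding returns the same word every time for the same logits, while sampling can return any of the candidates, with the likelier ones appearing more often.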

Edit: wrong video for the comment. But thank you for your great explanation :)

unhappy with patreon

Yeah, as with the other examples I was going with what a machine translation tool gave, which is of course regularly imperfect. I might as well swap it out for the correct version.

3blue1brown

Data can be positive and negative as well, I was just opting to display things without signs to keep the vectors thinner. I may be underappreciating how seriously one might take the particular values on the screen.

3blue1brown

Yes, good catch!

3blue1brown

Thanks, I'll add a little comment to emphasize different people pronounce it differently.

3blue1brown

This is excellent, as always, but now I'm left hanging half way through, really wanting the rest :D As others have said, best high level description of transformers I've seen so far.

Quirkz

Thank you! To be honest, I had to rewatch the minute between 17:50 and 18:50 several times. Probably because I got distracted by Harry Potter and got carried away.

Ekaterina Korneeva

THANK YOU! I've been waiting for this ever since ChatGPT came out in 2022. Ironically, the Chinese translation of "Attention is all you need" at 1:10 isn't quite apt. The Chinese text reads "Attention is what you need"; the meaning of "all" is missing. That, however, is indeed the first answer GPT3.5 gives me. After further prompting, it gave me the correct translation 注意力就是你所需要的一切. So I guess it's an editorial choice whether you want to use the current version. https://chat.openai.com/c/48bf14ea-e2a3-4a12-9c50-76fd7342b563

It's Wednesday My Dudes

As someone who works on novel transformer architectures this is a great intro. So glad that you're creating explanations here so I'll be able to send them to my family for a thorough explanation of what I work on haha.

Chris Duvarney

9:03 text brightness flash

C.J. Smith

Also code vertical spacing issues at 7:32.

C.J. Smith

7:10 just to make sure it gets fixed in the editing, the word "Learn" starts to gradually travel left, then all of a sudden snaps over instead of continuing its gradual transition.

C.J. Smith

This is fantastic---thank you so much!

J. Dmitri Gallow

Great video! Best explanation of Transformers that I've seen thus far! One way I like to think about SoftMax is that it forces the network to "make a decision", in a kind of pseudo-Bayesian way, i.e. it forces the network to make its "best estimate" about what its answer is (the probability distribution), given its knowledge (the logits).

Mark Matthews

I've been waiting for you to use your wonderful math diagrams to illustrate Attention and Transformers. Now let me watch this over and over to try to get a grasp of what's going on, and I need to re-read the Google paper "Attention is all you need".

John T. Draper

I have a question re: SoftMax. In the equations rendered on the blue background, you are referencing upper case "N" and lower case "n". Are they the same, for instance in the summation from n=0 to N-1 (I can't draw the summation symbol)? Is N = n in the formulas? In e^x(n), for example, it's lower case. Are they the same?

John T. Draper
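
For reference, here is the softmax written out in the usual convention. This assumes the on-screen formula follows that convention, which the thread does not confirm: uppercase N is the number of entries in the list, lowercase n is the index of one particular entry, and k is a separate index used only inside the sum.

```latex
\operatorname{softmax}(x)_n = \frac{e^{x_n}}{\sum_{k=0}^{N-1} e^{x_k}},
\qquad n = 0, 1, \dots, N-1
```

Under that reading, N and n are not the same thing: N is fixed by the length of the list, while n ranges over its positions.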

I have just accepted that there isn't a "correct" way to say this, simply because there is no convention. I like both "log-it" and "low-jit" personally.

Jake Ehrlich

Great Video! I cannot wait for some deep dives on the attention mechanisms and the intuitions behind it, from 3b1b.

Kevin

(casual conversations) nope, but this is more from hopelessness at changing convention than "logistic" being the best word for the concept. (myself, or if I could rewrite history) "expit regression", expit being the sigmoid or the inverse of the logit function, and using exponents. As above "expₙit" fits well when using a known base n other than e. expₙit(x) = (n^x) / (n^x + 1)

James Barry
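
The two formulas in this exchange, expₙit(x) = n^x / (n^x + 1) and logₙit(p) = logₙ(p / (1 - p)), are inverses of each other for any base n. A small sketch to check that numerically follows; the function names `expit_n` and `logit_n` are made up for the example, and the base defaults to e.

```python
import math


def logit_n(p, base=math.e):
    """Log-odds of probability p, taken in the given base."""
    return math.log(p / (1 - p), base)


def expit_n(x, base=math.e):
    """Inverse of logit_n: maps a log-odds value back to a probability."""
    return base**x / (base**x + 1)


p = 0.8
for base in (2, math.e, 10):
    x = logit_n(p, base)
    # Round-tripping through logit and expit recovers p, whatever the base.
    print(base, x, expit_n(x, base))
```

The base changes the scale of the log-odds value but not the probability it corresponds to, which fits the "do not care about base" reading in the thread.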

So would you call it “log”istic regression?

Eric Severson

Very nice introduction to a deep topic. Perhaps this will be addressed later, but the weights seemed to be both positive and negative whereas the "inputs" and intermediates seem to be positive only. Is this because of the non-linear function applied after the tensor multiplication? When talking about the conversion to the probability distribution at the end, the example *did* include both positive and negative numbers, so perhaps I just missed something.

Shaeeyaa

Looks great! I’m excited for the rest of the series. By the way, is there a mistake at 15:16? Perhaps I’m misunderstanding embeddings, but it seems like it should be Sushi + Germany - Japan rather than Sushi + Japan - Germany, no?

Rohen Giralt
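
To make the direction of the arithmetic concrete, here is a toy sketch using hand-made two-dimensional vectors. These are not real embeddings: the numbers, the word "bratwurst", and the helper names are all assumptions chosen so that the first axis roughly encodes country and the second encodes "is a food".

```python
import math

# Hand-made toy vectors: (country axis, food axis). Not real model embeddings.
TOY = {
    "Japan": (-1.0, 0.0),
    "Germany": (1.0, 0.0),
    "sushi": (-1.0, 1.0),
    "bratwurst": (1.0, 1.0),
}


def combine(plus_a, plus_b, minus):
    """Compute vec(plus_a) + vec(plus_b) - vec(minus) component by component."""
    return tuple(a + b - c for a, b, c in zip(TOY[plus_a], TOY[plus_b], TOY[minus]))


def nearest(vec):
    """Return the toy word whose vector is closest to vec."""
    return min(TOY, key=lambda w: math.dist(TOY[w], vec))


# Sushi + Germany - Japan: remove the "Japan" direction, add the "Germany" one.
result = nearest(combine("sushi", "Germany", "Japan"))
print(result)
```

In this toy setup, subtracting Japan and adding Germany moves the sushi vector to the German food, which matches the order proposed in the comment above.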

Fantastic work so far. It's very timely as well, while I dive deeper into AI Safety with the AISF course on Alignment.

Pawan

BTW, DM me if you'd like to have a chat with mech interp people at Anthropic if you don't already have an in. No promises but I think they would generally be happy to help explain things. I would be absolutely *ecstatic* if you were able to help people understand/visualize some mech interp concepts.

Jake Ehrlich

Agreed, that Anthropic paper is going pretty high up on the list of links I'll include in the description. Everything Chris Olah puts out is fantastic, including that sequence.

3blue1brown

I agree with the suggestions. The way I have it planned, the relevant computational complexity associated with the context size comes up most naturally when talking about attention and attention patterns.

3blue1brown

As a transformer novice, there are some points that I would like expounded in this upcoming series. First, please take advantage of your prior work on machine learning by

William Walters

I’ve heard 4 different pronunciations from researchers. “Log-it”, “lodge-it”, “low-jit”, and “low-git” (last one is the least common).

Jake Ehrlich

The context length is not really fixed by the architecture; it's fixed by practical considerations like memory/time. The MLP is element-wise over each embedding/residual, each of Q/K/V is as well, and attention is defined for any length. So once a model is trained you can use it on any length of context, even if it was only trained on shorter sequences. In practice we also vary this. Another note that I have: in the modern understanding of transformers you think about the "residual stream", see transformer circuits: https://transformer-circuits.pub/2021/framework/index.html The original 2017 paper gives the view that the residual "skips" are just there to help the network learn. The modern view is that this is more fundamental than we previously thought. Overall good stuff!

Jake Ehrlich
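
The point that attention is defined for any length can be illustrated with a minimal single-head sketch in plain Python. It is written under simplifying assumptions: identity matrices stand in for learned Wq, Wk and Wv, positional information is left out entirely, and the name `causal_attention` is made up. It shows only that the same weights apply to sequences of different lengths; it says nothing about how well a trained model behaves beyond the lengths it was trained on.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def causal_attention(xs, wq, wk, wv):
    """Single-head self-attention where position i only sees positions 0..i."""
    qs = [matvec(wq, x) for x in xs]
    ks = [matvec(wk, x) for x in xs]
    vs = [matvec(wv, x) for x in xs]
    scale = math.sqrt(len(qs[0]))
    out = []
    for i, q in enumerate(qs):
        # Scores against the current and earlier positions only.
        scores = [sum(a * b for a, b in zip(q, ks[j])) / scale for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors seen so far.
        out.append([
            sum(w / total * vs[j][d] for j, w in enumerate(weights))
            for d in range(len(vs[0]))
        ])
    return out


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weight matrices
short = causal_attention([[1.0, 0.0], [0.0, 1.0]], IDENTITY, IDENTITY, IDENTITY)
longer = causal_attention(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]], IDENTITY, IDENTITY, IDENTITY
)
print(len(short), len(longer))
```

The weight matrices have a shape that depends only on the embedding size, so nothing in the function needs to change when the input grows from two vectors to four.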

Excited to see more!

Mason Boeman

I know you prefer talking about words rather than tokens, but a video sitting alongside these that describes tokenization might be reassuring to people who distrust things appearing as if by magic. This video is great, though. Don’t try to squeeze tokenization in as well.

Bob Dowling

Excellent video! ♥ nit to pick: I prefer to pronounce "logits" as "log-(N if the base is known)-its" with a hard 'g' rather than as lodge-its. In my mind "logits" means "log-odds-but-do-not-care-about-base" logₙits(x) = logₙ(p(x) / (1-p(x)))

James Barry

Great video! Definitely one of the best videos explaining LLMs. When you talk about GPT3's context window, you might want to mention how context windows are increasing (e.g., GPT-4 Turbo has a 128k context window), then explain more about what it means conceptually/intuitively and technically (e.g., which aspects of the network have to change to have a larger context window). It might also be helpful to say something about the computational costs (e.g., how much training data/epochs) associated with training LLMs. Finally, it would be cool if you could say a little about how different LLMs differ (e.g., mistral vs llama vs gpt vs grok). Thanks for yet another great video!

Hause Lin

Great video. Best video I've seen to better understand LLMs.

Caleb Pheloung

Wow, that was a lot to chew on and digest. Great job.

Gregor Shapiro

This is great. I don't have time to preview it now, but this looks fascinating. Thanks. (I'm still recovering from the colliding-blocks-computing-Pi videos!) Edit: Okay, maybe I do have some time, and this IS fascinating, especially the attention blocks. ...Maybe color the data in green, since weights near zero would probably end up looking grey? Really looking forward to this series. Thanks!

M. Eric Carr

