Hello-GPT: building GPT from zero
Emerging technologies have a tendency to be perceived as a mystery or as some sort of magic.
Indeed, the first humans who managed to produce fire could easily have been seen by their peers as wizards -- even though the underlying principles were well understood and explained by the generations that followed.
GPTs have become a buzzword as of late, a buzzword that easily ignites fear or worship. Today I hope that together we can unravel bits of the mystery and see the technology for what it is. And we can stand in awe, with a deeper understanding, at how it works (and that it works in the first place).
As humans, to truly understand something we pretty much always need to get our hands on the problem. Hence, now and then I shall refer to the publicly available implementation of a GPT-2-like solution I posted on GitHub: https://github.com/ilkadi/hello-gpt
Let's get started!
Acknowledgements
- LLM-roadmap by Maxime Labonne was a nice starting point.
- Neural Networks: Zero to Hero and NanoGPT by Andrej Karpathy.
- 3Blue1Brown is the go-to place to recall the Math and feel it alive.
Foundations
Elements of Information Theory
In one of my previous posts, What is text, really -- I touched on our (human) psychological bias towards text: the assumption that words are well defined, that "the language" we use is important (that is, that it carries a well-defined informational load). In practice, a few simple examples are enough to see that words and text are not well defined, meanings are not well defined, and neither are the languages themselves. This is usually pretty obvious to people who speak two or more languages fluently: many words are not directly translatable; more than that, the actual concepts we hold are only partially explainable by words. To give you a fresh and simple example: the concept of "love" is experienced individually by every human being, and the actual memories associated with that experience are unique to each of us. Many relationships eventually hit the rocks because of a failure to synchronise the deeper concept between the two sides, while assuming that the word-label "love" means the same to others as it does to us. To make it even more interesting for the curious, compare the definitions of the English words "to love", "to like", "to want" with the Greek words "Agápe (ἀγάπη)", "Éros (ἔρως)", "Philia (φιλία)", "Storge (στοργή)" and "Philautia (φιλαυτία)". On the very surface, 3 does not equal 5.
Let's have a look into the concept of information from another angle.
Allow me to use characters:
[a, b, c, d, e, f, g, h, i, j] instead of digits
[1, 2, 3, 4, 5, 6, 7, 8, 9, 0].
Once you know my assumption, you can easily understand the following:
b + b = d; or "as easy as bxb"; or "it was a year bjbd".
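A minimal Python sketch of this substitution, just to show that the mapping itself is arbitrary (which is exactly the point):

```python
# Arbitrary substitution: digits 1..9,0 are written as letters a..j.
digits = "1234567890"
letters = "abcdefghij"
encode = str.maketrans(digits, letters)
decode = str.maketrans(letters, digits)

print("2024".translate(encode))   # -> bjbd ("it was a year bjbd")
print("bjbd".translate(decode))   # -> 2024
print(int("b".translate(decode)) + int("b".translate(decode)))  # b + b -> 4, i.e. d
```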
It might feel inconvenient at first, but we all felt that inconvenience at primary school when first learning digits and writing. I can choose any symbols instead of digits, but the actual information is the relative position of the digit-symbol in the sequence of ten -- not the symbol itself; at least from the perspective of what we use digits for.
Why does all of this matter in the context of GPT? The "love" concept-versus-word-label example was meant to highlight the encoding/decoding of text and speech we rely on on a day-to-day basis. Both encoding into words (of the actual information we wish to pass by speaking) and decoding from words (the perceived information about the concept another person aims to pass) involve a certain loss of information. In the illustration above, GPT-4 and DALL-E were trying to depict information as a river flow, where the person downstream is the receiver (and decoder) of it, the upstream represents the information coming from the encoder, and the digital effects aim to show the loss of information in that process. What information actually is matters, because we want models to learn the deeper underlying concepts behind the words they find: as the training data contains many variations of the word "love", those variations can be used to scope the meaning through related concepts.
The second example was meant to show that our choice of alphabet and dictionary is arbitrary. This means that for our implementation purposes we can use other ways to encode/decode the actual information. While the switch from digits to the alphabet might be easy to swallow, in practice we are going to rely on floating-point numbers. Let's have a closer look.
Data encoding and gradient descent
In one of the earlier posts, NLP, simply -- we talked about the way computers see characters in text. Among other things, we talked about encodings like ASCII, and how what we know as the letter "a" is seen by the computer as the decimal number 97 and is actually stored and processed as high-low electrical signals encoded in binary form as "01100001" (in powers of two: 64 + 32 + 1).
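You can check this on your own machine with a few lines of Python (nothing here is specific to hello-gpt):

```python
# The letter "a" is just the number 97 under the hood.
print(ord("a"))        # 97
print(bin(ord("a")))   # 0b1100001  (64 + 32 + 1)
print(chr(97))         # a
# A whole string is a sequence of such integers:
print(list("hello".encode("ascii")))  # [104, 101, 108, 108, 111]
```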
It would seem natural to build models entirely on integer numbers -- after all, what we give as input to the model and what we expect as output from it are integers (encoded characters from ASCII). On top of that, hardware wizards will also know that binary/integer computations are much more efficient than floating-point ones. So why do we use floating-point numbers?
As we are going to see pretty soon, under the hood machine learning relies on the algorithmic concept of optimisation. And it so happens that integer optimisation is NP-hard, which is another way of saying that we cannot realistically compute it. But I'm probably fast-forwarding too much here. Let's zoom out a bit.
What we have assumed so far is that:
(1) text can be represented as integer numbers,
(2) that what we call a machine learning model or AI takes as input a sequence of text-integers whose length is limited by what we are going to call the "block size",
(3) the model produces an output text-integer sequence which we can read on computer screens,
(4) there is something we can vaguely refer to as "model knowledge", which is acquired by the model through the process known as training,
(5) and what I was saying just above was simply a hint that the model knowledge has to be stored as floats for the practical reasons of algorithmic optimisation.
Why and what do we need to optimise in the first place?
Well, the idea is that the "knowledge" exists as information, therefore it can be encoded, therefore we can put it into a box and interact with it (because the interaction itself is also a part of "knowledge"). Now, the "knowledge" is constantly in flux and is too vast for a human or a group of humans to encode in any feasible way.
Imagine trying to describe a human face with words, in a high level of detail, so that it can be reconstructed independently by somebody else reading the description. It is pretty much impossible to do well, although this is essentially what facial composites ("photo-robots") were before photography was invented.
The idea of machine learning is to use a stretch foil, which can be applied to the face and remember its shape and form at the same time. The stretching of the foil itself can be imagined as an optimisation step aiming to shape the foil after the face. In other words, to solve problems, instead of descriptively reconstructing "the knowledge" to be programmed into computers, we can use a "foil"-like process over the known data to deduce it.
Typically, untrained ML models start in a blank, white-sheet state (random float numbers in a proper arrangement). For the model to learn, we need a way to evaluate its knowledge. In the case of GPT, the very simple method we can start with is to check how confident the model is about what it is doing. Indeed, if our doctor is not that sure about what they are doing, we naturally use this certainty as a metric of their competence. Of course, there are extremes to keep in mind, like confident ignorance (not knowing, and because of not knowing being confident that the answer is the best). But let's use it carefully, and then we can:
(1) start with the blank state of model knowledge,
(2) ask the model to make predictions (produce outputs) on some data,
(3) check how confident the model was,
(4) look through the model's knowledge space for the values that would make it more confident in giving the answer to our input data,
(5) update knowledge-values subtly according to the recommendation of the previous step,
(6) our model now has knowledge which is more correct for the sample input data we check it against,
(7) we repeat the whole process till the model is awesome.
Step 4 is crucial for the whole enterprise to succeed. Indeed, how can we tell, among all the vastness of model parameters (billions to trillions in modern AI), which combinations of their values would make the knowledge "better"?
This is where the use of floats and gradient descent come in. Without going into details, I can only say that we are able to calculate how much each piece of model knowledge contributes to the "badness" of a particular answer. Thanks to this metric, we have a direction towards knowledge improvement, although we don't know the exact distance we need to walk in that direction.
With that direction we can make a small step in the knowledge-values space, which we know leads us to a better place. If the step is too tiny, we will have to check the direction too often; if the step is too big, we might accidentally walk over the peak of the mountain.
If things are set up just right, eventually the model finds a sweet spot -- a place where it is as confident as it can be. In gradient descent this spot is known as a local minimum of the loss; in our confidence picture it can be understood as a mountain peak -- no matter which direction we choose to take a step, the confidence of the knowledge would decrease. And just like with mountains, being on the peak of a mountain doesn't mean being on the peak of the tallest mountain. But let's be happy with simply finding a peak for now.
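To make steps (1)-(7) and the "step size" concrete, here is a minimal gradient-descent sketch in PyTorch. It is not the hello-gpt training loop -- just a toy one-parameter "model" of my own invention, where being "fully confident" means w * 3.0 equals 6.0:

```python
import torch

w = torch.tensor(0.0, requires_grad=True)   # (1) blank state; the ideal w is 2.0
learning_rate = 0.05                         # the size of the "step"

for step in range(50):
    prediction = w * 3.0                     # (2) make a prediction
    loss = (prediction - 6.0) ** 2           # (3) how wrong / "unconfident" are we
    loss.backward()                          # (4) direction of improvement for w
    with torch.no_grad():
        w -= learning_rate * w.grad          # (5) a small step in that direction
        w.grad.zero_()                       # reset the gradient for the next pass

print(w.item())  # close to 2.0 -- (6)/(7) knowledge improved by repetition
```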
There is one small flaw in the mountain hike analogy: a hiker takes a step in a well-defined space towards the peak, while with machine learning and gradient descent it is more like we morph the whole space to match the hiker. We still look for the "peak", it's just that the "peak" is the optimal state of the space as a whole (remember our foil-on-a-face analogy above).
I very much encourage you to watch the 3Blue1Brown YouTube channel, as it is hard to explain this in a more digestible form than he did.
Elements of Graph Theory and Calculus
I wasn't entirely honest with you when I compared the AI knowledge to a box full of floating-point numbers. In reality we don't only have "words" or "characters" but also relations between them. We know who does what with whom and where in a text, based on those relations.
When we speak about the knowledge itself, it should be pretty intuitive: discrete mathematics is more relevant to the field of Computer Science than, say, Quantum Mechanics or Ancient History. If we tried to model this, imagining that our GPT answers computer science questions, we would probably see a picture like this:
Our "knowledge" is encoded as float, but we also introduce a bit of structure into knowledge, so that the model can also learn the relevance of the context. What we would hope it can learn is that Antique History is less relevant for computer science domain, although there are some overlaps -- like in all of the knowledge. So, the model would find antique knowledge relevance as "0.02" and discrete math as "0.7". We show that concepts are related by existing arrows which we are going to call edges, and we are going to call each of those concepts a node. "0.02" on the arrow we are going to call a weight on the edge between those two nodes. Since our edge has a direction represented by an arrow, we can be more specific and say that we have a business here with a directed edge. In graph theory the whole of it is a graph: a collection of nodes optionally linked to each other with edges.
This is all nice. Indeed, we can now say that we have our mythical "knowledge" of floating-point numbers, and we have a way to conceptually relate its pieces to each other. But how should we apply all of this to build an AI? Should we try to make the model learn the optimal topology (which edges should exist and which shouldn't) in our graph?
Unfortunately, if we try to make the model consider the graph's edges in exist/not-exist terms, we hit the same NP-hardness computation challenges we would have hit by using integers to store knowledge. The essence of this type of problem is that we need to try all the possible combinations (in our case, combinations of edges), each combination can behave very differently, and we cannot feasibly reuse our learnings from other combinations -- therefore the learning takes forever.
To add to our troubles, we cannot really learn if all our knowledge amounts to is a box of numbers. Gradient descent works on functions, not on bare numbers. And for gradient descent to work, those functions should be differentiable. Differentiability is not as scary a concept as it might sound -- it simply means that our space is smooth and complete, just like what we rely on in our normal reality when we open a shop door: we end up inside the shop and not somewhere on Mars. Differentiability is a more abstract and formal way to define such functions and the spaces created by such smooth functions.
Putting it all together, the above implies that:
(1) We had better define the edges of our model-graph manually, for practical reasons -- so that what we expect the model to learn is computationally feasible,
(2) The "knowledge numbers" which we were assuming that would represent the knowledge of a model -- those numbers should be arguments of some reasonable functions rather than 'just' numbers,
(3) Our model in the end has to be a statically pre-defined graph with learnable weights on the edges (to allow the model to learn the relative relevance of concepts) and with nodes standing for simple one-argument functions (where the argument of the function carries the "knowledge").
Elements of Algebra
We are almost there, I promise.
It is the last piece of foundations we need to cover.
So far, we have constructed, analytically and with common sense, a mathematical model which should be capable of learning any knowledge, except for some mundane details.
We saw how digits can be represented as characters and vice versa;
we reflected on words being an encoded form of the information rather than the information itself;
as the result we abstracted away the notion of words or language for the sake of a more general number-based representation of information;
we remember that numbers matter not because of digits but because of their relative positions and our ability to distinguish between them;
we derived conceptually how a "foil" of adaptive information can be "pressed" onto the face of reality to collect the knowledge;
we reflected on the notion of dynamism in the actual knowledge and introduced graphs with arrow-edges and information-nodes to enable our model to learn comprehensively;
finally, we considered the practical aspect of what ability to learn means and we came to the conclusion that it makes much more sense to keep nodes as functions and the "node knowledge" as an argument to that function.
It all makes sense and promises to work.
There is just one last boring thing -- how many nodes and edges do we need and how are we going to compute it?
The problem is that if we compute per node and per edge on a step-by-step basis, it is going to take a really long time. If we assume that all compute operations cost the same amount of time (they do not, of course), and if we compute in a sequential manner -- we need to complete at least as many operations as we have nodes and edges in our graph. For GPT-4, the number of parameters is rumoured to be around 1.76 trillion, which would roughly translate to 30 minutes of 3.4 GHz single-core compute time to predict a single token (roughly, a fragment of a word). And we are talking about the use of "ready knowledge"; to actually train the model we need to calculate derivatives and run predictions over and over again, which is a much more demanding enterprise.
This is why parallel computing, and GPUs in particular, are so widely used in machine learning applications. Modern GPUs count compute units in the thousands (around 16k for the NVIDIA H100, for example), and if we connect a few thousand of such GPUs into a highly efficient cluster -- we start to be able to train and run models like GPT-4.
But one does not simply run things in parallel -- there is a missing step, as we finished with a graph of nodes and edges. This is where matrices and vectors come in.
It so happens that matrices are a very convenient way to represent graphs. In the illustration above we extract the edge representation of a graph into a table. A matrix is a more subtle concept than a table, however -- so please keep that in mind. But we can loosely think of a matrix as a table for now.
What this allows us to do is compute values in parallel, all at once, passing different numbers to different compute units and getting results from all of them at the same time, once they complete the calculation. Instead of working on single numbers we now start to work more abstractly on groups of numbers -- a single "row" or "column" is known as a vector, a two-dimensional arrangement is a matrix, and the general term for such an entity is a tensor. And yes, we need more than two dimensions for our practical graphs!
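A small PyTorch sketch of the same idea: an adjacency-style weight matrix stands in for the graph's edges, and a single matrix multiplication applies all edges at once instead of walking them one by one (the numbers are illustrative):

```python
import torch

# Edge weights of a tiny 3-node graph as a matrix: entry [i, j] is the weight
# of the edge from node j to node i (0.0 means "no edge").
W = torch.tensor([[0.0, 0.7, 0.02],
                  [0.0, 0.0, 0.9 ],
                  [0.0, 0.0, 0.0 ]])

x = torch.tensor([1.0, 2.0, 3.0])  # one value per node

# All edges are applied in one operation -- this is what GPUs parallelise so well.
print(W @ x)   # tensor([1.4600, 2.7000, 0.0000])
```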
Now that we know how to convert our model-graphs into tensors and how to use tensors to compute efficiently on GPUs (or TPUs, VPUs etc.), we have a computationally feasible way to train and run our models.
Hello GPT
Hopefully by now you have an intuition that the whole idea of building an AI is theoretically possible. And hopefully you feel a little sour about the omitted details -- like "how would we know that the statically arranged blank graph has the right capacity to solve the problem we want it to solve?" Or, "how can we know whether the model has just 'memorised' the right answers or has understood the underlying concepts?"
Indeed, the structure of the graph, the size and quality of the available data, the limitations on computation capacity (who is going to invest in something that doesn't give a certain result but rather fails over and over again, as has happened with many AI applications throughout history?) -- those were all significant obstacles on the path. Let's have a look at what helped to move things forward.
Transformers and Decoder Transformer
An illustration of a decoder transformer by DALL-E.
The transformer itself consists of two major parts -- self-attention and the multi-layer perceptron (MLP). The idea behind self-attention is what the name suggests: to find the bits of information in the incoming sequence that deserve the most attention. Just like we find the key part of a sentence, or a key sentence in a long chunk of text. The interesting part is that the transformer architecture envisions having multiple "heads" to learn different dimensions of dependencies in the input sequence (like grammar vs morphology vs sarcasm).
Each of those heads projects its input (which is not the raw input, but pre-processed data we will skip for now) into three parts: queries, keys and values. The model learns weights for each of those three parts in each head. The attention itself relies on the dot product between two vectors (very roughly -- a pair-wise multiplication of two columns and addition of the results, so that a single number represents the combination of the two columns).
Importantly, the transformer used in GPTs is of the decoder type -- in simplified terms it means that the model does not learn from "future" context but only from "past" context (for a token in the middle of the sequence, the model looks at what came before it and ignores what comes after it; but as the model does this for each token, the whole picture of the sequence is still assembled).
Finally, we combine the outputs of the multiple heads of the transformer back together -- with the use of a projection layer. So many words, so little code: hello_gpt/model/self_attention.py
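For the real module, see the file above; below is only a compact sketch of single-head causal attention, to make the query/key/value and "past only" masking story concrete. The class name and sizes are mine, not the repo's:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalSelfAttention(nn.Module):
    """One attention head; a sketch, not hello-gpt's implementation."""
    def __init__(self, embed_size: int, block_size: int):
        super().__init__()
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)
        # Lower-triangular mask: a token may only look at itself and the past.
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape                                  # batch, tokens, embedding size
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(C)    # scaled dot products
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)                # attention paid to each token
        return weights @ v                                 # weighted mix of the values

x = torch.randn(1, 8, 32)                       # 1 sequence, 8 tokens, 32-dim embedding
print(TinyCausalSelfAttention(32, 8)(x).shape)  # torch.Size([1, 8, 32])
```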
While self-attention is used to learn dependencies in the sequence, there is still a need to learn the more static stuff somewhere. The Multi-Layer Perceptron (MLP) is exactly that. Using our graph terminology from the previous chapter, the scary-sounding MLP is a simple triple: a column of nodes, edges into it and edges out of it. In its simplest form an MLP is just that: --> O --> (edge with weight, node with argument, edge with weight).
In an MLP, the model learns the weights of the edges (a linear function). At the same time, the "node" in an MLP is represented by a so-called activation function. There are various considerations in picking the exact function; they impact the model's ability to learn, the speed of learning, and how likely the model is to get stuck during learning. Essentially, the model learns the parameters feeding into a function chosen by design (GELU in our case): hello_gpt/model/mlp.py
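Again, the real module lives in the file above; here is only a minimal sketch of the edge-node-edge triple (linear layer, GELU node, linear layer), with my own naming and the common 4x inner expansion as an assumption:

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Edges with learnable weights -> GELU node -> edges with learnable weights."""
    def __init__(self, embed_size: int):
        super().__init__()
        self.expand = nn.Linear(embed_size, 4 * embed_size)    # edges in
        self.activation = nn.GELU()                            # the "node" function
        self.contract = nn.Linear(4 * embed_size, embed_size)  # edges out
        self.dropout = nn.Dropout(0.1)   # randomly zero some values during training

    def forward(self, x):
        return self.dropout(self.contract(self.activation(self.expand(x))))

print(TinyMLP(32)(torch.randn(1, 8, 32)).shape)  # torch.Size([1, 8, 32])
```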
There are probably two more things to mention here. In the code you can see a dropout call. Dropout is a special mechanism that "drops" random nodes in each learning cycle. It is done to make the model more robust.
In the code you can also see a "forward" method. Typically, we refer to "forward" and "backward" directions for an ML model. Forward means the normal prediction mode, where the model is given an input and produces an output with the already learned weights. The backward pass is part of the training process, when gradient descent calculates individual influences from the loss function value. The backward pass starts from the end, where the loss is calculated, and walks leftwards with derivatives, distributing the respective influence among all of the parameters. Importantly, PyTorch maintains the underlying structures to calculate gradients for backward passes, so you don't see a manual gradient implementation here.
Now, with the self-attention and MLP modules -- we can assemble our Transformer: hello_gpt/model/decoder_transformer.py
And it is again deceptively simple -- we add our modules together.
The new things here are the two normalisation layers -- they help with floating-point numerical issues and keep the learning on track (numerical issues happen when extreme values get rounded in the computer).
You have probably noticed that we sum outputs between steps: the output of self-attention is added to its input (the input tensor has exactly the same dimensions as the output of the self-attention layer, and elements at matching positions are summed as pairs). This residual connection is one of the more recent techniques helping the model learn more efficiently, as it smooths out the gradient in the initial stages and keeps a good learning tempo. As the model learns, the influence of the residual connection over the output drops.
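Putting the pieces together, a hedged sketch of one decoder block -- normalisation, self-attention and the MLP wired up with the residual additions just described. It reuses the toy classes from the sketches above and is not the hello-gpt module itself:

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """A sketch of one decoder-transformer block, reusing the toy modules above."""
    def __init__(self, embed_size: int, block_size: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_size)   # keeps values numerically tame
        self.attention = TinyCausalSelfAttention(embed_size, block_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.mlp = TinyMLP(embed_size)

    def forward(self, x):
        x = x + self.attention(self.norm1(x))   # residual: add the output to the input
        x = x + self.mlp(self.norm2(x))         # residual again
        return x

print(TinyBlock(32, 8)(torch.randn(1, 8, 32)).shape)  # torch.Size([1, 8, 32])
```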
Embeddings
Before building the actual model, we need to prepare our input sequence data.
In the NLP post, we talked about all of those complex, manually arranged steps -- splitting text into sentences, sentences into word-like tokens, tokens later being passed to a named entity recognition module and so forth. Each step required a dedicated module with a dedicated model, and there were informational losses which we couldn't recover easily. How does the GPT architecture solve this problem?
First of all, GPT limits the amount of text considered by the model at once to a sequence of limited size, known as the block size. We can roughly compare the block size to the sentence-splitting of traditional NLP, the similarity being the independent analysis of each "sentence". The similarities end there, however, as GPT's block is not a sentence but simply a fixed-size block of tokens -- hence the name.
Secondly, GPT's tokens are something in between a single character and a word. Those tokens are essentially numbers encoding a group of characters in a lossless and intelligent way. The intelligence comes from the dataset-driven construction of the tokens, which allows longer, more commonly occurring character sequences of a particular language to be folded into the encoding. For instance, the word 'encoding' on a big English dataset is likely to split into tokens such as 'en', 'cod' and 'ing'. You can read more in the tiktoken examples and documentation: https://github.com/openai/tiktoken
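A quick way to see tokenisation in action with the tiktoken library; the exact split depends on the learned vocabulary, so treat the output as indicative rather than guaranteed:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # the GPT-2 byte-pair-encoding vocabulary

token_ids = enc.encode("encoding")           # a handful of integer token ids
print(token_ids)
print([enc.decode([t]) for t in token_ids])  # the character groups behind each id

assert enc.decode(token_ids) == "encoding"   # the encoding is lossless
```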
With this in mind, let's return to the top of bbycroft's visualisation.
We take the input text (a sequence of characters) and convert it into tokens our model recognises.
The number of input tokens our model is able to look at simultaneously is known as the block size.
The next parameter is the size of the embedding (the height of those matrices in the illustration). It can be seen as a dimension helping to represent the nuances of individual tokens. And we have two distinct embeddings: the token embedding (number of total supported tokens x embedding size) and the position embedding (maximum supported block size x embedding size).
With those combined, we can calculate the input embedding for the sequence of characters passed to the model: we simply extract a column-vector from the token-embedding matrix using the token itself as an access index; similarly, we extract a column-vector from the position-embedding matrix using the token's position in the block as an access index; what is left is to add the elements of the two columns. This is what we use as the input to the Transformer module we discussed above.
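A hedged sketch of building the input embedding -- a token embedding looked up by token id, plus a position embedding looked up by position (the sizes are illustrative, not GPT-2's):

```python
import torch
import torch.nn as nn

vocab_size, block_size, embed_size = 100, 8, 32   # illustrative sizes

token_embedding = nn.Embedding(vocab_size, embed_size)      # one vector per token id
position_embedding = nn.Embedding(block_size, embed_size)   # one vector per position

tokens = torch.tensor([[5, 42, 7, 99]])            # a batch with one short sequence
positions = torch.arange(tokens.shape[1])          # 0, 1, 2, 3

x = token_embedding(tokens) + position_embedding(positions)  # element-wise sum
print(x.shape)   # torch.Size([1, 4, 32]) -- ready to feed into the transformer
```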
Logits
At the output of our transformers, we receive information of higher dimensionality compared to that of the input sequence. What we need to do at the very end is map it back to a vocabulary-sized space.
The language model head, of dimensions (embedding size x number of supported tokens), has learned weights which turn the incoming scores into probabilities, based on which the final prediction of the next token is made.
The token returned at the very end is either the most probable one, or one chosen probabilistically from among the top probable choices, depending on the implementation.
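A minimal sketch of this final step with my own naming (the real code lives in hello_gpt/model/model.py), showing both the greedy choice and sampling from the probabilities:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_size = 100, 32
lm_head = nn.Linear(embed_size, vocab_size, bias=False)  # embedding -> vocabulary scores

final_embedding = torch.randn(1, embed_size)   # transformer output for the last token
logits = lm_head(final_embedding)              # one score per token in the vocabulary
probabilities = F.softmax(logits, dim=-1)

greedy_token = torch.argmax(probabilities, dim=-1)               # the most probable token
sampled_token = torch.multinomial(probabilities, num_samples=1)  # a sampled token
print(greedy_token.item(), sampled_token.item())
```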
Assembling the model
Now we are ready to assemble our model in its generic form: hello_gpt/model/model.py
The vocabulary size for GPT-2 is 50257 unique tokens. For comparison, in our illustration above the vocabulary size was three. GPT-3 and GPT-4 have no problems responding in different languages or switching between two of them mid-conversation -- implying that the token space covers all supported languages (46 languages for GPT-3). So the number of unique tokens should be much higher, although not as high as some derive (their estimate of 14 million tokens was based on the number of words in the supported languages, while we know that model tokens are not words themselves but parts of words -- the practical trick that helps to manage the vocabulary space).
You can also see the number of heads and the number of layers in the default config -- those are typically the parameters used for model scaling, like the differences between `gpt2` (12 layers, 12 heads, 768 embedding size) and `gpt2-xl` (48 layers, 25 heads, 1600 embedding size). The number of heads increases the cross-dimensional strength of a single transformer, while the number of layers allows the model to learn deeper and deeper associations in the data.
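As a rough sketch of how such scaling knobs might look in a config (the key names here are hypothetical rather than hello-gpt's actual config; the numbers are the published GPT-2 family sizes just mentioned):

```python
# Illustrative scaling presets; key names are made up for this example.
model_presets = {
    "gpt2":    {"n_layer": 12, "n_head": 12, "n_embd": 768},
    "gpt2-xl": {"n_layer": 48, "n_head": 25, "n_embd": 1600},
}
print(model_presets["gpt2-xl"])
```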
Otherwise, our model is "just" that -- the sequence of modules we have talked about, with normalisation and dropout layers added to assist the learning. Please note that this model is not an assistant: it is a model predicting a sequence of characters based on the sequence prompted by the user. An assistant takes more steps to build, and this model is just the first step in that process. But it is still fun to see it auto-completing text. For example, this is what a model trained from scratch on the works of Edgar Allan Poe produced:
Starting the interactive console..
>: hey
hey would have been beyond what is
the essence in question.”
“There are merely a genius,” says Le Soleil, “but it really
may be in the reason possible reason.”
“That is a good time,” said Dupin.
“I have, of course; and I have, first, as it is, a common
case, that this gentleman is a fool, I may, in a case, as well as
an observation of the metaphysician; for to write at once, I shall
attempt to bid you can above him again.”
At length, looking towards him, and looking for breath I thought
to keep him immediately
It might not look like much, but I personally appreciate the correctness of the words, the reasonable sentence structure and the Edgar Allan Poe styling -- especially given that it took around 5 hours to train on a laptop GPU.
MLops
I would like to conclude with a few remarks on MLops for Large Language Models.
I find it fascinating how greatly the meaning associated with the term "MLops" varies across specific applications. For more traditional prediction applications (like those which tell you what you might want to add to your shopping basket, or which generate targeted advertisements), MLops often takes the meaning of "DevOps for ML models". Just like in one of our previous chats, Kubeflow for the Poor, the primary challenges are around the frequency and security of deployments and the compatibility of model deployments with the wider infrastructure of the organisation. Model decay and model error analytics are crucial for the success of those projects and are important day-to-day parts of the model lifecycle.
With LLMs, a big part of MLops is the low-level optimisation of training and running the model. Software developers here might remember the pass-by-reference versus pass-by-value kind of discussions, or even the number of register copies on the CPU, or the computational cost differences between multiplication and addition (and how CPUs compare). MLops for LLMs is kind of like this, but on tensors instead of variables. It also includes the efficient setup of infrastructure (groups of GPU nodes running the model), monitoring the speed of model training, choosing parameter sizes that fit efficiently into the particular hardware, and so on.
In other words, machine learning engineering varies from specialised low-level compute optimisations, to building comprehensive data pipelines and analytics systems, to DevOps-style platform creation. No wonder organisations are confused about whom to hire and how to find the people they need.