HOW THE LLM TRANSFORMER WORKS
05.04.2026
Attempts to understand how neural networks are trained and how they operate often stall because human intuition cannot easily follow the way data actually moves through the mathematics of the software frameworks that drive these processes.
We have tried to make the workings of neural-network programs, which some call artificial intelligence, a little easier to grasp.
In this article, without formulas, we analyze the general principles of training and operating an LLM transformer at the level of GPT-3.
So, where do we begin?
We begin by dividing texts into corpora. Links to programs that do this can be found here; for now, this is not important for the theory.
Next, a tokenizer builds a vocabulary of tokens, where each token is assigned an ID from 0 up to the last one, for example 50,256 (a vocabulary of 50,257 tokens). These IDs are immutable numbers that stand in for words in all the mathematics; they accompany us through every stage of training and operation of the LLM.
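The idea can be sketched with a tiny, hand-invented vocabulary (real tokenizers such as GPT's byte-pair encoder hold about 50,257 subword entries; the words and IDs below are made up for illustration):

```python
# Toy token vocabulary: each token string maps to a fixed, immutable integer ID.
# Real models use subword tokenizers (BPE) with ~50,257 entries.
vocab = {"cat": 30, "and": 450, "dog": 21, "house": 456, "building": 217}

def encode(words):
    """Replace each word with its immutable ID from the vocabulary."""
    return [vocab[w] for w in words]

ids = encode(["cat", "and", "dog"])
print(ids)  # [30, 450, 21]
```

From this point on, the model never sees the strings again, only the IDs.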
Next, we create a table of unique position vectors (addresses of tokens within the input), for example 128 thousand rows (the size of the query context window) with 12,288 numbers in each. For our query they are numbered 0, 1, 2, 3 … up to 128 thousand, one per input token; a 128,001st token simply does not fit into the model. If a conversation runs long, the model can forget its beginning because of this limit (today this is mitigated by various attention methods).
Position vectors are built from numbers generated by sinusoid-based algorithms, which guarantees their uniqueness. They are immutable digital signatures of a token's position in the sequence. The position vector is added element-wise to the token vector (first number to first number, and so on) for each of the up-to-128-thousand tokens that fall into the query window. Position vectors matter mostly for word order: they let the model determine the order of words, the distance between words, and so on. In many modern transformers, positions are encoded without a table of unique vectors, by rotating pairs of numbers in each vector through an angle (rotary embeddings). The position index p of the token (0, 1, 2, …) and the index i of the number within the vector pass through a formula with sines and cosines and introduce small changes: noise from the point of view of semantics, but important markers of token position.
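The classic sinusoidal scheme can be sketched like this, with the dimensions shrunk for readability (GPT-3's vectors hold 12,288 numbers, not 8):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: even indices i get sin, odd get cos.
    Each position p receives a unique, immutable signature of d_model numbers."""
    table = []
    for p in range(seq_len):
        row = []
        for i in range(d_model):
            angle = p / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=8)

# The position vector is added element-wise to the token's own vector:
token_vec = [0.5] * 8                      # invented token embedding
combined = [t + p for t, p in zip(token_vec, pe[2])]  # token at position 2
```

Because every position produces a distinct mix of sine and cosine values, the model can tell "cat dog" from "dog cat" even though the token vectors themselves are unchanged.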
Next, we create the embedding table: a vector of 12,288 weights (model parameters) for each token ID, a large 50,257 × 12,288 table. In GPT-3 those 12,288 numbers split into 96 groups of 128, one group per head (more on heads below). After training, the numbers of each group are processed by the matrices of a specific head, which is responsible for a certain domain of words (grammar, connections between sentence members, connections of a specific sentence member, and so on). At first the numbers in the embeddings are random, as are the weights in the matrices (more on matrices below); during training they become organized under the matrices of specific heads.
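The relationship between the embedding table and the heads can be sketched with toy sizes (the real GPT-3 table is 50,257 × 12,288, split into 96 groups of 128 numbers; here everything is shrunk and random):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model, n_heads = 50, 96, 8   # toy sizes; GPT-3: 50257, 12288, 96
head_dim = d_model // n_heads              # 12 here; 128 in GPT-3

# Embedding table: one row of d_model weights per token ID, random at first.
embeddings = rng.normal(size=(vocab_size, d_model))

token_id = 30
vec = embeddings[token_id]                 # this token's 96-number vector

# Each head looks only at its own contiguous group of head_dim numbers.
per_head = vec.reshape(n_heads, head_dim)
```

Head 0 sees `per_head[0]`, head 1 sees `per_head[1]`, and so on; no head reads another head's slice of the vector.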
During training, we feed the model token IDs, as vectors, in the same sequence in which they occur in sentences of our language. It is precisely this frequently repeated, consistent sequence of IDs that forms connections between tokens, expressed as similarity between the numbers in their 12,288-number vectors. For example, the vector of ID 456, the word "house", [0.5, … 0.8], will end up similar to the vector of ID 217, the word "building", [0.54, … 0.79]. When we enter a word into the query chat, the tokenizer looks it up in the vocabulary and outputs the IDs of its tokens.
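"Similarity of numbers in the vectors" is usually measured as cosine similarity. A minimal sketch, using invented 4-number embeddings in place of 12,288-number ones (the word "banana" is added here purely as a dissimilar counterexample):

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of direction between two vectors: 1.0 means identical."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented embeddings standing in for trained 12,288-number vectors.
house    = [0.50, 0.10, -0.30, 0.80]
building = [0.54, 0.12, -0.28, 0.79]
banana   = [-0.70, 0.90, 0.20, -0.10]

print(cosine_similarity(house, building))  # close to 1: related words
print(cosine_similarity(house, banana))    # negative: unrelated words
```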
For example, the sentence "cat and dog" is a table of three rows: 30 [0.9 … 0.76], 450 [0.22 … 0.15] and 21 [0.4 … 0.14]. At first the model does not know that these tokens are closely related, but with each new feeding of this order of tokens it passes their numbers, in that same order, through mathematical operations that reveal numerical interdependence between the first, second and third token in exactly this order. If we swap the tokens, the results of the calculations change, and so does the structure of interdependencies of each of the three tokens with the other two, and even with itself. These are the dynamic attention weights, created while tokens are processed in their given sequence by constantly comparing token vectors with one another.
Stable interdependencies in the results of these training-time calculations settle into the numbers we call constant matrix weights. The dynamic attention weight between one token and a token close to it in the sequence is the largest in that token's attention distribution: the token "cat" carries its largest attention weight toward "dog", and "dog" toward "cat", but only for exactly this input sequence, the sentence "cat and dog".
It is this economy of computation (the path of least resistance) when using the dynamic weights that reveals the token order we need as the most stable one, and the model reproduces it by drawing tokens from the vocabulary in exactly that order. When generating text, the model uses the constant weights and the dynamic attention weights to find the token that, combined with the previous ones, forms the most stable and coherent structure among the billions of connection variants (constant matrix weights) the model has learned.
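A deliberately simplified sketch of such dynamic weights: here we compare raw token vectors directly by dot product and normalize with softmax, without the Q/K/V projections that a real head applies. All vectors are invented 3-number stand-ins for 12,288-number ones:

```python
import numpy as np

def softmax(x):
    """Turn raw comparison scores into weights that sum to 1."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Invented vectors for "cat", "and", "dog".
vecs = {"cat": np.array([0.9, 0.1, 0.8]),
        "and": np.array([0.1, 0.9, 0.1]),
        "dog": np.array([0.85, 0.2, 0.95])}

# Dynamic attention scores for "cat": compare its vector with every token
# in the sequence (itself included), then normalise.
scores = np.array([float(vecs["cat"] @ v) for v in vecs.values()])
weights = softmax(scores)   # one weight per token, in order: cat, and, dog
```

With these made-up numbers, "cat" ends up attending most strongly to "dog" and only weakly to "and", mirroring the example in the text.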
After 12 or 96 passes through the matrices (layers, depending on model size), the initial numbers of the sentence "cat and dog" are transformed into a final set of numbers which, when projected onto the general token vocabulary, produces the highest value (maximum activation) opposite the index of one specific token ID (for example, the word "cat").
Now let's look at the heads, each of which holds three weight matrices: the query matrix Q answers the question "What word am I looking for?", the key matrix K answers the question "Who am I?" (for example, "building"), and the value matrix V answers the question "What information do I pass on if I match the query?". These weight matrices are columns of weight vectors in an already trained model. The number of rows in each matrix always equals the number of numbers in a token's vector. Each head looks only at the numbers of the vector created specifically for it: its own group of numbers (typically 128), which is multiplied by that head's matrices.
The vector of each word token is multiplied by all three matrices in each head. A head learns to connect words as parts of a sentence, and for each kind of connection (for example, action and actor: "Peter mows"; or punctuation; or word order) there is a dedicated head.
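One head's full computation can be sketched with shrunken sizes (d_model = 12 and head_dim = 4 instead of GPT-3's 12,288 and 128; all weights are random here, whereas in a trained model they are learned):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, head_dim = 12, 4                  # toy sizes; GPT-3: 12288, 128

# One head's three weight matrices: query, key, value.
Wq = rng.normal(size=(d_model, head_dim))
Wk = rng.normal(size=(d_model, head_dim))
Wv = rng.normal(size=(d_model, head_dim))

X = rng.normal(size=(3, d_model))          # three token vectors ("cat and dog")

Q, K, V = X @ Wq, X @ Wk, X @ Wv           # each token gets its own q, k, v

# Scaled dot-product attention: how strongly each token attends to each other.
scores = Q @ K.T / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1

out = weights @ V                          # new 3 x head_dim representation
```

The output `out` is the head's contribution for each of the three tokens; the outputs of all heads are then concatenated back into one d_model-sized vector.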
We multiply the token vector (all 12,288 of its numbers) by the weight matrix: each number of the vector is multiplied by the corresponding number in a column of the matrix, and the products are summed. If the head's matrix has 128 columns, we get 128 new numbers. Each column of 12,288 weights in the matrix is a filter of the trained model, and each of the 12,288 rows can be thought of as a feature. The largest results pass through; the smaller ones disappear because the activation function turns them into 0.
From the sums over all 128 columns we obtain a new vector of 128 numbers; these are dot products. The length of such a vector depends on the model architecture: it can be 256, 512, 768 or even more in large models. But the principle is unchanged: one array of numbers that is the result of all the matrix multiplications.
By multiplying a word's token vector by the three matrices, we obtain three vectors, Q, K and V, for each token of the word; that is, we map the token's vector onto certain vectors in the matrices.
During training, the weights in this last matrix are tuned as digital filters: they mathematically recognize specific patterns of numbers in the final vector. Through the dot product, the token-vocabulary matrix detects maximum similarity between the computation result and a specific word ID, which lets the model point precisely at the next token.
In other words: we fed in the sequence A and B, the formulas with their weights produced a vector V, the vocabulary matrix reacted to V and produced token ID C. If the model made a mistake, we change the weights (the numbers in the matrices) so that next time, given the same V, the correct ID is selected. This is training.
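This "adjust the weights until the right ID wins" loop can be sketched as gradient descent on the output matrix alone (a real model updates all layers; sizes, seed and learning rate here are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab_size = 8, 10                # toy sizes; GPT-3: 12288, 50257

# Output (unembedding) matrix, random before training.
W_out = rng.normal(size=(d_model, vocab_size)) * 0.1

h = rng.normal(size=d_model)               # the final vector V for "A B"
target_id = 3                              # the correct next token, C

# Training loop: when the predicted ID is wrong, nudge the weights so that
# the same h scores target_id higher next time (gradient descent on
# softmax cross-entropy).
for _ in range(50):
    logits = h @ W_out
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = np.outer(h, probs - np.eye(vocab_size)[target_id])
    W_out -= 0.5 * grad

print(int(np.argmax(h @ W_out)))  # 3 — the correct ID now wins
```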
Now this vector must be converted back into a token ID through a projection onto the vocabulary.
The model has a final weight matrix. Its size is [12,288 (the model dimension) by the number of tokens in the vocabulary (here 50,257)]. The new 128-number vectors of all 96 heads are assembled into one large vector of 12,288 numbers, which is multiplied by this huge table, that is, compared (via dot product) with each row of this matrix. Each row of the matrix corresponds to a specific ID from the vocabulary.
For example, if the final vector's pattern of numbers resembles row No. 450 of this matrix, the output opposite that row will be a large number; if it does not, the result will be small or negative.
After this multiplication we obtain an array as long as the entire vocabulary (50,257 numbers): opposite ID 10 the number 0.1, opposite ID 450 the number 12.5 (the largest), opposite ID 2100 the number -2.3. The model takes token ID 450 and outputs it as the next word.
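The projection onto the vocabulary can be sketched with tiny made-up sizes: six tokens of four numbers each instead of 50,257 of 12,288, with the matrix values invented by hand so that one row clearly matches:

```python
import numpy as np

d_model, vocab_size = 4, 6                 # toy sizes; GPT-3: 12288, 50257

# Final weight matrix: one row of d_model weights per vocabulary ID.
W_vocab = np.array([
    [0.1, 0.0, 0.2, 0.1],   # ID 0
    [0.9, 0.8, 0.1, 0.0],   # ID 1
    [0.0, 0.1, 0.9, 0.8],   # ID 2  <- pattern resembling our final vector
    [0.2, 0.1, 0.0, 0.1],   # ID 3
    [0.1, 0.2, 0.1, 0.0],   # ID 4
    [0.3, 0.0, 0.2, 0.2],   # ID 5
])

final_vec = np.array([0.1, 0.2, 1.0, 0.9])  # result of all the layers

logits = W_vocab @ final_vec                # one dot product per row
next_id = int(np.argmax(logits))            # ID with the maximum activation
print(next_id)  # 2
```

The row whose pattern of numbers best matches the final vector produces the largest dot product, and its ID becomes the next output token.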
If the article contains inaccuracies or errors, you can write to us or leave a comment under the article. We will be grateful to you, as will those who are just learning to work with neural networks. Tables on training models with data tracing can be found on the main page of the site.
On arXiv there is an interesting article on a physical interpretation of attention in transformers. It treats attention as a system similar to thermodynamic equilibrium (a stable system state): tokens are the possible states of the system, attention weights are probabilities, and their distribution corresponds to the distribution of particles over energy levels in statistical physics.
Visit the SAID-Test to practice identifying fake generations.
said correspondent
You can create a separate thread on the community forum.