What an LLM is really doing. A field guide to the transformer, one sentence at a time.

Press enter on a prompt and it can feel like the model understood you, thought it over, and replied. What it actually did is stranger and far simpler. A large language model has one trick: given the text so far, it guesses the next token. Then it adds that token to the text and guesses again. Every billion parameters exists to make that one guess good.

Almost all modern models share the same shape under the hood: a stack of transformer blocks, repeated. Learn the machinery once and you can read most model cards and papers and know which part each section is talking about. So let us follow one sentence the whole way through.

We trace a single sentence, “The cat that I saw yesterday was sleeping,” from raw text to a predicted next word, and explain each stage by the problem it exists to solve.

The full transformer pipeline, top to bottom: text, tokenizer, token IDs, embedding lookup, positional encoding, a repeated block of attention and feed-forward, then unembedding, softmax, and the next token. — The whole pipeline. Text becomes integers, integers become vectors, a stack of identical blocks refines those vectors, and a final step turns the last vector into a next-token guess.

Nine stages. Each one cleans up a problem the previous stage left behind. Here we go.

tokensembedpositionattentionmulti-headFFNresidualpredict

1. Tokens: text has to become numbers

The problem

A neural network does arithmetic on numbers. It cannot operate on the letters c-a-t. Something has to turn your prompt into integers first.

That something is the tokenizer. It chops the string into pieces and replaces each piece with an integer that points to a row in a fixed list called the vocabulary. Modern vocabularies hold tens of thousands to a few hundred thousand pieces.

The pieces are usually not whole words, they are subword fragments. “tokenization” might become [“token”, “ization”]; “running” might become [“run”, “ning”]. Whole-word vocabularies would be enormous and still miss new words. Letter-by-letter vocabularies would be tiny but force the model to learn everything from scratch. Subwords sit in the middle: common chunks get their own token, rare words are assembled from smaller ones.

The word strawberry split into the subword tokens str, aw, and berry, each mapped to an integer ID; the model only sees the numbers. — One word, split into subword pieces, each mapped to an integer ID. From here on the model works only with the numbers.

This is why the famous “how many R’s in strawberry” trick stumped models for years. The model never sees s-t-r-a-w-b-e-r-r-y. It sees a couple of token IDs that happen to spell the word to a human. Counting letters it cannot see is hard. Reasoning models do better now, mostly by spelling the word out token by token first, but the root cause was always tokenization, not arithmetic.

What it costsFewer tokens means less work. The tokenizer choice drives both running cost and how well a model handles languages other than English.

tokensembedpositionattentionmulti-headFFNresidualpredict

2. Embeddings: numbers have to carry meaning

The problem

An ID like 9059 is just a row number. It says nothing about what the token means or how it relates to anything else.

Meaning comes from the embedding matrix: a giant lookup table with one row per vocabulary entry, where each row is a long list of numbers (a vector). In a 7B-class model that row is often around 4,096 numbers wide. The tokenizer hands over an ID, the model grabs that row, and from now on the token is that vector.

The striking part is what training does to this table on its own. Tokens with similar meanings end up with nearby vectors. “king” lands near “queen,” “Paris” near “France.” Nobody coded that. It falls out of learning to predict text well. You can even do clumsy arithmetic in the space: king minus man plus woman lands close to queen.

A 2D sketch of embedding space: man to king runs parallel to woman to queen, and country-capital pairs like Paris-France and Tokyo-Japan cluster together. — Meaning becomes geometry. Related words sit close together, and consistent relationships (man to king, woman to queen) show up as parallel offsets.

One catch. At this stage “dog” has the same vector whether it is the first word of your prompt or the fifth. The meaning is there, the position is not. That is the next problem.

tokensembedpositionattentionmulti-headFFNresidualpredict

3. Position: order has to be injected

The problem

Attention, the next stage, has no built-in sense of order. Without help it cannot tell “dog bites man” from “man bites dog.”

The 2017 transformer fixed this by adding a position pattern to each token’s vector, built from sine and cosine waves so position 1 looked different from position 5. It worked, but it forced one set of numbers to carry both meaning and place, and learned position tables did not stretch past the lengths seen in training.

Most models now use Rotary Position Embeddings (RoPE). Instead of adding a position vector, RoPE rotates each token’s vector by an angle that grows with its position. When two tokens are later compared in attention, what matters is the difference between their rotations, which is exactly the relative distance between them.

The same vector rotated by a larger angle as its position grows, shown at position 1, position 10, and position 100 (RoPE). — RoPE encodes position as rotation. A token early in the sequence is barely turned; a token far along is turned much more. Attention reads the angle between them.

Even with good position handling, long prompts show a documented “lost in the middle” effect: models lean on the start and end of a long context more than the middle. That is why “put the important thing first, repeat it at the end” is real advice, not superstition.

tokensembedpositionattentionmulti-headFFNresidualpredict

4. Attention: tokens have to share information

The problem

Each token is still an island. “was” needs to know what was sleeping, but nothing has let it look back at “cat” yet.

Attention is the mechanism that gave the architecture its name. Inside every block, each token looks at the other tokens it is allowed to see and decides which ones matter. It does this by casting each token into three roles, three learned vectors: a Query, a Key, and a Value.

A token’s Query is compared against every Key it can see using a scaled dot product, which scores how well two vectors line up. Softmax turns those scores into weights that sum to 1, and the output is a weighted average of the matching Value vectors. High match, high weight, big contribution.

The token was sends a query back to earlier tokens and attends most strongly to its subject, cat, while future tokens are masked. — Processing “was,” the model’s query lines up most strongly with the key for “cat,” several tokens back. The subject dominates the weighted sum. Future tokens are hidden.

In a left-to-right model there is a rule: a token may only attend to itself and the tokens before it. Future positions are masked, given weights so low they vanish after softmax. That is what keeps the model honest while it predicts the next word.

One of the cleanest findings in interpretability is the induction head: an attention head that spots a pattern like “A B … A” and predicts B again. It is a big part of how a model picks up a pattern from your prompt and continues it, the thing we loosely call in-context learning.

What it costsEvery token compares against every other, so doubling the prompt roughly quadruples the work. This one fact drives much of modern systems research: FlashAttention, sparse and linear attention.

tokensembedpositionattentionmulti-headFFNresidualpredict

5. Multi-head attention: one view is not enough

The problem

Language has many relationships running at once: subject and verb, pronoun and name, local phrase and long-range reference. A single attention pass can only learn one way of looking.

So the model runs attention many times in parallel. Each pass is a head with its own learned projections. A point most tutorials get wrong: a head does not get a fixed slice of the token vector. It learns its own small projection of the full vector. With 4,096 numbers and 32 heads, each head works in its own roughly 128-wide space. A different view of the same token, not a different chunk of it.

The token was read in parallel by several attention heads (grammar, coreference, position, induction), whose outputs are concatenated and mixed into one vector. — Several heads read the same token in parallel, each tuned to a different kind of relationship. Their outputs are concatenated and mixed back into one vector.

Nobody assigns jobs to heads, the specialization emerges. Researchers have found heads that track grammar, heads that resolve which name a pronoun points to, heads that follow position, induction heads, and many more. A layer might have 32 heads; a frontier model has dozens of layers; so a model carries thousands of heads, each adding its own learned view.

That memory cost drove a real change. In Grouped-Query Attention (GQA) many query heads share a smaller set of key/value heads (LLaMA-2 70B uses 64 query heads but 8 key/value heads). DeepSeek pushed further with Multi-head Latent Attention, which compresses the key/value cache into a small latent vector per token. Similar quality, far less memory, which is a big reason it serves so cheaply.

tokensembedpositionattentionmulti-headFFNresidualpredict

6. The feed-forward network: each token thinks on its own

The problem

Attention moves information between tokens, but does little heavy computation within a token. Something has to actually process each token’s updated vector.

That is the feed-forward network, the quieter half of every block. It runs on each token independently, with no cross-token mixing, and does three things: expand the vector to a larger width, apply a non-linearity, then compress it back.

A token vector expanded to a larger width, passed through a non-linearity, then compressed back to its original size, run independently on each token. — Per token: expand, bend, compress. The non-linearity in the middle is what lets stacked layers do something richer than a single matrix multiply.

Most of a dense model’s parameters live here, not in attention, and they are not generic. This is where much of the model’s stored knowledge sits. Researchers have found single neurons that fire on a specific concept, and methods like ROME can edit a fact, moving the Eiffel Tower from Paris to Rome, by a targeted tweak to one feed-forward weight matrix, with no retraining.

The largest models now often replace one feed-forward network with many. A Mixture of Experts layer holds a set of expert networks plus a tiny router that sends each token to only a couple of them.

A router sending each token through only two of eight feed-forward experts, so total parameters grow while compute per token stays small. — A router activates only a few experts per token. DeepSeek V3 carries 256 experts per layer and fires 8, giving 671B total parameters but about 37B active for any given token.

What it costsMixture of Experts buys huge parameter counts without a matching jump in compute per token, because only a few experts run. It pays for that in memory and routing complexity instead.

tokensembedpositionattentionmulti-headFFNresidualpredict

7. The residual stream: how a deep stack stays trainable

The problem

Stack thirty or a hundred of these blocks and two things break: the training signal fades before it reaches the early layers, and the running numbers either blow up or collapse.

The first fix is the residual connection. A block’s output is added to its input rather than replacing it. Across the stack this becomes a running sum, the residual stream, with a direct additive path from the input embedding all the way to the end. The trick predates transformers; it came from ResNet in 2015 and made hundred-layer image networks trainable for the first time.

A vertical residual stream from input embedding to final vector; attention and the feed-forward network branch off and add their output back into the stream. — Attention and the feed-forward network do not overwrite the token. They write their results back into a shared stream that every later layer reads from and adds to.

The second fix is normalization, which rescales each token’s vector back into a stable range between sub-blocks so the running sum neither explodes nor vanishes. Two refinements stuck: applying it before each sub-block instead of after (pre-norm), and a cheaper variant called RMSNorm that keeps only the rescaling and drops the mean-centering.

tokensembedpositionattentionmulti-headFFNresidualpredict

8. Predicting the next token: the loop that writes

After the last block, every token has a final vector. To generate, the model takes the final vector of the last token, and an unembedding step turns it into one raw score per vocabulary entry. With a 100,000-token vocabulary that is 100,000 numbers, called logits.

Softmax turns the logits into a probability distribution, and the model samples from it. It usually does not just take the top choice. Decoding settings shape the output: temperature sharpens or flattens the distribution, top-k and top-p trim it to the plausible candidates. Same model, precise in one setting and playful in another.

Then the chosen token is appended and the whole thing runs again on the longer sequence, reusing the KV cache so the prefix is not recomputed. New vector, new logits, new token. A paragraph is just this loop, one token at a time, until an end token or a length limit. A neat speedup, speculative decoding, lets a small fast model draft several tokens while the big model verifies them in parallel, accepting the run when it agrees. Done right the output matches the big model exactly, but faster.

This single objective, predict the next token, is the entire training signal for a base model. It is never directly taught facts, conversation, or reasoning.

Everything human-feeling about a chatbot is added afterward, in post-training: instruction tuning, learning from human preferences, safety. The biggest shift of the last two years lives here too. Reasoning models like OpenAI’s o-series and DeepSeek R1 are taught, with reinforcement learning and special “thinking” tokens, to spend more compute at inference, writing a long internal chain of thought before answering. The architecture barely changes. The model simply learns that thinking longer produces better next tokens.

So what is actually different between GPT, Claude, Gemini, and the open models?

Less than the branding suggests, at this level. They sit in the same design space: tokenize, embed, add position, stack blocks of multi-head attention and feed-forward, carry a residual stream, normalize, predict the next token. What varies is three things:

The trained weights. Different data, different scale, different quality. This is most of the difference you actually feel.
The configuration. Layer count, vocabulary size, head count, parameter count, dense or Mixture of Experts.
The post-training. Instruction tuning, preference learning, safety, and now reasoning, all applied on top of the base model.

The interesting thing is how much the field converged. Independent teams arrived at nearly the same modern stack: pre-norm placement, RMSNorm, RoPE, SwiGLU, grouped or latent attention, and Mixture of Experts in the largest models. None of it arrived at once. It accumulated over about five years on top of the 2017 design.

Will it last? Maybe not forever. State-space models like Mamba offer linear-time sequence handling, attractive for very long inputs, and hybrids like Jamba interleave a few attention layers among many state-space layers (plus Mixture of Experts) to hold long contexts cheaply. So far the read is that hybrids win on long, memory-bound serving while attention still wins on associative recall and in-context learning. The safe bet is that the core problems in this post, turning text into numbers, giving numbers meaning and order, letting tokens share information, processing each one, keeping a deep stack trainable, and predicting the next token, outlive whatever architecture is fashionable.

If you have read this far, you can open almost any model card and know what each line is talking about. That was the whole goal.

At Kensink Labs we build on top of these models for a living, and the teams that ship reliable AI tend to be the ones who understand the machine they are renting, not the ones chasing the largest number on a leaderboard. Knowing where facts live, why context windows behave oddly, and what post-training can and cannot fix is what separates a demo from a product.

Sources and further reading

Vaswani et al., Attention Is All You Need (2017). The original transformer.
Su et al., RoFormer / RoPE (2021). Rotary position embeddings.
Liu et al., Lost in the Middle (2023). Long-context position bias.
Olsson et al. (Anthropic), In-context Learning and Induction Heads (2022).
Meng et al., ROME: Locating and Editing Factual Associations (2022).
DeepSeek-AI, DeepSeek-V3 Technical Report (2024). MLA and large-scale Mixture of Experts.
Lieber et al., Jamba: A Hybrid Transformer-Mamba Model (2024).
Sebastian Raschka, The Big LLM Architecture Comparison (2025).

What an LLM is really doing.A field guide to the transformer, one sentence at a time.

1. Tokens: text has to become numbers

2. Embeddings: numbers have to carry meaning

3. Position: order has to be injected

4. Attention: tokens have to share information

5. Multi-head attention: one view is not enough

6. The feed-forward network: each token thinks on its own

7. The residual stream: how a deep stack stays trainable

8. Predicting the next token: the loop that writes

So what is actually different between GPT, Claude, Gemini, and the open models?

Sources and further reading

Got a project that needs this thinking?

What an LLM is really doing.
A field guide to the transformer, one sentence at a time.