What Is a Transformer?

A Plain-English Guide to the Architecture Behind Modern AI

Rustem

CEO

Every time you use ChatGPT, Gemini, Claude, or any modern AI assistant, you're interacting with something built on a single foundational idea: the Transformer. First introduced in a 2017 paper called "Attention Is All You Need," this architecture quietly became the engine powering almost every large language model (LLM) in existence today.

But what actually is a Transformer, and why did it replace everything that came before it? Let's walk through it from the ground up.

First, what problem are we solving?

Natural language processing — the field of getting computers to understand and generate text — breaks down into three broad categories of tasks.

The first is classification: given a piece of text, predict a label. Is this movie review positive or negative? What language is this written in? What does the user want to do?

The second is token-level classification (often called sequence labeling): predict a label for each token rather than one label for the whole input. Named entity recognition, for example, asks the model to identify which words in a sentence are locations, people, or organizations.

The third, and most relevant today, is generation: take text in, produce text out. This is machine translation, question answering, summarization, and code generation — everything that makes modern AI assistants feel like assistants.

The problem with words: how models read text

Before any model can process language, it needs to convert text into numbers. This is called tokenization — the act of splitting text into discrete units.

The simplest approach is word-level tokenization: split on spaces. But this creates problems. "Bear" and "bears" become completely separate, unrelated tokens. Words you never saw during training get marked as unknown.

A more sophisticated approach is subword tokenization. Instead of treating every word as a unit, this method breaks words into frequently occurring fragments, which often (though not always) line up with roots and suffixes. "Bears" might become "bear" plus "s". This means the model can handle words it has technically never seen in full, because it knows the parts. The tradeoff: your sequences get longer than with word-level tokens, and longer sequences mean more computation.

Character-level tokenization solves the unknown word problem entirely — every character is a valid token — but the sequences become extremely long, and it's very hard to build meaning from individual letters.

Most modern models use subword tokenization, which hits a useful middle ground between all of these.
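To make the three options concrete, here is a minimal sketch in plain Python. The greedy longest-match splitter and the tiny `TOY_VOCAB` are assumptions made for illustration; real subword tokenizers learn their vocabulary from a large corpus and use more careful splitting rules.

```python
def word_tokenize(text):
    """Word-level: split on whitespace."""
    return text.split()


def char_tokenize(text):
    """Character-level: every character is its own token."""
    return list(text)


def subword_tokenize(word, vocab):
    """Greedy longest-match subword split (a simplification for illustration)."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known piece starts here: mark one character as unknown.
            tokens.append("<unk>")
            start += 1
    return tokens


# Hypothetical toy vocabulary; real ones are learned from data.
TOY_VOCAB = {"bear", "s", "read", "ing", "teddy"}

print(word_tokenize("teddy bears reading"))    # ['teddy', 'bears', 'reading']
print(subword_tokenize("bears", TOY_VOCAB))    # ['bear', 's']
print(subword_tokenize("reading", TOY_VOCAB))  # ['read', 'ing']
print(len(char_tokenize("bears")))             # 5
```

Notice that "bears" is one token at the word level, two at the subword level, and five at the character level: the same tradeoff between vocabulary coverage and sequence length described above.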

Turning tokens into meaning: embeddings

Once you have tokens, you need to represent them as vectors — lists of numbers that a neural network can work with.

The naïve way is a "one-hot" vector: a list with a single 1 and everything else 0, one slot per word in the vocabulary. This is clean and simple, but it has a fatal flaw: every word is equally dissimilar to every other word. "King" and "queen" would be just as different as "king" and "refrigerator."
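A quick sketch shows the flaw. The three-word vocabulary is a made-up example; the point is only that the dot product between any two different one-hot vectors is always zero, so no pair of words looks more related than any other.

```python
def one_hot(index, size):
    """Return a vector of zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical three-word vocabulary.
vocab = ["king", "queen", "refrigerator"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

print(dot(vectors["king"], vectors["queen"]))         # 0
print(dot(vectors["king"], vectors["refrigerator"]))  # 0
print(dot(vectors["king"], vectors["king"]))          # 1
```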

The solution, popularized by a 2013 model called word2vec, is to learn embeddings from data. The model is trained on a proxy task (predict a word from its neighbors, or the neighbors from the word) and as a byproduct, it learns to place words that appear in similar contexts close together in vector space. The classic demonstration: the vector for king minus man plus woman lands closest to queen. The arithmetic works because the embeddings have captured genuine semantic relationships.
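The analogy arithmetic can be sketched with hand-made vectors. Everything in `EMB` is an assumption chosen so the example works: the two dimensions loosely stand for "royalty" and "gender," whereas real embeddings are learned from data and have hundreds of dimensions with no such tidy labels.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1 means same direction, negative means opposed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-d embeddings, written by hand rather than learned.
EMB = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "refrigerator": [-0.9, 0.0],
}

# king - man + woman, computed one dimension at a time.
target = [k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"])]

# Search the remaining words for the closest match to the target vector.
candidates = {w: v for w, v in EMB.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(best)  # queen
```

Unlike one-hot vectors, these dense vectors also make "king" more similar to "queen" than to "refrigerator."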

But word2vec has a significant limitation: each word gets one fixed representation, regardless of context. The word "bank" has the same embedding whether you're talking about a river bank or a financial institution.

Enter the RNN — and its fatal flaw

Recurrent neural networks (RNNs) were the dominant architecture before Transformers. An RNN processes tokens one at a time, maintaining a "hidden state" — a running summary of everything it has seen so far. Each new token updates this summary.
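Here is a minimal sketch of that loop, shrunk down to a single number for the hidden state. The weights `w_h` and `w_x` are arbitrary values picked for illustration; a real RNN uses learned weight matrices and vector-valued states.

```python
import math


def rnn_step(hidden, x, w_h=0.5, w_x=1.0):
    """One update: mix the old summary with the new input, then squash."""
    return math.tanh(w_h * hidden + w_x * x)


def run_rnn(inputs):
    """Process inputs one at a time, returning the hidden state after each."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = rnn_step(hidden, x)
        states.append(hidden)
    return states


# A strong first input followed by silence: watch its trace fade.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

Each step has to wait for the previous one to finish, and the influence of the first input gets smaller with every update.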

This allowed models to capture word order and context, which word2vec couldn't do. But RNNs had a critical weakness: they struggled with long sequences. Information from early in the sentence would get diluted or lost by the time the model reached the end. Training made this worse through what is known as the vanishing gradient problem: the signal used to update the model's weights has to travel backward through every step in the sequence, and it tends to shrink to near-zero over long chains, so the model struggles to learn long-range connections in the first place.
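The shrinking signal is easy to see numerically. This sketch assumes a toy scalar RNN with a tanh activation and arbitrary weights (`w_h=0.5`, `w_x=1.0`); it applies the chain rule to measure how much the final hidden state depends on the initial one.

```python
import math


def gradient_through_time(inputs, w_h=0.5, w_x=1.0):
    """Return d(h_T)/d(h_0) for a scalar tanh RNN via the chain rule."""
    hidden = 0.0
    grad = 1.0
    for x in inputs:
        hidden = math.tanh(w_h * hidden + w_x * x)
        # Local derivative of this step with respect to the previous state.
        grad *= w_h * (1.0 - hidden ** 2)
    return grad


for length in (5, 20, 50):
    print(length, gradient_through_time([1.0] * length))
```

Every step multiplies the gradient by a factor smaller than one, so the product collapses toward zero as the sequence gets longer.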

LSTMs (Long Short-Term Memory networks) were invented to address this, introducing a separate "cell state" designed to carry important information over longer distances. They helped, but didn't fully solve the problem.

The other major issue with RNNs: they're slow to train. Because each step depends on the previous one, you can't parallelize the computation across the positions in a sequence. Training on long sequences means a lot of waiting.

Attention: the insight that changed everything

The idea behind attention is simple and powerful: instead of forcing information to travel step-by-step through a sequence, give every token a direct connection to every other token.

When an RNN tries to translate "a cute teddy bear is reading" into French, the model has to carry the meaning of the first word all the way through to the end. Attention says: when you're generating the output for a given word, just look directly at the relevant parts of the input, all at once.

This was introduced in 2014 as an add-on to existing RNN architectures. But the 2017 Transformer paper went further: what if you got rid of the sequential processing entirely and used attention for everything?

Self-attention: the core of the Transformer

Self-attention is the mechanism that lets every token in a sequence look at every other token simultaneously, and compute a new representation of itself that's informed by its context.

Here's how it works mechanically. For each token, the model computes three things: a query, a key, and a value. These are learned projections of the token's embedding into different spaces.

To compute the new representation of, say, "teddy bear," the model takes the query for "teddy bear" and compares it (via dot product) against the keys of every token in the sequence, including its own. This produces a set of scores: how relevant is each token to understanding "teddy bear" right now? These scores are passed through a softmax function to produce a probability distribution, a set of attention weights that sum to one. Finally, the model takes a weighted sum of the value vectors, using those attention weights. The result is a new representation of "teddy bear" that has been shaped by everything else in the sentence.

This is why "bank" can have different representations in different contexts: the attention scores will be completely different depending on whether the surrounding words are "river" and "water" or "money" and "account."

The formula is: Attention(Q, K, V) = softmax(QK^T / √d_k) V, where Q, K, and V stack the query, key, and value vectors as rows and d_k is the dimension of the key vectors. Dividing by √d_k keeps the dot products from growing too large as that dimension increases, which would otherwise push the softmax into regions with very small gradients.
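The formula translates almost line for line into code. This is a minimal sketch in plain Python, written for readability rather than speed; the small `Q`, `K`, and `V` matrices are made-up numbers, and the learned projections that would normally produce them from token embeddings are left out.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 3-token example with 2-dimensional vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention weights over all three tokens, and each row sums to one.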

Multi-head attention: learning to see in multiple ways

Rather than computing attention once, the Transformer computes it multiple times in parallel, each time with different learned projection matrices. These are called attention heads.

Each head learns to pay attention to different aspects of the sentence: one might focus on grammatical relationships, another on semantic similarity, another on positional patterns. The outputs from all heads are concatenated and projected back to the original dimension.

Why doesn't every head learn the same thing? Each head starts from different random weights, so each receives different gradient updates, and nothing in training pushes them to converge on the same pattern. In practice the heads tend to specialize, although some can end up partly redundant.
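The split, attend, concatenate, and project steps can be sketched as follows. Random matrices stand in for the learned projections here, which is an assumption made purely to keep the example self-contained; the function names are illustrative, not taken from any library.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))])
    return out


def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection matrix."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, num_heads, rng):
    d_model = len(x[0])
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own query, key, and value projections.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        head_outputs.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the heads' outputs token by token.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    # Project back to the original dimension.
    w_o = random_matrix(num_heads * d_head, d_model, rng)
    return matmul(concat, w_o)


# Hypothetical input: 3 tokens, each a 4-dimensional embedding.
x = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.7], [0.4, 0.4, 0.9, 0.1]]
out = multi_head_attention(x, num_heads=2, rng=random.Random(0))
print(len(out), len(out[0]))  # 3 4
```

The output has the same shape as the input, which is what lets these layers be stacked on top of each other.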

Positional encoding: teaching the model about order

Self-attention, as described so far, is order-agnostic. Shuffle the tokens and each one ends up with exactly the same new representation as before, so it doesn't matter whether "not" comes before or after "good." That's a problem for language.
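You can check the order-agnostic behavior directly. This sketch uses a stripped-down self-attention where the embeddings serve as queries, keys, and values all at once (a simplification; real models apply learned projections first), with made-up 2-dimensional embeddings in `EMB`.

```python
import math


def self_attention(x):
    """Self-attention with Q = K = V = x (learned projections omitted)."""
    d_k = len(x[0])
    outputs = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in x]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * v[j] for e, v in zip(exps, x)) for j in range(d_k)])
    return outputs


def same_vector(a, b):
    """Compare two vectors, allowing for floating-point rounding."""
    return all(math.isclose(p, q, abs_tol=1e-9) for p, q in zip(a, b))


# Hypothetical embeddings with no positional information.
EMB = {"not": [1.0, 0.0], "very": [0.5, 0.5], "good": [0.0, 1.0]}

order_a = ["not", "very", "good"]
order_b = ["good", "very", "not"]
out_a = dict(zip(order_a, self_attention([EMB[w] for w in order_a])))
out_b = dict(zip(order_b, self_attention([EMB[w] for w in order_b])))

# Each word gets the same output vector no matter where it sits.
unchanged = all(same_vector(out_a[w], out_b[w]) for w in EMB)
print(unchanged)  # True
```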

The Transformer solves this with positional encodings: vectors that represent each position in the sequence, added directly to the token embeddings before the attention computation. Now the model can distinguish "the dog bit the man" from "the man bit the dog," because the token embeddings carry positional information.
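One common choice, and the one used in the original 2017 paper, is a fixed pattern of sines and cosines at different frequencies. The sketch below assumes that scheme; many later models use learned or relative position encodings instead, and the helper names here are illustrative.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal position vectors: sine on even dims, cosine on odd dims."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency.
            angle = pos / (10000 ** ((i // 2 * 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add each position's vector to the token embedding at that position."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]


# The same hypothetical embedding placed at two different positions.
same_token_twice = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
with_positions = add_positions(same_token_twice)
print(with_positions[0] != with_positions[1])  # True
```

After the addition, identical tokens at different positions no longer look identical to the attention layers.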

The full architecture: encoder and decoder

The original Transformer, built for machine translation, has two halves.

The encoder processes the input text. Tokens are embedded, positional encodings are added, and then the sequence passes through a stack of encoder layers. Each layer has a self-attention component followed by a feed-forward network: a small neural network applied independently to each token, which further transforms each representation after attention has mixed in context. The output is a set of context-aware embeddings, one per input token.

The decoder generates the output text, one token at a time. It starts with a special beginning-of-sequence token and works autoregressively — each token it predicts becomes part of the input for predicting the next one.

The decoder has three sub-components per layer. First, causal self-attention: each generated token can attend to itself and the tokens generated before it, but not to future ones (which don't exist yet). Second, cross-attention: the decoder's current state is used as the query, and the encoder's output provides the keys and values. This is how the decoder "reads" the input sentence to decide what to generate next. Third, a feed-forward network.
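Causal self-attention is usually implemented with a mask: scores for future positions are set to negative infinity before the softmax, so they receive zero weight. A minimal sketch, assuming the raw scores have already been computed and that each position may attend to itself as well as earlier positions:

```python
import math


def causal_attention_weights(scores):
    """Row i of `scores` holds query i's raw scores against every key."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may look at positions 0..i; later ones are masked out.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical scores: every token finds every other equally relevant.
uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(uniform_scores):
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

The first token can only see itself, the second sees two tokens, and so on, which produces the lower-triangular pattern in the output.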

At the very end, a linear projection and softmax layer turn the final representation into a probability distribution over the vocabulary — the model's best guess at the next word. The process repeats until the model generates an end-of-sequence token.
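The generation loop itself is short. In this sketch, `toy_logits` is a hypothetical stand-in for the whole decoder stack and final linear projection, and picking the single most probable token at each step (greedy decoding) is an assumption; real systems often sample from the distribution instead.

```python
import math

# Hypothetical five-token vocabulary.
VOCAB = ["<bos>", "the", "bear", "reads", "<eos>"]


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(tokens):
    """Stand-in for a real decoder: favors the next token in VOCAB order."""
    last = VOCAB.index(tokens[-1])
    nxt = min(last + 1, len(VOCAB) - 1)
    return [5.0 if i == nxt else 0.0 for i in range(len(VOCAB))]


def greedy_decode(max_steps=10):
    tokens = ["<bos>"]  # start from the beginning-of-sequence token
    for _ in range(max_steps):
        probs = softmax(toy_logits(tokens))
        # Greedy choice: take the most probable token.
        best = max(range(len(VOCAB)), key=lambda i: probs[i])
        tokens.append(VOCAB[best])  # the prediction becomes part of the input
        if tokens[-1] == "<eos>":
            break
    return tokens


print(greedy_decode())  # ['<bos>', 'the', 'bear', 'reads', '<eos>']
```

The `max_steps` cap is a safety net in case the model never produces the end-of-sequence token.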

Why this mattered so much

The Transformer addressed three problems at once. First, it largely sidestepped the long-range dependency problem that vanishing gradients caused in RNNs, because any two positions are connected directly, no matter how far apart. Second, it enabled parallelization during training, since all tokens can be processed simultaneously rather than sequentially. Third, it produced context-sensitive representations as a natural consequence of the attention mechanism.

When the authors applied this architecture to machine translation in 2017, they got results that exceeded existing approaches while training faster. The rest, as they say, is history: GPT, BERT, and nearly every major language model since then are Transformer variants.

Understanding the Transformer doesn't require a PhD. It requires understanding one idea: tokens should be represented not in isolation, but in relation to everything around them. The whole architecture is an elaboration of that single insight.
