How LLMs Are Actually Trained

From Raw Data to a Useful AI Assistant

Fiona Jake

Content Designer

Most people interact with AI assistants every day without thinking about how they came to exist. You type a question, you get an answer. But behind that interaction lies a training process that is staggeringly complex, expensive, and surprisingly nuanced. This article walks through it from start to finish.

The Old Way: One Model Per Task

Not long ago, the standard approach to machine learning was straightforward: pick a task, gather labeled data for that task, train a model, ship it. Spam detection? Train a spam classifier. Sentiment analysis? Train a sentiment model. Each task got its own model, trained from scratch.

This works, but it wastes something valuable. All language tasks share a common foundation — understanding how words relate to each other, how sentences are structured, what concepts mean. Training a new model from scratch for every task ignores everything learned from tasks before it.

The solution is called transfer learning: first build a model that understands language in general, then adapt it to whatever specific task you care about. This is the paradigm that modern LLMs are built on.

Stage 1: Pre-Training — Teaching a Model What Language Is

Pre-training is the first and most expensive stage. The goal is simple: train a model to predict the next token on an enormous amount of text, at a scale that is hard to grasp. GPT-3 was trained on roughly 300 billion tokens. Llama 3 was trained on about 15 trillion. Common Crawl alone scrapes roughly three billion web pages per month. The training data includes web text, Wikipedia, Reddit, GitHub, Stack Overflow, and books: a large share of the text that is publicly available in digital form.

The model being trained is typically a decoder-only Transformer. It takes text as input and tries to predict what comes next, again and again, across hundreds of billions to trillions of tokens. There's no labeled data in the traditional sense. The text itself is the label, since each token is the correct "answer" for the context that precedes it.
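To make the "text is its own label" idea concrete, here is a minimal sketch of how one sentence turns into training examples. It assumes whitespace-separated words as stand-ins for tokens (real models use subword tokenizers), and the function name next_token_examples is made up for illustration.

```python
# Toy illustration: whitespace "tokens" stand in for a real subword tokenizer.
def next_token_examples(tokens):
    """Return (context, target) pairs: each token is the label for the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat on the mat".split()
examples = next_token_examples(tokens)

for context, target in examples:
    print(" ".join(context), "->", target)
```

A six-token sentence yields five training examples without anyone writing a single label by hand, which is why raw text scales so well as training data.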

Two units of measurement come up constantly when discussing pre-training. The first is FLOPs (floating point operations) — a measure of total compute required. Training a large LLM runs to around 10^25 FLOPs. As a rough approximation, this scales with the product of the number of model parameters and the number of training tokens. The second is FLOPS (floating point operations per second) — a measure of how fast a given piece of hardware can work. This distinction matters because the two are often confused in papers.
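The two units can be tied together with a back-of-the-envelope calculation. The sketch below assumes the commonly used rule of thumb of about 6 FLOPs per parameter per training token (the article only says compute scales with the product, so the constant is an assumption), and the cluster size, per-GPU throughput, and utilization are hypothetical numbers chosen only for illustration.

```python
def training_flops(n_params, n_tokens, flops_per_param_token=6):
    """Total training compute (FLOPs), assuming ~6 FLOPs per parameter per token."""
    return flops_per_param_token * n_params * n_tokens


def training_days(total_flops, n_gpus, flops_per_gpu_per_s, utilization):
    """Wall-clock days given hardware speed (FLOPS) and the fraction of it actually achieved."""
    seconds = total_flops / (n_gpus * flops_per_gpu_per_s * utilization)
    return seconds / 86_400


# Parameter and token counts for GPT-3 as quoted in this article.
gpt3 = training_flops(175e9, 300e9)

# Hypothetical cluster: 1,024 GPUs, 4e14 FLOPS each, 40% utilization.
days = training_days(gpt3, n_gpus=1024, flops_per_gpu_per_s=4e14, utilization=0.4)

print(f"total compute: {gpt3:.2e} FLOPs")
print(f"estimated wall-clock time: {days:.1f} days")
```

FLOPs is the size of the job, FLOPS is the speed of the machine, and dividing one by the other gives time.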

Scaling laws and the Chinchilla finding

A 2020 paper established something now called "scaling laws": more compute, more data, and more parameters all independently improve model performance on next-token prediction. This led to years of companies simply building bigger and bigger models.

But a follow-up question emerged: given a fixed compute budget, what's the optimal split between model size and dataset size? The answer, from a paper nicknamed "Chinchilla," was that the two should scale together: roughly 20 training tokens per model parameter is optimal. By this metric, GPT-3 (175 billion parameters, 300 billion tokens, or about 1.7 tokens per parameter) was undertrained. The compute was there, but the data wasn't big enough relative to the model size.
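The Chinchilla ratio is easy to turn into arithmetic. The sketch below uses the 20-tokens-per-parameter figure from the text and, as an added assumption, the rule of thumb that training costs about 6 FLOPs per parameter per token. Under those two assumptions, a fixed budget C implies N = sqrt(C / 120) parameters.

```python
import math

TOKENS_PER_PARAM = 20        # rough Chinchilla ratio quoted in the article
FLOPS_PER_PARAM_TOKEN = 6    # assumption: common rule-of-thumb constant


def compute_optimal(budget_flops):
    """Split a compute budget so that tokens = 20 * params.

    budget = 6 * N * D and D = 20 * N  =>  N = sqrt(budget / 120)
    """
    n_params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


gpt3_params, gpt3_tokens = 175e9, 300e9
gpt3_ratio = gpt3_tokens / gpt3_params
gpt3_budget = FLOPS_PER_PARAM_TOKEN * gpt3_params * gpt3_tokens

opt_params, opt_tokens = compute_optimal(gpt3_budget)

print(f"GPT-3 tokens per parameter: {gpt3_ratio:.2f}")
print(f"same budget, Chinchilla-style: {opt_params:.2e} params, {opt_tokens:.2e} tokens")
```

Under these assumptions, the same budget would be better spent on a model several times smaller than GPT-3, trained on several times more text.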

The challenges of pre-training

Pre-training is expensive. Not "expensive" as in a few thousand dollars — expensive as in tens to hundreds of millions of dollars per run. It's environmentally costly, and the ecological impact has started to appear in papers alongside the compute numbers.

Pre-training also creates a knowledge cutoff. A model can only know what was in its training data. Events that happened after the cutoff date simply don't exist to the model. This is why model cards typically list a knowledge cutoff date; for GPT-5, that was reportedly September 30, 2024.

There's also the risk of memorization. A model trained on the entire internet has likely seen the same text repeated many times, and there's always a chance it reproduces that text rather than generating something new.

Making Pre-Training Possible: Distributed Training

Here's the engineering problem: a large LLM might have hundreds of billions of parameters. Training it requires storing not just the weights themselves but also activations (intermediate values computed during the forward pass), gradients (computed during the backward pass), and optimizer states like the running averages maintained by the Adam optimizer. A high-end GPU like an H100 has 80 GB of memory. That's not enough for a large model.
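A rough estimate shows how quickly the 80 GB runs out. The sketch below assumes a common mixed-precision Adam setup (16-bit weights and gradients, plus 32-bit master weights and two 32-bit Adam statistics, for 16 bytes per parameter). That byte breakdown is an assumption about the training recipe, activations are left out entirely, and the 175-billion-parameter model size is just an example.

```python
import math

# Assumed bytes per parameter for mixed-precision training with Adam.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

H100_GB = 80  # memory of one H100, as quoted in the article


def training_state_gb(n_params):
    """Memory for weights, gradients, and optimizer states (activations excluded)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


state_gb = training_state_gb(175e9)
min_gpus = math.ceil(state_gb / H100_GB)

print(f"training state: {state_gb:.0f} GB")
print(f"GPUs needed just to hold it: {min_gpus}")
```

Even before counting activations, the state for a model of this size spans dozens of GPUs, so splitting the work is a requirement rather than an optimization.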

The answer is to train across many GPUs simultaneously. There are two broad families of techniques.

Data parallelism divides the training data across GPUs, giving each one a copy of the model but a different batch of data. Gradients are averaged across devices before updating the weights. The downside: you still need to fit an entire model on each GPU, and the inter-GPU communication adds overhead.
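The gradient-averaging step can be sketched in a few lines. This toy version uses a one-parameter model y = w * x and plain Python lists in place of GPUs. The function names are illustrative, and the "all-reduce" here is just an average over a list.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, shards, lr):
    # Each "device" holds the same w and computes a gradient on its own shard.
    local_grads = [grad_mse(w, shard) for shard in shards]
    # All-reduce: average the gradients so every replica applies the same update.
    avg_grad = sum(local_grads) / len(local_grads)
    return w - lr * avg_grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]  # two equal-sized shards, one per "device"

w_parallel = data_parallel_step(0.0, shards, lr=0.01)
w_single = 0.0 - 0.01 * grad_mse(0.0, data)

print(w_parallel, w_single)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so the replicas stay in sync while each one only processes a fraction of the data.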

A refinement called ZeRO (Zero Redundancy Optimizer) eliminates the duplication. Instead of keeping full copies of parameters, gradients, and optimizer states on every GPU, ZeRO partitions these across GPUs. This drastically cuts memory per device, at the cost of more communication between devices. The variants differ in how aggressively they partition: ZeRO-1 partitions only the optimizer states, ZeRO-2 also partitions the gradients, and ZeRO-3 partitions the parameters as well. Teams choose based on their specific memory and speed constraints.
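The memory effect of each ZeRO stage can be estimated with simple arithmetic. The sketch below assumes 2 bytes per parameter for weights, 2 for gradients, and 12 for Adam optimizer states (a common mixed-precision accounting, stated here as an assumption). The 7.5-billion-parameter model and 64 GPUs are example values, and activations are ignored.

```python
def zero_gb_per_gpu(n_params, n_gpus, stage):
    """Approximate per-GPU memory (GB) for model state under a given ZeRO stage.

    stage 0 = plain data parallelism (everything replicated).
    """
    weights = 2 * n_params   # bytes, 16-bit weights
    grads = 2 * n_params     # bytes, 16-bit gradients
    optim = 12 * n_params    # bytes, 32-bit master weights + Adam momentum and variance
    if stage >= 1:
        optim /= n_gpus      # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= n_gpus      # ZeRO-2: also partition gradients
    if stage >= 3:
        weights /= n_gpus    # ZeRO-3: also partition parameters
    return (weights + grads + optim) / 1e9


per_stage = {stage: zero_gb_per_gpu(7.5e9, 64, stage) for stage in range(4)}
for stage, gb in per_stage.items():
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Optimizer states dominate the total, which is why even the first stage removes most of the redundancy.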

Model parallelism goes further, splitting the model itself across devices rather than just the data. Expert parallelism routes different inputs to different GPUs when using mixture-of-experts architectures. Tensor parallelism splits large matrix multiplications across devices. Pipeline parallelism assigns different layers of the model to different GPUs. None of these are magic — they all introduce communication overhead — but they make it possible to train models that wouldn't fit on any single device.
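Of the three, tensor parallelism is the easiest to show in miniature. The sketch below splits a weight matrix by columns across simulated "devices", lets each compute its slice of the output, and concatenates the results. It uses nested lists instead of real GPU tensors, assumes the column count divides evenly by the number of devices, and all function names are illustrative.

```python
def matmul(a, b):
    """Plain matrix multiply on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split matrix m into `parts` equal column blocks (assumes even divisibility)."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def tensor_parallel_matmul(x, w, n_devices):
    shards = split_columns(w, n_devices)        # each device holds one column slice of W
    partials = [matmul(x, s) for s in shards]   # computed independently, no communication
    # All-gather: concatenate each device's output columns back into full rows.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]

full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, n_devices=2)
print(full)
print(sharded)
```

No device ever holds the full weight matrix, and the only communication is the final gather. That gather is the overhead the paragraph above refers to.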

Flash Attention: Making the Math Faster

Even with many GPUs, the attention computation itself is a bottleneck. The standard approach computes attention by repeatedly reading and writing large matrices to the GPU's main memory (called HBM — high bandwidth memory). HBM is large but relatively slow. The GPU also has a smaller, much faster on-chip memory called SRAM, but it's tiny by comparison.

Flash Attention, developed at Stanford in 2022, rethinks how the attention computation maps onto this hardware. Instead of computing the full attention matrix at once and shuttling it in and out of HBM multiple times, Flash Attention tiles the computation: it takes small blocks of the query, key, and value matrices, loads them into the fast SRAM, performs the complete local computation, and accumulates the results incrementally.

The key mathematical insight is that the softmax over an entire row can be computed incrementally across tiles — you don't need the whole row in memory at once if you track a scaling factor that adjusts for what you've seen so far.
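The incremental softmax trick can be shown for a single query row. The sketch below is a simplified illustration of the idea, not the actual Flash Attention kernel: it uses scalar values instead of vectors, runs on the CPU, and compares a tiled pass against the straightforward computation.

```python
import math


def attention_row_reference(scores, values):
    """Softmax-weighted average computed with the whole row in memory."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_tiled(scores, values, block_size):
    """Same result, but visiting the row one block at a time."""
    running_max = float("-inf")
    denom = 0.0   # running softmax denominator
    acc = 0.0     # running weighted sum of values
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Scaling factor that corrects everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in s_blk]
        denom = denom * scale + sum(exps)
        acc = acc * scale + sum(e * v for e, v in zip(exps, v_blk))
        running_max = new_max
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

reference = attention_row_reference(scores, values)
tiled = attention_row_tiled(scores, values, block_size=2)
print(reference, tiled)
```

The tiled version only ever holds one block plus three running numbers, which is what lets the real kernel keep its working set in SRAM.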

The result is dramatic. Memory reads and writes to HBM drop by roughly 10x. Counter-intuitively, although Flash Attention technically performs more operations (because it recomputes some intermediate values rather than caching them), the overall runtime is faster — because the bottleneck was memory bandwidth, not raw compute. Flash Attention 2 and 3 extend these ideas to newer GPU architectures. Today it's considered a standard component rather than an optimization.

Quantization and Mixed Precision: Using Memory More Efficiently

Every parameter in a neural network is stored as a floating-point number. The default is 32-bit precision (FP32), which gives high numerical accuracy but costs a lot of memory. A natural question: do you really need that much precision?

The answer, for most operations, is no. Modern LLM training uses mixed precision: keep a master copy of the model's weights in FP32, but perform the forward and backward pass in a 16-bit format (FP16 or, increasingly, BF16). Weight updates are still applied in FP32 to avoid error accumulation. The logic is that individual training steps don't need full 32-bit precision; what matters is getting the gradient direction right. But the weights themselves accumulate small updates over millions of steps, so they benefit from higher precision.
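The need for a higher-precision copy of the weights shows up in a small experiment. The sketch below uses Python's struct module to round numbers to FP16 and FP32, then applies the same small update a thousand times. The starting weight, update size, and step count are arbitrary values chosen for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


update = to_fp16(1e-4)  # a small gradient step, computed in 16-bit
steps = 1000

w16 = to_fp16(1.0)  # weight stored in FP16
w32 = to_fp32(1.0)  # master weight stored in FP32

for _ in range(steps):
    w16 = to_fp16(w16 + update)  # update is smaller than FP16 spacing near 1.0, so it is lost
    w32 = to_fp32(w32 + update)  # FP32 has enough resolution to accumulate it

print("FP16 weight after training:", w16)
print("FP32 weight after training:", w32)
```

The FP16 weight never moves, because each update is smaller than the gap between adjacent FP16 values near 1.0. The FP32 copy accumulates the same updates without trouble.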

Beyond mixed precision, quantization takes the idea further: compress the weights to 8-bit (INT8) or even 4-bit formats. This reduces memory dramatically but requires care. One common approach is NF4 (Normal Float 4), which assumes model weights are normally distributed and splits the representable range into equal-probability quantile buckets rather than fixed-size buckets. This makes better use of the available bits.
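The quantile idea can be sketched with the standard library alone. The code below is a simplified illustration and not the actual NF4 codebook (the real format, among other details, reserves an exact zero). It places 16 levels at the midpoints of equal-probability slices of a standard normal, rescales them to [-1, 1], and stores each weight as the index of its nearest level. The function names and sample weights are made up.

```python
from statistics import NormalDist


def quantile_levels(bits=4):
    """Levels at equal-probability quantiles of a standard normal, rescaled to [-1, 1]."""
    n = 2 ** bits
    qs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]  # bucket midpoints
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]


def quantize(weights, levels):
    """Map each weight to the index of its nearest level after absmax scaling."""
    scale = max(abs(w) for w in weights)
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
        for w in weights
    ]
    return codes, scale


def dequantize(codes, scale, levels):
    return [levels[c] * scale for c in codes]


levels = quantile_levels(4)
weights = [0.02, -0.5, 0.31, 1.2, -0.07]  # hypothetical block of weights
codes, scale = quantize(weights, levels)
restored = dequantize(codes, scale, levels)

print("codes:", codes)
print("restored:", [round(r, 3) for r in restored])
```

The levels are packed more densely near zero, where most normally distributed weights fall, and more sparsely in the tails. That is the sense in which the available bits are used better than with evenly spaced buckets.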

Quantization matters most for inference — deploying a model — where you want to run the model as cheaply as possible on as little hardware as possible.

Stage 2: Fine-Tuning — Teaching the Model to Be Helpful

After pre-training, you have a model that knows an enormous amount about language. Ask it a question and it will respond by continuing the text in a way that sounds like something it might have seen. That's not the same as being helpful.

Here's a concrete example. If you ask a pre-trained base model "Can I put my teddy bear in the washer?" it might respond by continuing the sentence with something plausible: information about teddy bear materials, washing instructions copied from a generic source, or even another question. It's not trying to help you. It's trying to predict what comes next.

To turn that language model into an assistant, you need supervised fine-tuning (SFT) — also called instruction tuning.

Instruction tuning trains the model on examples of (instruction, response) pairs. The model sees a user's question as fixed input and is trained to predict a helpful, accurate response. The loss is only applied to the response, not the instruction — you're not training the model to predict what the user wrote, you're training it to respond well.
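Applying the loss only to the response amounts to a mask over token positions. The sketch below assumes we already have the probability the model assigned to each correct token. Those probabilities and the mask are made-up values for illustration, and sft_loss is a hypothetical helper name.

```python
import math


def sft_loss(token_probs, is_response):
    """Average negative log-likelihood over response tokens only."""
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    return sum(losses) / len(losses)


# Hypothetical probabilities the model assigned to each correct next token.
# The first three positions belong to the instruction, the last three to the response.
token_probs = [0.1, 0.2, 0.05, 0.9, 0.8, 0.7]
is_response = [False, False, False, True, True, True]

loss = sft_loss(token_probs, is_response)
full_loss = sft_loss(token_probs, [True] * len(token_probs))

print(f"loss on response only: {loss:.4f}")
print(f"loss if instruction tokens were included: {full_loss:.4f}")
```

Without the mask, the hard-to-predict instruction tokens would dominate the loss and push the model toward imitating users rather than answering them.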

The data required for SFT is tiny compared to pre-training, but the quality bar is much higher. GPT-3's instruction tuning used roughly 13,000 examples. Llama 3 used around 10 million. Even at 10 million examples of roughly 1,000 tokens each, that's 10 billion tokens, about three orders of magnitude less than the 15 trillion used in pre-training. The dataset has to be carefully curated: accurate, diverse, safe, representative of the kinds of questions users will actually ask.

Modern SFT datasets combine multiple categories: question answering, coding, math, summarization, story generation, safety-critical refusals, and hedged responses for ambiguous queries. The goal is a model that generalizes across all of them.

LoRA: Fine-Tuning Without Breaking the Bank

Full fine-tuning — updating every weight in the model — is expensive. For a model with tens of billions of parameters, even a small fine-tuning run can cost significant money and time.

LoRA (Low-Rank Adaptation) is a now-standard technique for doing fine-tuning efficiently. The idea is to not modify the pre-trained weights directly. Instead, for each weight matrix W, you add a low-rank decomposition: a product of two small matrices, B and A, where the bottleneck dimension (rank R) is kept very small — typically 4, 8, or 16.

The pre-trained weights W0 are frozen. Only B and A are trained. The effective weight used during fine-tuning is W0 + BA. Because R is tiny compared to the full matrix dimensions (which are hundreds or thousands), the number of trainable parameters drops by orders of magnitude.

After fine-tuning, the BA matrices can be merged back into W0 with simple addition, so inference costs nothing extra.
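Both the parameter savings and the merge step can be checked with small numbers. The sketch below assumes a hypothetical 4096 x 4096 weight matrix with rank 8 for the parameter count, and a tiny 3 x 3 example with rank 1 for the merge. Matrices are nested lists and the helper names are illustrative.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full matrix versus the two low-rank factors B and A."""
    return d_out * d_in, rank * (d_out + d_in)


full, lora = lora_param_counts(4096, 4096, 8)  # hypothetical layer size and rank
print(f"full: {full:,}  LoRA: {lora:,}  reduction: {full // lora}x")

# Tiny merge demo with rank 1: B is 3x1, A is 1x3.
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weights
B = [[0.5], [0.0], [-0.5]]
A = [[1.0, 2.0, 0.0]]
x = [[1.0], [1.0], [1.0]]

y_adapter = add(matmul(W0, x), matmul(B, matmul(A, x)))  # during fine-tuning: W0 x + B(A x)
merged = add(W0, matmul(B, A))                           # after fine-tuning: W0 + BA
y_merged = matmul(merged, x)

print(y_adapter, y_merged)
```

The merged matrix has the same shape as W0 and gives the same output, so the deployed model carries no extra matrices.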

A few practical notes from research: LoRA works best with higher learning rates than full fine-tuning — roughly 10x higher. It also doesn't perform as well with large batch sizes. The best place to apply LoRA was originally assumed to be the attention matrices, but more recent work suggests the feed-forward blocks are where it has the most impact.

QLoRA (Quantized LoRA) pushes this further: the frozen weights W0 are quantized to 4-bit NF4 format, while the small B and A matrices remain in higher precision (16-bit in the original QLoRA work). This achieves roughly 16x memory savings compared to standard full fine-tuning, making it possible to fine-tune large models on consumer-grade hardware.

Evaluating LLMs: A Hard Problem

Once you have a fine-tuned model, how do you know if it's actually good?

The field has developed a variety of benchmarks. MMLU (Massive Multitask Language Understanding) covers 57 subjects, most of them academic. GSM8K tests grade-school math reasoning. HumanEval tests code generation. These benchmarks produce clean numbers, which makes them useful for comparing models.

But numbers can be misleading. Models are often trained on data that resembles the benchmark tasks, making scores look impressive even when the underlying capability hasn't improved. A model might score at the top of every benchmark and still feel unhelpful to real users.

Human preference rankings address this differently. Chatbot Arena, for example, presents users with responses from two anonymous models and asks which is better. Over many such comparisons, a ranking emerges. This captures something real (how users actually feel about the models), but it has its own failure modes. Early comparisons can anchor a model's ranking in ways that persist. The pool of people voting may not represent actual users. And someone wanting to game the leaderboard could do so by working out which anonymous model produced which response and then voting strategically.
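One simple way a ranking can emerge from pairwise votes is an Elo-style update, sketched below. Treat this as an illustrative assumption rather than a description of how any particular leaderboard computes its scores. The model names, starting rating of 1000, K-factor of 32, and the votes themselves are all made up.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))


def update(ratings, winner, loser, k=32):
    """Shift rating points from loser to winner; surprising results move more points."""
    gain = k * (1 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}

# Each vote is (preferred model, other model) from one anonymous comparison.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
for winner, loser in votes:
    update(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
print({name: round(r, 1) for name, r in ratings.items()})
```

Because each update depends on the ratings at the time of the vote, the order of the votes affects the final numbers. This is one way early comparisons can have a lasting effect.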

The uncomfortable reality is that model evaluation is genuinely hard. No single number captures whether a model is good for your use case. The right approach is to combine automated benchmarks, human evaluations, and task-specific testing — and understand what each one measures and what it misses.

The Full Picture

LLM training involves at least three distinct stages: pre-training on trillions of tokens to build general language understanding, supervised fine-tuning on thousands to millions of carefully curated examples to develop helpfulness, and preference tuning (not covered in this article) to align the model's outputs with what users actually want. The combination of fine-tuning and preference tuning is what researchers call alignment.

A fourth stage — "mid-training" — has emerged recently, inserted between pre-training and fine-tuning. It uses the same next-token prediction objective as pre-training but applies it to data specifically curated for the target domain. It's still early, but it reflects a growing recognition that the transition from raw pre-training to fine-tuning benefits from an intermediate step.

Each stage has its own costs, its own data requirements, and its own tradeoffs. Understanding how they fit together is essential for anyone building with, evaluating, or thinking critically about modern AI systems.
