How AI Models Learn to Be Helpful, Harmless, and Honest

A Guide to Preference Tuning

Yann Paul


In the previous installment of this series, we walked through how large language models are pre-trained on trillions of tokens and then fine-tuned with supervised data to behave like assistants. But fine-tuning alone doesn't produce the polished, warm, safe AI assistant you interact with today. There's a third stage — and it's arguably the most interesting: preference tuning.

This article explains what preference tuning is, why it exists, how it works mechanically, and what tradeoffs the two dominant approaches — RLHF and DPO — involve.

Why Fine-Tuning Isn't Enough

After supervised fine-tuning, a model knows how to answer questions. It's no longer just predicting the next word of internet text; it's trying to be helpful. But being helpful and sounding the way you want an assistant to sound are different things.

Consider a simple example. A user asks: "Can I put my teddy bear in the washer?" A fine-tuned model might respond: "No, it might get damaged. Try hand washing it instead." That's factually correct. But it's blunt. It doesn't acknowledge that the person might love their teddy bear. It doesn't soften the news. A better answer might be: "It's better not to — your teddy bear could get hurt. A gentle hand wash is safer."

The facts are identical. The tone is completely different. And tone, friendliness, safety, and dozens of other qualities are exactly what preference tuning is designed to fix.

More formally: fine-tuning teaches the model what it should generate. It does not teach the model what it should not generate. Preference tuning injects this negative signal — the ability to learn from outputs that were bad, not just from outputs that were good.

The Data: Preference Pairs

Everything starts with data. To do preference tuning, you need a dataset of preference pairs: for a given prompt, you have two responses — one that's better, one that's worse — and a label indicating which is which.

There are three ways to collect this kind of data, and they differ in how much you ask annotators to do.

The simplest is a binary pairwise comparison: show a human rater two responses and ask which one is better. No scores required — just a preference. This is the most common approach because it's cognitively easier. It's far simpler for a human to say "this poem is better than that one" than to assign an accurate score of 0.9 versus 0.2 to each poem in isolation.

A more granular version asks raters to choose from a scale: much better, better, slightly better, about the same, slightly worse, worse, much worse. This provides richer signal but introduces noise, since people interpret those categories differently.

The third approach is listwise ranking: given a set of responses, order them from best to worst. This captures relative quality across multiple options at once, but it's harder to execute and harder to keep consistent across raters.

In practice, pairwise binary comparison is the standard. For each prompt, the model generates two responses (using a non-zero temperature to get diversity), a rater or automated system compares them, and you record which was preferred. You can also generate these pairs by taking a model's bad output, rewriting it, and creating a (bad, good) pair from scratch.
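As a minimal sketch of what one record in such a dataset might look like, the snippet below turns a single binary judgment into a (chosen, rejected) pair. The `PreferencePair` class and `build_pair` helper are hypothetical names used for illustration, not part of any particular library, and the field names are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference record: a prompt plus a better and a worse response."""
    prompt: str
    chosen: str    # the response the rater preferred
    rejected: str  # the response the rater did not prefer


def build_pair(prompt: str, response_a: str, response_b: str,
               a_is_better: bool) -> PreferencePair:
    """Turn one binary pairwise judgment into a (chosen, rejected) record."""
    if a_is_better:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


# The rater preferred the second (gentler) response.
pair = build_pair(
    "Can I put my teddy bear in the washer?",
    "No, it might get damaged. Try hand washing it instead.",
    "It's better not to, your teddy bear could get hurt. A gentle hand wash is safer.",
    a_is_better=False,
)
print(pair.chosen)
```

Storing the label implicitly, by position in `chosen` and `rejected`, is one design choice among several; a dataset could just as well keep both responses and a separate label field.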

One important nuance: human ratings are only as good as the guidelines given to raters. Preferences are subjective — some people like emojis in AI responses, others hate them — and if the rating guidelines are ambiguous, the resulting dataset will be noisy. Getting the guidelines right is a significant and often underappreciated part of building good preference data.

RLHF: Reinforcement Learning from Human Feedback

RLHF is the classic approach to preference tuning. It uses reinforcement learning — a framework from AI research in which an agent learns to act by receiving rewards for its behavior.

The translation to LLMs is fairly direct. The LLM is the agent. The state it's in is whatever text has been generated so far. The action it takes is predicting the next token. The policy — the probability distribution over what token to generate next — is just the output of the model's forward pass. And the reward is a score assigned to a complete response.
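The mapping above can be made concrete with a toy rollout. This is only a sketch: `toy_policy` and `toy_reward` are hypothetical stand-ins (a uniform distribution and a keyword check), where a real system would use the model's forward pass and a learned reward model.

```python
import random

VOCAB = ["yes", "no", "maybe", "<eos>"]


def toy_policy(state):
    """Hypothetical stand-in for a forward pass: P(next token | state)."""
    # Uniform here; a real model computes this from the tokens in `state`.
    return [1.0 / len(VOCAB)] * len(VOCAB)


def rollout(prompt_tokens, max_new_tokens=8, seed=0):
    """Generate one episode: repeatedly sample an action from the policy."""
    rng = random.Random(seed)
    state = list(prompt_tokens)            # state: all text so far
    for _ in range(max_new_tokens):
        probs = toy_policy(state)          # policy: distribution over tokens
        action = rng.choices(VOCAB, weights=probs, k=1)[0]  # action: next token
        state.append(action)
        if action == "<eos>":
            break
    return state


def toy_reward(response_tokens):
    """Hypothetical reward: a single score for the complete response."""
    return 1.0 if "yes" in response_tokens else 0.0


prompt = ["Is", "water", "wet", "?"]
episode = rollout(prompt)
reward = toy_reward(episode[len(prompt):])  # scored once, at the end
```

Note that the reward is computed once for the finished response, not per token, which matches the description above.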

RLHF runs in two stages.

Stage 1: Train a reward model. Using your preference pairs, you train a separate model whose job is to predict, for any given (prompt, response) pair, how good the response is. The model is trained pairwise: it learns to assign higher scores to the preferred response and lower scores to the rejected one. But crucially, the reward model is pointwise at inference time — you give it one response and it outputs one score. The pairwise structure is in the training objective, not in how you use the model.

The mathematical foundation here is the Bradley-Terry model, which gives a principled way to estimate the probability that one response is preferred over another, based on the difference in their rewards. The loss function that falls out of this is elegant: it's the negative log of the probability that the winning response scores higher than the losing one, taken in expectation over your dataset. This can be derived from first principles by maximizing the likelihood of the observed preference data.
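In symbols, the Bradley-Terry probability that the chosen response beats the rejected one is sigmoid(r_chosen - r_rejected), and the loss is the average of the negative log of that probability. The sketch below computes it on plain floats; the function names are illustrative, and a real implementation would operate on batched tensors and use a numerically stable log-sigmoid.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def reward_model_loss(reward_pairs) -> float:
    """Mean of -log P(chosen beats rejected) over (r_chosen, r_rejected) pairs."""
    losses = [-math.log(preference_probability(rc, rr)) for rc, rr in reward_pairs]
    return sum(losses) / len(losses)


# A reward model that ranks the pairs correctly gets a lower loss
# than one that ranks the same pairs the wrong way round.
well_ordered = reward_model_loss([(2.0, -1.0), (1.5, 0.0)])
mis_ordered = reward_model_loss([(-1.0, 2.0), (0.0, 1.5)])
print(well_ordered < mis_ordered)
```

Because only the difference between the two rewards enters the formula, adding the same constant to every score leaves the loss unchanged.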

The reward model can be any LLM — in practice, the standard approach is to take a decoder-only model and attach a classification head that outputs a scalar score for the full sequence. The scale of the scores doesn't matter for ranking, but does matter when rewards flow into the RL training loop.

Stage 2: Use the reward model to train the policy. Now you train the LLM — the "policy" in RL terms — to generate responses that earn high rewards. The algorithm typically used here is PPO (Proximal Policy Optimization).

The word "proximal" is key. It reflects the central tension of this stage: you want the model to improve its outputs, but you don't want it to change too dramatically. There are three reasons for this constraint.

First, your pre-trained and fine-tuned model already contains enormous knowledge. If you allow unrestricted optimization toward reward, you risk destroying that knowledge: the model can "forget" much of what it learned before.

Second, the reward model is imperfect. It was trained on human judgments, which are noisy, and it may not perfectly capture what you actually want. If you optimize aggressively for a flawed reward, you'll get a model that scores highly on the reward model but fails in practice. This is called reward hacking — the phenomenon where the model finds ways to game the metric rather than satisfy the underlying goal. A classic illustration: if a lecturer optimized purely for audience applause, they might tell more jokes to get louder claps — maximizing the metric while failing the actual objective of being informative.

Third, aggressive optimization creates training instability. Large policy updates are harder to keep stable than small, incremental ones.

The PPO loss function addresses this with two mechanisms. One is clipping: it limits how much a change in the probability ratio between the current policy and the previous iteration's policy can contribute to a single update. The other is a KL divergence penalty: a term that measures how far the current policy's distribution is from a reference distribution and penalizes large deviations. In the original PPO formulation these were alternative variants, with the KL measured against the previous iteration's policy. In modern LLM training, practitioners often combine both: clipping against the previous iteration, plus a KL penalty whose reference point is the frozen SFT model.
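A per-sample sketch of these two mechanisms is shown below. The defaults `epsilon=0.2` and `beta=0.1` are illustrative assumptions, not recommended settings, and the KL term is a rough single-sample estimate; many implementations fold the KL penalty into the reward instead of the loss.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, epsilon=0.2):
    """PPO clipped objective for one sampled action (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)  # current policy vs previous iteration
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Taking the minimum removes any incentive to push the ratio outside
    # [1 - epsilon, 1 + epsilon] in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_loss(logp_new, logp_old, logp_ref, advantage, epsilon=0.2, beta=0.1):
    """Negative clipped objective plus a KL-style penalty toward the reference."""
    kl_estimate = logp_new - logp_ref  # rough per-sample KL estimate (assumption)
    return -clipped_surrogate(logp_new, logp_old, advantage, epsilon) + beta * kl_estimate


# The ratio is 1.5, but with a positive advantage the objective is capped at 1.2.
capped = clipped_surrogate(math.log(0.6), math.log(0.4), advantage=1.0)
```

With a negative advantage the same ratio is not capped, so the update can still push down an action that became more likely despite being worse than expected.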

The PPO loop requires generating complete responses from the current policy, scoring them with the reward model, computing advantages (how much better or worse each completion was than expected), and updating the policy weights accordingly. Completions that earned above-average rewards get reinforced; completions that earned below-average rewards get discouraged — but neither update is allowed to be so large that it destabilizes training.
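To illustrate the "better or worse than expected" idea, the sketch below uses the batch mean reward as the expectation. This is a deliberate simplification and an assumption of this example: PPO as described here estimates the baseline with a learned value function rather than a batch average.

```python
def mean_baseline_advantages(rewards):
    """Advantage of each completion = its reward minus the batch average."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Four completions for one prompt, scored by a (hypothetical) reward model.
rewards = [0.9, 0.2, 0.4, 0.5]
advantages = mean_baseline_advantages(rewards)

for r, a in zip(rewards, advantages):
    direction = "reinforce" if a > 0 else "discourage" if a < 0 else "leave as is"
    print(f"reward={r:.1f} advantage={a:+.2f} -> {direction}")
```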

Challenges with PPO. This process is powerful but demanding. You need to keep multiple models in memory simultaneously: the policy being trained, the reference SFT model for the KL penalty, the reward model, and a value function for advantage estimation. That's four large models. The training is sensitive to hyperparameters — the KL penalty coefficient beta, the clipping range epsilon, and the parameters of the advantage estimator all matter. And because the model is generating its own training data at each iteration (this is called on-policy training), getting sufficient diversity in the completions is a constant concern.

Best-of-N: The Inference-Time Alternative

Some teams don't want to deal with the complexity of RL training at all. An alternative is best-of-N (also called BoN): rather than training the policy to generate better outputs, you generate many outputs at inference time, score them all with the reward model, and return the top-rated one.

The approach is simple. For a given prompt, generate N completions from the SFT model (using a positive temperature to ensure diversity), pass each through the reward model, and return the one with the highest score.
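The procedure can be sketched in a few lines. Here `generate` and `score` are caller-supplied functions standing in for the SFT model and the reward model; `toy_generate` and `toy_score` are hypothetical placeholders so the example runs on its own.

```python
import random


def best_of_n(prompt, generate, score, n=4):
    """Sample n completions, score each one, and return the top-rated one."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-ins for the SFT model and the reward model.
CANNED = [
    "No.",
    "Maybe, check the label first.",
    "It's better not to. A gentle hand wash is safer.",
]
rng = random.Random(0)


def toy_generate(prompt):
    return rng.choice(CANNED)        # stands in for sampling at temperature > 0


def toy_score(prompt, response):
    return float(len(response))      # toy rule: longer counts as better


answer, answer_score = best_of_n(
    "Can I put my teddy bear in the washer?", toy_generate, toy_score, n=8
)
```

Nothing about the policy changes here: all of the improvement comes from selection, which is why the cost is paid on every request.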

The main problem is cost. You've traded a one-time training expense for a per-query multiplier on your inference budget. Every user request now requires N model runs instead of one. For high-traffic deployments, this becomes prohibitive. Even with infinite compute, there's a latency problem: you have to wait for all N completions to finish before returning any answer, and the latency of the slowest completion determines your response time.
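The arithmetic is simple enough to write down. The sketch below assumes all N completions run in parallel and cost the same; the function name and the numbers are made up for illustration.

```python
def best_of_n_cost(per_completion_cost, completion_latencies):
    """Cost and latency of one best-of-N request, assuming parallel generation."""
    n = len(completion_latencies)
    total_cost = n * per_completion_cost   # N model runs instead of one
    latency = max(completion_latencies)    # must wait for the slowest completion
    return total_cost, latency


# Four completions at 1.0 cost unit each, with latencies in seconds.
total_cost, latency = best_of_n_cost(1.0, [1.2, 0.8, 3.5, 1.0])
```

Without parallel hardware the picture is worse, since the completions would run one after another and their latencies would add up instead.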

Best-of-N makes sense when you have a very limited inference load, when you're prototyping, or when training a full RL loop isn't feasible. But it's not a substitute for policy training at scale.

DPO: Direct Preference Optimization

The weight of maintaining four models, the training instability, the sensitivity to hyperparameters — all of this motivated researchers to ask: do we really need RL?

The answer, it turns out, is no. The DPO paper, cheekily subtitled "Your Language Model Is Secretly a Reward Model", showed that you can derive a training objective directly from preference pairs, without ever explicitly building or using a reward model.

The key insight was mathematical. Start with the same objective PPO optimizes: maximize rewards while staying close to the reference model, with the KL divergence as a penalty. Solve for what the optimal policy looks like analytically. Then invert that expression to write the reward in terms of the optimal policy and the reference model, and substitute it into the Bradley-Terry preference model.

When you do this algebra, the explicit reward disappears, and the one intractable piece, a normalization term that depends only on the prompt, cancels because it is the same for both responses. What remains is a loss function expressed entirely in terms of the policy: the probability the current model assigns to the winning completion, the probability the reference model assigns to the same winning completion, and the same two quantities for the losing completion. No reward model needed.
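The steps just described can be sketched in symbols. The notation here is an assumption of this sketch, not taken from the article: \pi_\theta is the policy being trained, \pi_ref the frozen reference model, r the reward, \beta the KL penalty coefficient, \sigma the logistic function, Z(x) a normalizer that depends only on the prompt x, and y_w, y_l the winning and losing completions.

```latex
\begin{align}
% 1. KL-penalized reward maximization (the objective shared with PPO)
\max_{\pi}\;& \mathbb{E}_{x \sim \mathcal{D}}\Big[
    \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
    - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big] \\
% 2. Closed-form optimal policy
\pi^{*}(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\big(r(x, y) / \beta\big) \\
% 3. Invert to express the reward through the policy
r(x, y) &= \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
    + \beta \log Z(x) \\
% 4. Bradley-Terry preference model: only reward differences matter
P(y_w \succ y_l \mid x) &= \sigma\big(r(x, y_w) - r(x, y_l)\big) \\
% 5. Substitute 3 into 4; the \beta \log Z(x) terms cancel
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[\log \sigma\!\left(
    \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
\end{align}
```

The cancellation in the last step works because Z(x) depends only on the prompt, so it contributes the same amount to both rewards and drops out of their difference.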

The DPO loss says: push the current model to assign relatively higher probability to the winning response than the reference model does, and relatively lower probability to the losing response — but don't deviate too far from the reference model in either direction. The beta hyperparameter controls how strongly you penalize distance from the reference.

In practice, this means you freeze the SFT model as your reference, keep a copy of the current weights being trained, and compute the loss directly on your preference pairs. Two models instead of four. Supervised training instead of on-policy RL. Much simpler.
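For a single preference pair, the loss can be sketched as below. The inputs are assumed to be summed log-probabilities of each full completion under the trained policy and the frozen reference; the function names and the default `beta=0.1` are illustrative, not prescribed values.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, from sequence log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# At the start of training the policy equals the reference: loss is log(2).
start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Only forward passes of the two models are needed to compute these log-probabilities, which is why no sampling loop and no separate reward model appear anywhere in the computation.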

The tradeoffs. DPO is not strictly better than PPO. The main challenge is distribution shift: DPO trains on preference pairs that were generated by some prior model (or rewritten by humans), but the model being trained may produce different completions than those in the training data. The RL approach sidesteps this because it generates its own completions at each iteration and gets rewards for those specific completions. DPO doesn't have that self-correction mechanism.

In head-to-head comparisons, well-tuned PPO is often reported to outperform DPO, though the gap depends on the task and on how much tuning effort goes into each. DPO, for its part, is significantly easier to implement, cheaper to run, and more stable to train. The practical choice depends on your compute budget, your RL expertise, and how much performance you're willing to leave on the table in exchange for simplicity.

The Full Alignment Picture

Put it all together and you have the modern LLM training pipeline:

Pre-training teaches the model what language and code are, using trillions of tokens and next-token prediction as the objective. Supervised fine-tuning teaches the model to be a helpful assistant, using thousands to millions of high-quality (instruction, response) pairs. And preference tuning — whether via RLHF/PPO or DPO — teaches the model to prefer responses that humans actually prefer: friendlier in tone, safer in content, more accurate in nuance.

Each stage adds something the previous one couldn't. Pre-training provides the knowledge base. Fine-tuning provides the behavioral template. Preference tuning provides the alignment between what the model generates and what humans actually want.

The result, when done well, is a model that not only knows things and knows how to answer, but has also internalized a sense of how to respond in a way that feels genuinely helpful.
