Direct Preference Optimization
We're excited to announce that the Together Fine-Tuning Platform now supports Direct Preference Optimization (DPO)! This technique allows developers to align language models with human preferences, creating more helpful, accurate, and tailored AI assistants. In this deep-dive blog post, we explain what DPO is, how it works, and when to use it, with code examples along the way. If you'd like to jump straight into code, have a look at our code notebook.
Tuning LLMs on Preference Data
Modern language model development typically follows a three-stage process:
- Pre-training on internet-scale data to build a foundation model with broad knowledge
- Supervised fine-tuning (SFT) on specific high-quality examples to adapt a model to a particular knowledge domain or task
- Preference-based learning to refine the model based on human preferences

This final stage, preference learning, is where DPO comes in as an alternative to Reinforcement Learning from Human Feedback (RLHF). It ensures that models not only perform tasks correctly but do so in ways that users prefer. It also lets you teach the model the nuances of a particular use case by showing examples of what is expected and what the model should avoid. Typical business use cases for DPO include improving:
- Helpfulness
- Tone
- Truthfulness
- Harmlessness
- Instruction-following
Preference tuning shapes the model's generation quality and alignment with human and business values.
What is Direct Preference Optimization?
DPO is a method for aligning language models with human preferences without using reinforcement learning (RL). Unlike traditional approaches, DPO allows you to train language models directly on preference data consisting of:
- A prompt or instruction
- A preferred (chosen) response
- An unpreferred (rejected) response
For example, you might have a dataset entry like this:
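Here is an illustrative entry expressed as a Python dict; the field names (input, chosen, rejected) are assumptions for illustration rather than a prescribed schema, so check the platform's data-format documentation for the exact structure it expects.

```python
# An illustrative preference record: one prompt, a preferred response,
# and a rejected response. Field names are assumptions for illustration.
preference_example = {
    "input": "Explain what binary search does.",
    "chosen": (
        "Binary search finds an item in a sorted list by repeatedly halving "
        "the search range: compare the target with the middle element and "
        "discard the half that cannot contain it."
    ),
    "rejected": "It just looks through the list really fast.",
}
```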
DPO adjusts the model weights to increase the probability of generating responses like the preferred one while decreasing the probability of generating responses like the rejected one. It essentially teaches the model to distinguish between better and worse responses for a given prompt.
To understand how DPO works, let's use a cooking analogy. Imagine you're a chef with a cookbook (your initial model). The DPO loss function teaches you to adjust your recipes in a specific way: increase the likelihood of making dishes customers loved compared to your original recipes, and decrease the likelihood of making dishes customers disliked compared to how you made them initially. Rather than simply maximizing preferred dishes and minimizing disliked ones in absolute terms, this relative approach ensures you don't completely abandon your fundamental cooking techniques while developing improvements.
The β parameter acts like a constraint controlling how much you can experiment: a higher value forces you to stay closer to your original recipes, while a lower value allows for more significant departures from your cookbook. You're not learning from a more experienced chef; instead, you're using your initial recipes as a reference point to prevent you from drifting too far from what you already know how to cook.
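To make this concrete, here is a minimal PyTorch sketch of the standard DPO loss on a batch of preference pairs. It assumes you have already summed token log-probabilities for each response under both the policy and a frozen reference model, and it illustrates the published DPO objective rather than the platform's internal implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Standard DPO loss for a batch of (chosen, rejected) preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for the
    chosen or rejected responses under the policy or the frozen reference model.
    """
    # How much more (or less) the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # beta scales the margin: higher values keep the policy closer to the
    # reference model, lower values permit more aggressive updates.
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()

    # Implicit "rewards", handy for monitoring preference accuracy.
    chosen_rewards = beta * chosen_logratio.detach()
    rejected_rewards = beta * rejected_logratio.detach()
    return loss, chosen_rewards, rejected_rewards
```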
DPO vs. RLHF
Companies have traditionally used RLHF to align language models with human preferences. While both DPO and RLHF aim to achieve the same goal, they take fundamentally different approaches:

RLHF: The Traditional Approach
RLHF is a complex, multi-stage process:
- Train a reward model on human preference data
- Use reinforcement learning (typically PPO) to optimize the policy model against the reward model
- Generate online samples during training, slowing down the process
- Manage the complexity of RL training, including value functions, policy gradients, and hyperparameter tuning
DPO: The Direct Approach
DPO simplifies this process dramatically:
- Train directly on preference data without an intermediate reward model
- Use a simple loss function that maximizes the margin between preferred and rejected responses
- No need for online sampling during training, making it more efficient
- Simpler implementation and hyperparameter tuning
Going back to our cookbook analogy, RLHF would be like hiring a food critic (the reward model) to taste and rate your dishes, then repeatedly cooking variations and adjusting your recipes based on the critic's ongoing feedback - a more complex process that requires first training the critic to recognize good food before you can improve your cooking. DPO, by contrast, uses customer preferences directly to modify your existing recipes, with no third-party critic guiding your experimentation through ongoing taste tests and ratings.
Previous work has shown that DPO can match or exceed the performance of RLHF in many cases while being simpler to implement and more computationally efficient.
Stacking Methods: Combining SFT with Preference Fine-tuning
While DPO is powerful on its own, combining it with Supervised Fine-Tuning (SFT) creates an even more effective training pipeline. The recommended approach is to:
- First perform SFT on your dataset to teach the model the basic task structure and response format
- Then refine with DPO by continuing fine-tuning from your SFT checkpoint
This two-stage approach provides several key advantages:
- Better initial starting point: SFT helps the model understand the basic format and content of desired responses
- Significant quality improvements: SFT is particularly important when your training data differs from what the base model observed during pre-training
- More effective preference learning: A model already familiar with the task domain can better learn subtle preferences
- Faster convergence: The DPO phase can focus on preference refinement rather than learning task basics
To implement this approach, you can first concatenate the context with the preferred output for SFT, and then follow up with preference fine-tuning using pairs of preferred and non-preferred outputs.
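As a rough sketch of that data flow (using the illustrative record format from earlier, which is an assumption rather than a prescribed schema), each preference record can feed both stages:

```python
def to_sft_example(record: dict) -> dict:
    # Stage 1 (SFT): train on the prompt concatenated with the preferred output.
    return {"text": record["input"] + "\n" + record["chosen"]}

def to_dpo_example(record: dict) -> dict:
    # Stage 2 (DPO): keep the full preference pair, and continue fine-tuning
    # from the SFT checkpoint produced in stage 1.
    return {
        "input": record["input"],
        "chosen": record["chosen"],
        "rejected": record["rejected"],
    }

# A minimal record in the illustrative format shown earlier.
preference_dataset = [
    {
        "input": "Explain what binary search does.",
        "chosen": "Binary search halves a sorted list's search range each step.",
        "rejected": "It just looks through the list really fast.",
    }
]
sft_dataset = [to_sft_example(r) for r in preference_dataset]
dpo_dataset = [to_dpo_example(r) for r in preference_dataset]
```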
This stacked approach, of SFT + DPO, yields superior results compared to using either method alone. The SFT phase ensures the model has the fundamental capabilities, while the DPO phase refines how those capabilities are expressed. Read our technical deep-dive on continual fine-tuning.
When to Use Direct Preference Optimization
DPO is ideal in scenarios where you have preference data that captures the nuances of what makes a good response better than an alternative. Here are some key situations where DPO shines:
1. When Prompting Isn't Sufficient
While prompts can guide model behavior, they take up tokens, incur costs with each call, and may be ignored. DPO provides a more robust solution by directly encoding preferences into the model weights.
2. When Humans Can Compare Better Than Create
It's often easier for humans to judge which of two responses is better than to craft the perfect response from scratch. DPO leverages this fact, making data collection more efficient.
3. When Making Controlled Improvements to Existing Models
DPO, especially with tuned β values, allows for more measured improvements to models that are already performing well but need refinement in specific areas.
Ideal Use-Cases for DPO
DPO excels in tasks with multiple valid approaches where quality judgments are nuanced, such as shaping tone, helpfulness, or instruction-following style.
DPO is not a good fit for tasks with a single correct answer, such as:
- Information extraction (NER, classification)
- Tool calling with limited variation
- Mathematical computation
- Tasks where there's an objectively correct answer
Getting Started with DPO on Together
We've prepared a code notebook for you to follow along. Below, we cover some important things to keep in mind when training a model using DPO.
Key Hyperparameters
While you still need to tune the usual hyperparameters, such as the learning rate, the most important parameter for DPO is --dpo-beta:
- This controls how much the model is allowed to deviate from its reference model during training
- Lower values (e.g., 0.1) allow more aggressive updates toward preferred responses
- Higher values keep the model closer to its reference behavior
- The default value is 0.1, but you can experiment with values between 0.05 and 0.5
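Putting these together, here is a hedged sketch of launching a DPO job with the Together Python SDK. The client.fine_tuning.create call with model and training_file reflects the SDK's general fine-tuning interface, but the DPO-specific keyword arguments (training_method, dpo_beta) are assumptions mirroring the --dpo-beta flag above, so check the fine-tuning API reference and the notebook for the exact names.

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# NOTE: `training_method` and `dpo_beta` are assumed parameter names that
# mirror the --dpo-beta flag discussed above; verify the exact names in the
# fine-tuning API reference for your SDK version.
job = client.fine_tuning.create(
    model="your-sft-checkpoint-or-base-model",   # ideally your SFT checkpoint
    training_file="file-<your-preference-dataset-id>",
    training_method="dpo",   # assumption: selects preference fine-tuning
    dpo_beta=0.1,            # default; try values between 0.05 and 0.5
    learning_rate=1e-5,
    n_epochs=1,
)
print(job.id)
```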
Monitoring Training
When monitoring your DPO fine-tuning job, you'll see several metrics specific to preference optimization:
- Accuracy: The percentage of times the reward for the preferred response is greater than the reward for the non-preferred response
- KL Divergence: The divergence between the tuned model and the reference model
These metrics help you gauge how well your model is learning preferences while maintaining its core capabilities.
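As a rough sketch of what these metrics capture (not the platform's exact implementation), both can be computed from the same policy and reference log-probabilities used by the loss:

```python
import torch

def preference_accuracy(chosen_rewards: torch.Tensor,
                        rejected_rewards: torch.Tensor) -> float:
    # Fraction of pairs where the implicit reward of the preferred response
    # exceeds that of the rejected one (the "Accuracy" metric above).
    return (chosen_rewards > rejected_rewards).float().mean().item()

def approx_kl(policy_logps: torch.Tensor, ref_logps: torch.Tensor) -> float:
    # A simple Monte Carlo estimate of KL(policy || reference) over responses:
    # the mean log-probability gap. Values near zero mean the tuned model
    # still behaves much like the reference.
    return (policy_logps - ref_logps).mean().item()
```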
Sources
- https://www.tylerromero.com/posts/2024-04-dpo/
- https://simmering.dev/blog/llm-customization/
- https://iclr-blogposts.github.io/2024/blog/rlhf-without-rl/
- https://huggingface.co/blog/pref-tuning
- https://rlhfbook.com/c/12-direct-alignment.html
- https://arxiv.org/abs/2406.09279v1
- https://www.interconnects.ai/p/the-dpo-debate