Deriving the DPO Objective Under the Bradley-Terry Model | HackerNoon

Writings, Papers and Blogs on Text Models
August 25, 2024
9:14 pm

Authors:

(1) Rafael Rafailov, Stanford University and Equal contribution; more junior authors listed earlier;

(2) Archit Sharma, Stanford University and Equal contribution; more junior authors listed earlier;

(3) Eric Mitchell, Stanford University and Equal contribution; more junior authors listed earlier;

(4) Stefano Ermon, Stanford University and CZ Biohub;

(5) Christopher D. Manning, Stanford University;

(6) Chelsea Finn, Stanford University.

Table of Links

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Direct Preference Optimization

5 Theoretical Analysis of DPO

6 Experiments

7 Discussion, Acknowledgements, and References

Author Contributions

A Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

A.3 Deriving the DPO Objective Under the Plackett-Luce Model

A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2

A.6 Proof of Theorem 1

B DPO Implementation Details and Hyperparameters

C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details

C.2 GPT-4 prompts for computing summarization and dialogue win rates

C.3 Unlikelihood baseline

D Additional Empirical Results

D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments

D.3 Human study details

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

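It is straightforward to derive the DPO objective under the Bradley-Terry preference model, under which we have

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}. \tag{16}$$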

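In Section 4 we showed that we can express the (unavailable) ground-truth reward through its corresponding optimal policy:

$$r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x). \tag{17}$$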

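Substituting Eq. 17 into Eq. 16 we obtain:

$$\begin{aligned}
p^*(y_1 \succ y_2 \mid x) &= \frac{\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} + \beta \log Z(x)\right)}{\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} + \beta \log Z(x)\right) + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} + \beta \log Z(x)\right)} \\
&= \frac{1}{1 + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)} \\
&= \sigma\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}\right),
\end{aligned}$$

where $\sigma$ is the logistic function. The term $\beta \log Z(x)$ appears in every exponent, so it cancels between numerator and denominator.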
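The substitution can also be checked numerically. The following is a minimal Python sketch, not code from the paper: the function names and all probability values are hypothetical and chosen only for illustration. It builds rewards from a policy ratio plus an arbitrary log-partition term, feeds them to the Bradley-Terry model, and compares the result with the logistic form that depends on the policy ratios alone.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def bradley_terry(r1, r2):
    """P(y1 preferred over y2) under Bradley-Terry with rewards r1 and r2."""
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))


def reward_from_policy(pi_star, pi_ref, beta, log_z):
    """Reward expressed through the optimal policy:
    beta * log(pi_star / pi_ref) + beta * log Z(x)."""
    return beta * math.log(pi_star / pi_ref) + beta * log_z


def dpo_preference(pi_star_1, pi_ref_1, pi_star_2, pi_ref_2, beta):
    """Preference probability written only in terms of policy ratios."""
    margin = beta * math.log(pi_star_1 / pi_ref_1) - beta * math.log(pi_star_2 / pi_ref_2)
    return sigmoid(margin)


# Hypothetical values for a single prompt x and two completions y1, y2.
beta = 0.1
pi_star_1, pi_ref_1 = 0.30, 0.20
pi_star_2, pi_ref_2 = 0.05, 0.10

closed_form = dpo_preference(pi_star_1, pi_ref_1, pi_star_2, pi_ref_2, beta)

# The log-partition term is unknown in practice; any value gives the same answer.
for log_z in (-3.0, 0.0, 2.5):
    r1 = reward_from_policy(pi_star_1, pi_ref_1, beta, log_z)
    r2 = reward_from_policy(pi_star_2, pi_ref_2, beta, log_z)
    assert math.isclose(bradley_terry(r1, r2), closed_form, rel_tol=1e-9)
```

Because the same log Z(x) is added to both rewards, the Bradley-Terry probability is unchanged by it, which is why the final expression can be evaluated without ever computing the partition function.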

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
