Authors:
(1) Rafael Rafailov, Stanford University (equal contribution; more junior authors listed earlier);
(2) Archit Sharma, Stanford University (equal contribution; more junior authors listed earlier);
(3) Eric Mitchell, Stanford University (equal contribution; more junior authors listed earlier);
(4) Stefano Ermon, CZ Biohub;
(5) Christopher D. Manning, Stanford University;
(6) Chelsea Finn, Stanford University.
Table of Links
4 Direct Preference Optimization
7 Discussion, Acknowledgements, and References
A Mathematical Derivations
A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective
A.2 Deriving the DPO Objective Under the Bradley-Terry Model
A.3 Deriving the DPO Objective Under the Plackett-Luce Model
A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2
B DPO Implementation Details and Hyperparameters
C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details
C.2 GPT-4 prompts for computing summarization and dialogue win rates
D Additional Empirical Results
D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments
A.3 Deriving the DPO Objective Under the Plackett-Luce Model
The Plackett-Luce model [30, 21] is a generalization of the Bradley-Terry model to rankings (rather than just pairwise comparisons). Similar to the Bradley-Terry model, it stipulates that when presented with a set of possible choices, people prefer a choice with probability proportional to the value of some latent reward function for that choice. In our context, when presented with a prompt x and a set of K answers y1, . . . , yK, a user outputs a permutation τ : [K] → [K], giving their ranking of the answers. The Plackett-Luce model specifies the probability of such a ranking in terms of the latent rewards of the K answers.
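In the standard Plackett-Luce formulation (the form that the Equation 18 referenced below follows), the probability of observing the ranking τ under a latent reward r* is

$$p^*(\tau \mid y_1, \dots, y_K, x) \;=\; \prod_{k=1}^{K} \frac{\exp\!\left(r^*(x, y_{\tau(k)})\right)}{\sum_{j=k}^{K} \exp\!\left(r^*(x, y_{\tau(j)})\right)},$$

i.e., the top-ranked answer is selected by a softmax over all K answers, the second-ranked by a softmax over the remaining K − 1, and so on.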
Notice that when K = 2, Equation 18 reduces to the Bradley-Terry model. For the general Plackett-Luce model, however, we can still use the result of Eq. 5 and substitute the reward function parameterized by its optimal policy. As in Appendix A.2, the normalization constant Z(x) cancels out, leaving an expression for the ranking probability purely in terms of the optimal and reference policies.
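As a sketch of the resulting expressions (using the paper's notation from earlier sections: β is the coefficient of the KL constraint in Eq. 5, πref is the reference policy, and π* is the optimal policy), the substitution gives

$$p^*(\tau \mid y_1, \dots, y_K, x) \;=\; \prod_{k=1}^{K} \frac{\exp\!\left(\beta \log \frac{\pi^*(y_{\tau(k)} \mid x)}{\pi_{\mathrm{ref}}(y_{\tau(k)} \mid x)}\right)}{\sum_{j=k}^{K} \exp\!\left(\beta \log \frac{\pi^*(y_{\tau(j)} \mid x)}{\pi_{\mathrm{ref}}(y_{\tau(j)} \mid x)}\right)},$$

since the factor exp(β log Z(x)) appears in each numerator and in every term of the corresponding denominator sum, so it cancels. Writing the policy as a parameterized model πθ and maximizing the likelihood of rankings drawn from a dataset D then gives the ranking analogue of the DPO objective,

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta, \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{\tau,\, y_1, \dots, y_K,\, x \sim \mathcal{D}}\!\left[\log \prod_{k=1}^{K} \frac{\exp\!\left(\beta \log \frac{\pi_\theta(y_{\tau(k)} \mid x)}{\pi_{\mathrm{ref}}(y_{\tau(k)} \mid x)}\right)}{\sum_{j=k}^{K} \exp\!\left(\beta \log \frac{\pi_\theta(y_{\tau(j)} \mid x)}{\pi_{\mathrm{ref}}(y_{\tau(j)} \mid x)}\right)}\right],$$

which reduces to the pairwise DPO loss when K = 2.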