Deriving the Optimum of the KL-Constrained Reward Maximization Objective | HackerNoon

Writings, Papers and Blogs on Text Models
August 25, 2024
9:13 pm

Authors:

(1) Rafael Rafailov, Stanford University and Equal contribution; more junior authors listed earlier;

(2) Archit Sharma, Stanford University and Equal contribution; more junior authors listed earlier;

(3) Eric Mitchell, Stanford University and Equal contribution; more junior authors listed earlier;

(4) Stefano Ermon, Stanford University and CZ Biohub;

(5) Christopher D. Manning, Stanford University;

(6) Chelsea Finn, Stanford University.

Table of Links

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Direct Preference Optimization

5 Theoretical Analysis of DPO

6 Experiments

7 Discussion, Acknowledgements, and References

Author Contributions

A Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

A.3 Deriving the DPO Objective Under the Plackett-Luce Model

A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2

A.6 Proof of Theorem 1

B DPO Implementation Details and Hyperparameters

C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details

C.2 GPT-4 prompts for computing summarization and dialogue win rates

C.3 Unlikelihood baseline

D Additional Empirical Results

D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments

D.3 Human study details

Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

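In this appendix, we derive Eq. 4. Analogously to Eq. 3, we optimize the following objective:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi}\big[r(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]$$

under any reward function $r(x, y)$, reference model $\pi_{\mathrm{ref}}$, and a general non-parametric policy class. Writing out the KL term, dividing by $\beta$, and flipping the sign turns this into the equivalent minimization:

$$\min_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi(y \mid x)}\left[\log \frac{\pi(y \mid x)}{\frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)} - \log Z(x)\right]$$

where we have partition function:

$$Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$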
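As a numerical sanity check, the sketch below evaluates the KL-constrained objective on a toy problem with a single prompt and three candidate responses. It assumes the closed-form optimum that Eq. 4 of the paper refers to, pi*(y|x) = pi_ref(y|x) exp(r(x, y) / beta) / Z(x), with Z(x) the sum of pi_ref(y|x) exp(r(x, y) / beta) over y, and checks that no randomly sampled policy scores higher. The values of `pi_ref`, `rewards`, and `beta`, and all function names, are illustrative assumptions and do not come from the paper.

```python
import math
import random


def optimal_policy(pi_ref, rewards, beta):
    """Return (pi_star, Z) where pi_star[y] = pi_ref[y] * exp(r[y] / beta) / Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # partition function Z(x) for this single prompt
    return [w / z for w in weights], z


def objective(pi, pi_ref, rewards, beta):
    """Expected reward minus beta times KL(pi || pi_ref) over a finite response set."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl


def random_policy(k, rng):
    """Sample an arbitrary strictly positive distribution over k responses."""
    raw = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(raw)
    return [v / total for v in raw]


# Hypothetical toy values: one prompt, three candidate responses.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.5]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)

# Gap between the closed-form optimum and 1000 randomly drawn policies.
rng = random.Random(0)
gaps = [
    best - objective(random_policy(len(pi_ref), rng), pi_ref, rewards, beta)
    for _ in range(1000)
]

print("pi_star:", pi_star)
print("objective at pi_star:", best)
print("beta * log Z:", beta * math.log(z))
print("smallest gap to a random policy:", min(gaps))
```

In this setting the objective at `pi_star` should equal `beta * log Z` up to floating-point error, and each gap equals `beta` times the KL divergence between the sampled policy and `pi_star`, so it cannot be negative. Random search is only a spot check, not a proof.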
