Performance of Best of N Baseline for Various N and Sample Responses and GPT-4 Judgments | HackerNoon

Writings, Papers and Blogs on Text Models
August 26, 2024
9:30 pm

Authors:

(1) Rafael Rafailov, Stanford University and Equal contribution; more junior authors listed earlier;

(2) Archit Sharma, Stanford University and Equal contribution; more junior authors listed earlier;

(3) Eric Mitchell, Stanford University and Equal contribution; more junior authors listed earlier;

(4) Stefano Ermon, CZ Biohub;

(5) Christopher D. Manning, Stanford University;

(6) Chelsea Finn, Stanford University.

Table of Links

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Direct Preference Optimization

5 Theoretical Analysis of DPO

6 Experiments

7 Discussion, Acknowledgements, and References

Author Contributions

A Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

A.3 Deriving the DPO Objective Under the Plackett-Luce Model

A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2

A.6 Proof of Theorem 1

B DPO Implementation Details and Hyperparameters

C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details

C.2 GPT-4 prompts for computing summarization and dialogue win rates

C.3 Unlikelihood baseline

D Additional Empirical Results

D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments

D.3 Human study details

D Additional Empirical Results

D.1 Performance of Best of N baseline for Various N

We find that Best of N is a strong baseline in our experiments, although it is computationally expensive because it requires sampling many responses per prompt. We evaluate it for various N on Anthropic-HH dialogue and TL;DR summarization; the results are shown in Figure 4.
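To make the cost trade-off concrete, here is a minimal sketch of a Best of N procedure, assuming it means drawing N candidate responses from a policy and keeping the one a reward function scores highest. The names `best_of_n`, `make_toy_sampler`, and `toy_reward` are hypothetical and used only for illustration; the toy sampler and length-based reward stand in for a real language model and a learned reward model, and are not the setup used in the experiments.

```python
import itertools
from typing import Callable, Iterable


def best_of_n(
    prompt: str,
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    n: int,
) -> str:
    """Draw n candidates for the prompt and return the highest-reward one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each call to sample_fn is one full generation, so cost grows linearly in n.
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def make_toy_sampler(responses: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a policy: cycles through fixed responses."""
    cycle = itertools.cycle(responses)
    return lambda prompt: next(cycle)


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: longer responses score higher."""
    return float(len(response))


if __name__ == "__main__":
    sampler = make_toy_sampler(["ok", "a longer reply", "mid one"])
    print(best_of_n("Summarize the post.", sampler, toy_reward, n=3))
```

With N = 1 the procedure reduces to ordinary sampling; larger N gives the reward function more candidates to choose from at the price of N generations per prompt.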

D.2 Sample Responses and GPT-4 Judgments

In this section, we present examples of comparisons between DPO and the baseline (PPO at temperature 0 for summarization, and the ground truth chosen response for dialogue). See Tables 4-6 for summarization examples, and Tables 7-10 for dialogue examples.
