ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step

A much cheaper alignment method performing as well as DPO


There are now many methods to align large language models (LLMs) with human preferences. Reinforcement learning from human feedback (RLHF) was one of the first and brought us ChatGPT, but RLHF is very costly. DPO, IPO, and KTO are notably…