Use the loss function of the Policy Gradient algorithm as the key to understanding several reinforcement learning algorithms: REINFORCE, Actor-Critic, and PPO. Together they form the theoretical groundwork for understanding Reinforcement Learning from Human Feedback (RLHF), the algorithm used to build ChatGPT.
Studying reinforcement learning can be frustrating because the field is cursed with confusing jargon and algorithms with subtle differences.
I struggled until one day my great colleague Peter Vrancs swiftly wrote down for me the derivation of the loss function for REINFORCE, a Policy Gradient algorithm. Using this derivation, this article links the following algorithms together:
- REINFORCE
- The concept of advantage for variance reduction, and the Actor-Critic algorithm
- Proximal Policy Optimisation (PPO)
Although many articles already cover these algorithms, this one takes a different angle: it studies all three in one go, through a single derivation, to save you learning time!
In my opinion, understanding these three algorithms is the theoretical bare…