Using RL to fine-tune the LLM to maximize the reward score.
The final stage of RLHF uses the trained reward model to fine-tune the language model itself. The LLM is now treated as a reinforcement learning 'agent' whose 'policy' is the probability distribution over next tokens that it produces at each generation step. The goal is to update this policy so that the model generates responses that earn high scores from the reward model. A popular and effective algorithm for this task is Proximal Policy Optimization (PPO).

The training loop works as follows: a prompt is sampled from the dataset; the current LLM (the policy) generates a response to it; the reward model scores the resulting prompt-response pair; and that reward signal is used to update the weights of the LLM.

A naive application of RL can cause the LLM to 'over-optimize' against the reward model, discovering adversarial outputs that score highly but are nonsensical or repetitive. This failure mode is known as 'reward hacking.' To guard against it, the RLHF objective adds a penalty term based on the Kullback-Leibler (KL) divergence between the current policy's output distribution and that of the original, pre-RLHF (instruction-tuned) model. This KL penalty discourages the policy from drifting too far from the reference model over the course of training, while PPO's own clipped objective limits how much the policy can change in any single update step. Together, these constraints keep the LLM grounded in its strong language capabilities while gently steering it toward outputs that better reflect human preferences, striking a balance between alignment and capability.
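To make the loop concrete, here is a minimal, self-contained PyTorch sketch of the update described above. The policy, reference model, and reward model are tiny toy stand-ins (a linear layer over a toy vocabulary and a hand-written scoring rule), not real LLMs, and names such as ToyPolicy, toy_reward_model, BETA, and CLIP_EPS are illustrative assumptions rather than any particular library's API. The sketch shows the shaped reward (reward-model score minus a KL penalty against a frozen reference) and a PPO-style clipped update; it omits the value function and advantage estimation that a full implementation would include.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 16          # toy vocabulary size
RESPONSE_LEN = 8    # fixed length of the toy "responses"
BETA = 0.1          # weight of the KL penalty against the reference model
CLIP_EPS = 0.2      # PPO clipping range

class ToyPolicy(nn.Module):
    """Stand-in for the LLM: maps a one-hot prompt to per-token logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(VOCAB, RESPONSE_LEN * VOCAB)

    def forward(self, prompt_onehot):
        return self.net(prompt_onehot).view(RESPONSE_LEN, VOCAB)

def toy_reward_model(prompt_id, response_ids):
    """Stand-in reward model: prefers responses that echo the prompt token."""
    return (response_ids == prompt_id).float().mean()

policy = ToyPolicy()                           # the model being fine-tuned
reference = ToyPolicy()                        # frozen copy of the pre-RLHF model
reference.load_state_dict(policy.state_dict())
for p in reference.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(200):
    # 1. Sample a prompt and let the current policy generate a response.
    prompt_id = torch.randint(VOCAB, ())
    prompt_onehot = F.one_hot(prompt_id, VOCAB).float()
    with torch.no_grad():
        old_dist = torch.distributions.Categorical(logits=policy(prompt_onehot))
        response = old_dist.sample()                     # shape: (RESPONSE_LEN,)
        old_logprob = old_dist.log_prob(response).sum()

        # 2. Score the response, then subtract a sampled KL estimate against
        #    the frozen reference so the policy is penalised for drifting
        #    away from it.
        ref_dist = torch.distributions.Categorical(logits=reference(prompt_onehot))
        kl = (old_dist.log_prob(response) - ref_dist.log_prob(response)).sum()
        reward = toy_reward_model(prompt_id, response) - BETA * kl

    # 3. PPO-style clipped policy-gradient update (no value baseline or
    #    advantage estimation, to keep the sketch short).
    new_dist = torch.distributions.Categorical(logits=policy(prompt_onehot))
    new_logprob = new_dist.log_prob(response).sum()
    ratio = torch.exp(new_logprob - old_logprob)
    advantage = reward
    loss = -torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS) * advantage)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because each sampled batch is used for only a single gradient step here, the clipping term rarely binds; production PPO implementations typically reuse each batch for several optimization epochs, which is where the clipped ratio does its work of keeping each update close to the policy that generated the samples.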