Training a model to predict human preferences.
Reward Modeling is the second and arguably most critical step in the Reinforcement Learning from Human Feedback (RLHF) pipeline. The goal of RLHF is to align an LLM with human preferences, but reinforcement learning requires a reward function that can provide immediate feedback for any action the LLM takes. Since it is impractical to have humans rate every response the LLM generates during RL training, we instead use the collected human preference data to train a proxy: the reward model (RM).

A reward model is typically another language model, often initialized from the same pre-trained model being aligned, but with its final layer replaced by a linear head that outputs a single scalar value. Given a prompt-response pair, its task is to assign a score that reflects how strongly humans would prefer that response.

The training data for the RM consists of comparisons: for a given prompt, two or more responses that human labelers have ranked (e.g., Response A is better than Response B). The RM is trained with a ranking loss. It scores both Response A and Response B, and the objective pushes the score for A above the score for B; in practice this is typically the negative log-sigmoid of the difference between the two scores, following the Bradley-Terry model of pairwise preferences (see the sketches below).

By training on a large dataset of these human-ranked comparisons, the reward model learns to internalize the complex, nuanced, and often subjective criteria that humans use to judge language, including helpfulness, factual accuracy, harmlessness, and style. Once trained, the reward model provides an automated, scalable reward signal for any response the LLM generates, enabling the final reinforcement learning stage.
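
To make the architecture concrete, here is a minimal sketch (not drawn from any particular library) of a reward model: a pre-trained transformer backbone with its language-modeling head swapped for a linear value head that emits one scalar per sequence. It assumes a Hugging Face-style backbone that returns `last_hidden_state`; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """A transformer backbone whose LM head is replaced by a scalar value head."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                     # pre-trained transformer (assumed)
        self.value_head = nn.Linear(hidden_size, 1)  # outputs a single scalar reward

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                          # (batch, seq_len, hidden)
        # Read the reward from the final non-padding token of each sequence
        # (a common convention, assuming right-padded inputs).
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # (batch,) scalar scores
```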
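
The ranking loss itself can be sketched as below: the negative log-sigmoid of the score difference, which is minimized when the human-preferred response outscores the rejected one. Minimizing it is equivalent to maximizing the likelihood of the human rankings under a Bradley-Terry preference model. The names in the usage comment (`reward_model`, `chosen_ids`, and so on) are hypothetical placeholders for a tokenized batch of ranked pairs.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_chosen: torch.Tensor,
                          score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected),
    # driven toward zero as the preferred response is scored ever higher.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Hypothetical training step on a batch of ranked pairs:
#   score_chosen   = reward_model(chosen_ids,   chosen_mask)
#   score_rejected = reward_model(rejected_ids, rejected_mask)
#   loss = pairwise_ranking_loss(score_chosen, score_rejected)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```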