RLHF (Reinforcement Learning from Human Feedback) & Safety

Aligning LLMs with human values and ensuring safe outputs.

Estimated time: 5 days

Topics in this Chapter

1. Reinforcement Learning Basics
Introduction to agents, environments, actions, states, and rewards (a minimal agent-environment loop is sketched after this list).

2. Reward Modeling
Training a model to predict human preferences (a pairwise preference-loss sketch follows the list).

3. Policy Optimization (PPO)
Using RL to fine-tune the LLM to maximize the reward score (a clipped-objective sketch follows the list).

4. AI Safety & Alignment
Broader concepts of making AI helpful, honest, and harmless.
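
A minimal, self-contained sketch of the agent-environment loop from topic 1. The ToyEnvironment and RandomAgent classes are illustrative inventions, not part of any RL library; the point is only to show states, actions, and rewards flowing between an agent and an environment.

```python
import random

class ToyEnvironment:
    """A 1-D corridor: the agent starts at position 0 and is rewarded for reaching +3."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (step left) or +1 (step right)
        self.state += action
        reward = 1.0 if self.state == 3 else 0.0   # reward signal from the environment
        done = self.state in (3, -3)               # episode ends at either wall
        return self.state, reward, done

class RandomAgent:
    """Placeholder policy: chooses an action uniformly at random, ignoring the state."""
    def act(self, state):
        return random.choice([-1, +1])

env, agent = ToyEnvironment(), RandomAgent()
state, done, episode_return = 0, False, 0.0
while not done:
    action = agent.act(state)                 # agent picks an action given the current state
    state, reward, done = env.step(action)    # environment returns next state, reward, done flag
    episode_return += reward
print("episode return:", episode_return)
```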

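For reward modeling (topic 2), a common training signal is a pairwise preference loss: the scalar reward assigned to the human-preferred ("chosen") response should exceed the reward of the rejected one. The sketch below assumes PyTorch; the tensor names and toy values are illustrative, not taken from any specific codebase.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar rewards a hypothetical reward model assigned to three
# chosen/rejected response pairs for the same prompts.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(reward_model_loss(chosen, rejected))   # smaller when chosen rewards exceed rejected ones
```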
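
For PPO (topic 3), the core update is the clipped surrogate objective, which keeps each policy step close to the previous policy; RLHF pipelines typically combine this with a KL penalty toward the reference model, which is omitted here. Function and variable names, and the clip value of 0.2, are illustrative assumptions.

```python
import torch

def ppo_clipped_objective(logprobs_new: torch.Tensor,
                          logprobs_old: torch.Tensor,
                          advantages: torch.Tensor,
                          clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate loss (to be minimized)."""
    ratio = torch.exp(logprobs_new - logprobs_old)                  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # pessimistic bound, negated

# Toy usage with per-token log-probabilities and advantage estimates.
logprobs_new = torch.tensor([-1.0, -0.7, -2.1])
logprobs_old = torch.tensor([-1.1, -0.9, -2.0])
advantages = torch.tensor([0.5, -0.2, 1.0])
print(ppo_clipped_objective(logprobs_new, logprobs_old, advantages))
```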