Introduction to agents, environments, actions, states, and rewards.
Reinforcement Learning (RL) is a paradigm of machine learning concerned with how an intelligent 'agent' ought to take 'actions' in an 'environment' to maximize a cumulative 'reward.' It's a framework for learning from interaction to achieve a goal.

The core components of an RL problem are: the Agent, which is the learner or decision-maker (in our case, the LLM); the Environment, which is the world the agent interacts with (in RLHF, this is effectively the space of all possible text responses); a State (s), which is a snapshot of the environment at a particular time; an Action (a), which is a move the agent can make (in the context of an LLM, an action is the generation of a token or a full response); and a Reward (r), which is a feedback signal from the environment, a scalar value that tells the agent how good or bad its last action was.

The agent's goal is to learn a 'policy' (π), a strategy or mapping from states to actions that dictates what action the agent should take in any given state. The objective is to find an optimal policy that maximizes the total expected future reward.

A key concept in RL is the trade-off between exploration and exploitation: the agent must exploit what it already knows to get rewards, but it must also explore new actions to discover better strategies for the future. RL algorithms, like Q-learning or PPO, provide mathematical frameworks for an agent to learn this optimal policy through trial and error, by interacting with its environment and observing the rewards it receives, as the sketch below illustrates.
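To make these pieces concrete, here is a minimal sketch of the agent-environment loop using tabular Q-learning with epsilon-greedy exploration. The toy 5-state "chain" environment, the `step` function, and the hyperparameter values are illustrative assumptions invented for this example rather than anything defined above; RLHF for LLMs typically uses policy-gradient methods such as PPO instead of a Q-table, but the cycle of observing a state, choosing an action, receiving a reward, and updating the policy is the same.

```python
import random

# Hypothetical toy environment: a 5-state chain. The agent starts in state 0,
# can move left (0) or right (1), and receives a reward of +1 only when it
# reaches the rightmost state, which ends the episode.
N_STATES = 5
ACTIONS = [0, 1]  # 0 = left, 1 = right

def step(state, action):
    """Environment dynamics: return (next_state, reward, done)."""
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

# Q-table: estimated expected future reward for each (state, action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # Exploration vs. exploitation: with probability epsilon take a random
        # action, otherwise take the greedy action under the current Q estimates.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])

        next_state, reward, done = step(state, action)

        # Q-learning update: move Q(s, a) toward the observed reward plus the
        # discounted value of the best action available in the next state.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

        state = next_state

# The learned greedy policy: pi(s) = argmax_a Q(s, a).
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
print(policy)  # expected to map every non-terminal state to action 1 (move right)
```

Note the role of epsilon here: with it set to 0 the agent would greedily stay put forever and never discover the reward, which is the exploration-exploitation trade-off in its simplest form.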