Introduction

In the field of machine learning, there has been great advancement in language models. We can now write computer programs that understand a body of text and generate new content. That content can sometimes be strikingly compelling, but this does not mean it is reliably good and correct. As the field continues to grow, there is a need for generated text to be accurate and creative, and for generated code to be functional and logically correct. These requirements are almost impossible to encode directly in a loss function. The main technique is still to train the model to predict the next word or token based on what has come before it. To nudge the generated text towards human preferences, metrics like BLEU (Bilingual Evaluation Understudy) or ROUGE (Recall-Oriented Understudy for Gisting Evaluation) are used. These metrics compare the generated text against reference text. However, they struggle to capture the complexity of human language preferences.
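To make the reference-based evaluation concrete, here is a heavily simplified, ROUGE-1-style unigram-recall computation in plain Python; real implementations handle stemming, multiple references, and longer n-grams, so this is only illustrative:

```python
def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of unique reference unigrams that also appear in the candidate (simplified ROUGE-1 recall)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for tok in ref_tokens if tok in cand_tokens)
    return overlap / len(ref_tokens)

print(rouge1_recall("the cat sat on the mat", "a cat sat on a mat"))  # 0.8
```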

This is where reinforcement learning with human feedback (RLHF) comes in. An additional step is introduced in which a human gives feedback on the generated text; a reward model can then be trained on that feedback and used to optimize the language model. RLHF leverages human knowledge and expertise in the learning process. An RL agent is trained with feedback from a human, who may or may not be an expert. The feedback guides the agent and can be as simple as a reward or as complex as a demonstration or correction. The process follows a few steps: first, an initial policy is trained with standard RL methods. Second, the policy is shown to a human, who gives additional feedback on the agent's actions. This feedback could be binary comparisons of trajectories, ratings on a numerical scale, or explicit corrections. The feedback is meant to improve the policy and can be incorporated by training a reward model that maps state-action pairs to human ratings. After several iterations, the policy is expected to improve substantially. By incorporating human feedback, we can guide the agent to good policies more safely and efficiently. By directly optimizing the language model based on human feedback, RLHF aligns the training of the model more closely with complex human values. The approach holds significant potential to enhance the quality of generated text in line with human preferences and context-specific requirements. The advent of RLHF thus marks a significant stride towards improving the adaptability and performance of language models, opening up exciting avenues for future research and applications.
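As a rough sketch of this loop, the iteration looks roughly like the following; every function here is a hypothetical placeholder standing in for the RL trainer, the human-labeling step, and the reward-model fit:

```python
import random

def train_initial_policy():
    # Placeholder: standard RL (or supervised pretraining for a language model).
    return {"params": [0.0]}

def collect_human_feedback(policy, n_queries=10):
    # Placeholder: show pairs of trajectories to a human and record which one they prefer.
    return [(f"traj_a_{i}", f"traj_b_{i}", random.choice([0, 1])) for i in range(n_queries)]

def fit_reward_model(feedback):
    # Placeholder: supervised learning that maps state-action pairs (or trajectories) to predicted human ratings.
    return {"reward_params": [0.0], "n_labels": len(feedback)}

def improve_policy(policy, reward_model):
    # Placeholder: run RL (e.g. PPO) against the learned reward model.
    policy["params"][0] += 0.1
    return policy

policy = train_initial_policy()
for iteration in range(3):
    feedback = collect_human_feedback(policy)      # human compares / rates the agent's behavior
    reward_model = fit_reward_model(feedback)      # learn a stand-in for the human's preferences
    policy = improve_policy(policy, reward_model)  # optimize the policy against the reward model
```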

Steps

First we pretrain a language model, then we gather data to train a reward model, and finally we fine-tune the language model with reinforcement learning. The pretrained model could be GPT, Gopher, or another large language model, and it can be additionally fine-tuned to meet preferred criteria, such as being helpful, honest, and harmless.

In the second step, a system is built so that for each sequence of text there is a scalar reward representing the human preference.

Third, the model is fine-tuned with RL. The policy is the language model, which takes in a prompt and generates a sequence of text (or a probability distribution over text). The action space is the vocabulary of the language model (around 50,000 tokens). The observation space is the distribution of possible input token sequences. The reward function combines the preference score with a constraint on how far the policy may change. Given a prompt x, a text y is generated. y is concatenated to x and passed to the preference model, which returns a preferability score \(r_{\theta}\). The probability distribution over tokens under the RL policy is then compared with that of the initial model, giving the KL divergence between the two distributions, \(r_{KL}\). The KL divergence penalizes the RL policy for moving far away from the initial pretrained model, which keeps the output reasonably coherent; without it, the optimization can start to generate gibberish that still gets a high reward. The final reward for the RL update is \(r = r_{\theta} - \lambda r_{KL}\). Additional terms can be added to the reward function to incentivize other behaviors. Finally, the PPO update changes the parameters to maximize the reward on the current batch of data. PPO constrains the gradient update so that it does not destabilize the learning process. An advantage actor-critic (A2C) method can also be used.
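A minimal numerical sketch of this reward, assuming we already have the preference model's scalar score and per-token probability distributions from the RL policy and the frozen initial model; in practice the KL penalty is accumulated over all generated tokens, and the numbers below are made up:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    eps = 1e-12  # avoid log(0)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def rlhf_reward(preference_score: float,
                policy_probs: np.ndarray,
                init_probs: np.ndarray,
                lam: float = 0.02) -> float:
    """r = r_theta - lambda * r_KL: preference score minus a penalty for drifting from the initial model."""
    r_kl = kl_divergence(policy_probs, init_probs)
    return preference_score - lam * r_kl

# Toy example with a 5-token vocabulary at a single generation step.
policy_probs = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
init_probs   = np.array([0.40, 0.20, 0.20, 0.10, 0.10])
print(rlhf_reward(preference_score=1.3, policy_probs=policy_probs, init_probs=init_probs))
```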

The main challenge lies on the human side: annotations are costly, and even experts disagree on the ground truth.

TAMER

TAMER stands for Training an Agent Manually via Evaluative Reinforcement. It allows a human to train a learning agent to perform a common class of complex tasks by giving reward signals in response to the agent's actions. The agent models that reward function and chooses actions to maximize the modeled reward. TAMER was tested on Tetris: with human feedback, the agent learned to clear 50 lines by its third game, much faster than autonomous learning agents.

Vanilla reinforcement learning sometimes takes too long for practical purposes, and in high-stakes situations suboptimal performance can lead to substantial financial loss. Moreover, when the human has relevant opinions or expertise, it would be wasteful to ignore them and learn from scratch. TAMER was developed so that the human's instruction does not need to be complicated: only positive and negative reinforcement signals are needed to communicate with the agent. It only requires the person to observe the agent's action, make a judgment of its quality, and then send a feedback signal that can easily be mapped to a numerical value. The human does not need to give advice or demonstrate behavior to the agent.

The sequential decision-making task is modeled as a Markov decision process (MDP), and the reward function is learned with supervised learning. A typical MDP has S, the set of possible states; A, the set of actions; T, the transition function, which gives the probability of moving to a new state given a state and an action; \(\gamma\), the discount factor that decreases the value of future rewards; D, the distribution of start states; and R, the reward function, where the reward depends on the states \(s_t\) and \(s_{t+1}\). Traditionally the agent learns autonomously through environmental interaction, but in TAMER a human trainer provides feedback. The agent models the human's reward function and greedily chooses the action that maximizes the immediate predicted reward. It maximizes the immediate reward rather than an expected return because the human's feedback is assumed to already take long-term consequences into account. After learning the human's reward function, the agent can perform the task without human feedback. In practice the human's reward function is not consistent; it is a moving target. This simplification is also problematic when several goals must be reached at once, or when agents are built to serve humans, in which case the correct policy is the one specific to that human's preferences and needs.
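Writing \(\hat{R}_H\) for the agent's learned model of the human's reward (notation introduced here for clarity), action selection in TAMER is simply the greedy choice

\[ a_t = \arg\max_{a \in A} \hat{R}_H(s_t, a). \]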

There are other systems that incorporate human feedback for an RL learner. For example, some let the human give advice in the form of code, which makes them inaccessible to non-technical users; others accept advice in natural language but still require a domain-specific natural language interface. TAMER is relatively simple in that sense: the human only needs to give a binary evaluation of the agent's action. This helps in cases where the human cannot articulate why the agent is performing well or poorly. Another approach is to let the human give examples of good behavior, which the agent can copy or improvise on. The human can give a demonstration by performing the task themselves or by controlling a similar device, but doing so is cognitively costly. In contrast, judging an action does not require the human to be an expert or well trained (for example, when the task is driving a complicated robot).

The TAMER algorithm consists of several sub-procedures: RunAgent(), UpdateRewardModel(), and ChooseAction(). RunAgent() initializes the time t, the weights w of the reward model, and the feature vectors. It then takes the first action and, at each step, receives a full state description and the human's reward signal before choosing the next action. UpdateRewardModel() uses gradient descent to adjust the weights of a linear function approximator, where the error is the difference between the projected reward and the given reward. ChooseAction() evaluates the effect of each potential action and chooses the one it predicts to be the most valuable. A simplified sketch follows below.
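Below is a minimal sketch of the linear reward model and greedy action selection just described; the feature function, learning rate, and toy state are invented for illustration, and the credit assignment that the full TAMER system performs for delayed human feedback is omitted:

```python
import numpy as np

class TamerAgent:
    """Minimal sketch: linear model of the human's reward, trained by gradient descent on the feedback signal."""

    def __init__(self, n_features: int, actions, learning_rate: float = 0.01):
        self.w = np.zeros(n_features)   # weights of the linear reward model
        self.actions = actions
        self.lr = learning_rate

    def predict_reward(self, features: np.ndarray) -> float:
        return float(self.w @ features)

    def choose_action(self, state, feature_fn):
        # Greedily pick the action whose predicted human reward is highest.
        return max(self.actions, key=lambda a: self.predict_reward(feature_fn(state, a)))

    def update_reward_model(self, features: np.ndarray, human_reward: float):
        # Gradient descent on the squared error between projected reward and the given human reward.
        error = human_reward - self.predict_reward(features)
        self.w += self.lr * error * features

# Toy usage with a hypothetical feature function over (state, action) pairs.
feature_fn = lambda s, a: np.array([s, a, 1.0], dtype=float)
agent = TamerAgent(n_features=3, actions=[0, 1, 2])
a = agent.choose_action(state=0.5, feature_fn=feature_fn)
agent.update_reward_model(feature_fn(0.5, a), human_reward=1.0)  # human liked the action
```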

COACH

COACH stands for Convergent Actor-Critic by Humans; it is an algorithm for learning from human feedback. Empirical evidence shows that human feedback depends on the learner's purpose and on the agent's current policy, and COACH takes this into account. The advantage function is a good model of human feedback, capturing properties such as diminishing returns, rewarding improvement, and others.

COACH models the agent's interaction with the environment as an MDP. In RL the reward and transition functions are not known, and the agent has to learn a policy. A common class of RL algorithms is actor-critic: the actor dictates how the agent chooses an action, while the critic estimates the value function at each time step to update the policy parameters. The critic's signal is the temporal difference (TD) error \(\delta_t = r_t + \gamma V(s_t) - V(s_{t-1})\).
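A minimal sketch of a COACH-style actor update, in which the human's feedback takes the place of the advantage in the policy-gradient step; the softmax policy, features, and learning rate here are illustrative, and the full Real-Time COACH algorithm additionally uses eligibility traces and accounts for the human's reaction delay:

```python
import numpy as np

class CoachAgent:
    """Minimal sketch of a COACH-style actor update: human feedback plays the role of the advantage."""

    def __init__(self, n_features: int, n_actions: int, learning_rate: float = 0.05):
        self.theta = np.zeros((n_actions, n_features))  # softmax policy parameters
        self.lr = learning_rate

    def action_probs(self, state_features: np.ndarray) -> np.ndarray:
        logits = self.theta @ state_features
        exp = np.exp(logits - logits.max())              # numerically stable softmax
        return exp / exp.sum()

    def choose_action(self, state_features: np.ndarray) -> int:
        return int(np.random.choice(len(self.theta), p=self.action_probs(state_features)))

    def update(self, state_features: np.ndarray, action: int, human_feedback: float):
        # Policy-gradient step with human feedback in place of the advantage:
        #   theta += lr * f * grad log pi(a | s)
        probs = self.action_probs(state_features)
        grad_log_pi = -np.outer(probs, state_features)   # -pi(b|s) * phi(s) for every action b
        grad_log_pi[action] += state_features            # plus phi(s) for the taken action
        self.theta += self.lr * human_feedback * grad_log_pi

# Toy usage: the human sends +1 / -1 after observing an action.
agent = CoachAgent(n_features=4, n_actions=3)
phi = np.array([1.0, 0.0, 0.5, -0.5])
a = agent.choose_action(phi)
agent.update(phi, a, human_feedback=+1.0)
```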

Others

There are other approaches in a similar vein, where the human gives preferences between pairs of trajectory segments. Complex RL tasks can be solved this way without access to a reward function, while relying on limited feedback, which allows the agent to learn more complex behaviors. Experiments are carried out on tasks where the reward is unknown and the only feedback is which of two trajectories is better; the tasks are simulated robotics tasks and Atari games. The performance matches traditional RL, and the algorithm learns novel complex behaviors. The technique can be scaled up to large reinforcement learning systems, which is promising for applying RL to complex human value systems. A deep version of TAMER has also been developed to handle high-dimensional state spaces, using a deep neural network to approximate the human reward function.

Training models with human preferences has also been extended to natural language tasks such as summarizing articles. Given a vocabulary \(\Sigma\), we have a language model \(\rho\) which defines a probability distribution over sequences of tokens. For example, if we have a 1000-word article and need a 100-word summary, we can fix the article as the beginning of the sequence and generate the subsequent tokens with \(\rho\). A policy \(\pi = \rho\) is initialized and then fine-tuned to perform the task with RL (PPO). RL can directly optimize an expected reward, or human labels can be used to train a reward model which is then optimized. The pretrained model and KL regularization can be used to prevent the policy from diverging too far from natural language.
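Across these preference-based setups, the reward model is typically fit to pairwise human choices with a logistic (Bradley-Terry) loss: each segment is scored by summing the predicted reward over its steps, and the model is trained to give a higher total score to the preferred segment. A minimal sketch with made-up reward values:

```python
import numpy as np

def preference_loss(rewards_a: np.ndarray, rewards_b: np.ndarray, human_prefers_a: bool) -> float:
    """Cross-entropy loss for a Bradley-Terry preference model over two trajectory segments."""
    score_a, score_b = rewards_a.sum(), rewards_b.sum()   # sum predicted rewards over each segment
    p_a = 1.0 / (1.0 + np.exp(score_b - score_a))         # P(A preferred) = sigmoid(score_a - score_b)
    return float(-np.log(p_a) if human_prefers_a else -np.log(1.0 - p_a))

# Toy example: the reward model scores segment A slightly higher, and the human agreed.
rewards_a = np.array([0.2, 0.5, 0.1])
rewards_b = np.array([0.1, 0.0, 0.3])
print(preference_loss(rewards_a, rewards_b, human_prefers_a=True))  # small loss: model agrees with the label
```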

Other tasks that have been studied extensively include book summarization. Here the algorithm summarizes small parts of the book and then recursively summarizes those summaries to produce a summary of the entire book. With human feedback, the result reaches the quality of a human-written summary around 5% of the time. Some versions of GPT have been trained with human feedback to improve quality, such as WebGPT and InstructGPT. From DeepMind we have GopherCite, which can answer open-ended questions with high-quality supporting evidence and refrain from answering when unsure. The aim of training large language models this way is to provide a helpful and harmless assistant to humans.
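The recursive scheme can be sketched roughly as follows, with a hypothetical summarize() placeholder standing in for a call to a summarization model trained with human feedback:

```python
def summarize(text: str, max_words: int = 100) -> str:
    # Hypothetical placeholder: in practice this would call a learned summarization model.
    words = text.split()
    return " ".join(words[:max_words])

def summarize_book(book: str, chunk_words: int = 2000, max_words: int = 100) -> str:
    """Recursively summarize: summarize chunks, then summarize the concatenated summaries."""
    words = book.split()
    if len(words) <= chunk_words:
        return summarize(book, max_words)
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    summaries = " ".join(summarize(chunk, max_words) for chunk in chunks)
    return summarize_book(summaries, chunk_words, max_words)  # recurse on the summaries themselves
```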

Conclusion

In summary, RLHF represents a significant advancement in the field of artificial intelligence, particularly in training language models. By incorporating human feedback into the training process, we can guide the learning of models more directly and with an increased likelihood of satisfying complex human needs and preferences. As opposed to traditional methods that use static metrics or simplistic loss functions, RLHF captures the dynamic and multifaceted aspects of what makes generated content useful, meaningful, and high quality.

Despite its promise, the implementation of RLHF is not without its challenges, such as the logistical issues in collecting human feedback, the potential for bias, and the complexity of transforming qualitative human judgments into quantitative data. However, the potential benefits of this approach underscore the importance of overcoming these hurdles.

As we move forward, the continued exploration of RLHF and similar methods will play a crucial role in pushing the boundaries of what language models and other AI systems can achieve, bringing us closer to machines that can understand and respond to our needs in ways that feel more human. The potential for RLHF to create more effective, intuitive, and user-aligned AI is immense, and further exploration in this space is an exciting prospect.