Introduction

There are several approaches to using neural network function approximators in reinforcement learning, for example deep Q-learning, policy gradient methods, and trust region methods. Proximal Policy Optimization (PPO) belongs to the class of policy gradient methods. PPO was developed by OpenAI, is effective and efficient, and is popular in the machine learning community.

Let’s recap a bit. Policy gradient methods aim to optimize the policy directly, guided by the gradient of the expected return. These methods suffer from issues such as high variance and slow convergence, which PPO addresses with a simple idea: instead of making large updates to the policy, make small updates from the current policy (proximal updates). Specifically, PPO uses a clipping function that places a lower and an upper bound on the change that can be made to the policy in each update. The objective function incorporates this clip so that moving the new policy too far from the old one brings no further gain in the objective (and can only hurt it). By doing this, the difference between the new and old policy is effectively kept within a specific range.

Policy gradient methods normally compute an estimator of the policy gradient and then plug it into a stochastic gradient ascent algorithm:

\[\hat{g} = \hat{E}_t {[\nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \hat{A}_t ]}\]

with \(\pi_{\theta}\) being a stochastic policy, \(\hat{A}_t\) an estimator of the advantage function at timestep t, and \(\hat{E}_t\) the empirical average over a finite batch of samples. The corresponding loss function is:

\[L_{PG}(\theta) = \hat{E}_t {[\log \pi_{\theta}(a_t \mid s_t) \hat{A}_t ]}\]
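
As a concrete illustration, here is a minimal sketch of this surrogate loss in PyTorch; the tensors `log_probs` and `advantages` are illustrative names for quantities computed from a batch of rollout data, not part of any specific library:

```python
import torch

def policy_gradient_loss(log_probs, advantages):
    """Vanilla policy-gradient surrogate: E_t[log pi_theta(a_t|s_t) * A_t].

    log_probs:  log pi_theta(a_t | s_t) for the sampled actions
    advantages: advantage estimates A_t, treated as constants
    """
    # Negate so that minimizing this loss performs gradient *ascent*
    # on the expected return.
    return -(log_probs * advantages.detach()).mean()
```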

Ascending the gradient of this loss makes the agent take actions that lead to higher rewards. The problem with this loss function is that it can lead to large policy updates, and empirically, smaller policy updates are more likely to lead to convergence. Too big a step might mean “falling off a cliff” with no way to recover.

In trust region methods, there is a constraint on the size of the policy update.

\(\underset{\theta}{\text{maximize}} \; \hat{E}_t {[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t ]}\) subject to \(\hat{E}_t {[KL{[\pi_{\theta_{old}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)]}]} \leq \delta\)

Alternatively, we can solve an unconstrained optimization problem that uses a penalty instead of a constraint:

\[\underset{\theta}{\text{maximize}} \; \hat{E}_t {[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t - \beta KL {[\pi_{\theta_{old}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)]} ]}\]
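
As a rough sketch of this penalized objective (not code from the paper), assuming `new_log_probs` and `old_log_probs` are the log-probabilities of the sampled actions under the new and old policies:

```python
import torch

def kl_penalty_objective(new_log_probs, old_log_probs, advantages, beta):
    """Surrogate objective with a KL penalty instead of a hard constraint.

    Returns the quantity to maximize; negate it to use as a loss.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_theta / pi_theta_old
    surrogate = ratio * advantages
    # Sample-based estimate of KL[pi_old, pi_theta] on the visited states/actions.
    approx_kl = old_log_probs.detach() - new_log_probs
    return (surrogate - beta * approx_kl).mean()
```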

Let \(r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\), so \(r_t(\theta_{old}) = 1\). The surrogate objective of TRPO then becomes \(L_{CPI}(\theta) = \hat{E}_t {[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}} (a_t \mid s_t)} \hat{A}_t ]} = \hat{E}_t {[ r_t(\theta) \hat{A}_t ]}\)

where CPI stands for conservative policy iteration. PPO proposes a new objective function that penalizes changes to the policy that move \(r_t(\theta)\) away from 1.

The Clip function

\[L_{CLIP}(\theta) = \hat{E}_t {[min(r_t(\theta) \hat{A}_t, clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t)]}\]

with \(\theta\) being the policy parameters and \(\hat{E}_t\) denoting the empirical expectation over timesteps. \(r_t\) is the ratio of the probabilities under the new and old policies, \(\hat{A}_t\) is the estimated advantage at time t, and \(\epsilon\) is a hyperparameter, usually 0.1 or 0.2.

We can see that, inside \(L_{CLIP}\), the first term inside the min is \(L_{CPI}\). The second term clips the probability ratio, effectively removing the incentive for moving \(r_t\) outside the interval \({[ 1 - \epsilon, 1 + \epsilon ]}\). Once the ratio is clipped, the gradient of that term is zero.
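
In code, the clipped objective might look roughly like this (a sketch in PyTorch, returned as a loss to minimize; the tensor names are assumptions):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate L_CLIP, returned as a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # r_t(theta)
    unclipped = ratio * advantages                              # the L_CPI term
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the elementwise min gives a pessimistic bound, so there is
    # no incentive to push r_t far outside [1 - eps, 1 + eps].
    return -torch.min(unclipped, clipped).mean()
```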


One way to implement a policy gradient method with a neural network is to run the policy for T timesteps and then use the collected samples to update. The advantage at timestep t (the discounted return minus a value-function baseline) is \(\hat{A}_t = r_t + \gamma r_{t+1} + ... + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V(s_T) - V(s_t)\), where \(\gamma\) is the discount factor and \(V\) is the learned value function.
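
A plain-Python sketch of this estimator, where `rewards` holds the T collected rewards and `values` holds the corresponding value estimates plus the bootstrap value \(V(s_T)\):

```python
def advantage_estimates(rewards, values, gamma=0.99):
    """n-step advantage: discounted rewards-to-go plus a bootstrap value,
    minus the value baseline V(s_t).

    rewards: [r_0, ..., r_{T-1}] collected over T timesteps
    values:  [V(s_0), ..., V(s_T)] including the final bootstrap state
    """
    T = len(rewards)
    advantages = [0.0] * T
    for t in range(T):
        ret = values[T] * gamma ** (T - t)          # bootstrap term gamma^{T-t} V(s_T)
        for k in range(t, T):
            ret += gamma ** (k - t) * rewards[k]    # discounted rewards-to-go
        advantages[t] = ret - values[t]             # subtract the baseline V(s_t)
    return advantages
```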

PPO can be run with N parallel actors, each collecting T timesteps of data; the loss is then computed on the collected batch and optimized with minibatch SGD (or Adam) for K epochs, as follows:

for iteration = 1, 2, ... do

  • for actor = 1, 2, ..., N do
    • Run policy \(\pi_{\theta_{old}}\) in the environment for T timesteps
    • Compute advantage estimates \(\hat{A}_1, ..., \hat{A}_T\)
  • end for
  • Optimize L with respect to \(\theta\), for K epochs
  • \(\theta_{old} \leftarrow \theta\)

end for
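
A rough Python sketch of this loop; the three callables are placeholders for the pieces described above (rollout collection, advantage estimation, and minibatch optimization of the clipped loss), not a specific library API:

```python
def train_ppo(collect_rollout, compute_advantages, update_policy,
              num_iterations, N, T, K):
    """High-level PPO loop: N actors, T timesteps each, K epochs of updates.

    collect_rollout(actor_id, T)   -> dict of states, actions, rewards, log_probs
    compute_advantages(rollout)    -> advantage estimates A_1..A_T for that rollout
    update_policy(batch, epochs)   -> minibatch SGD/Adam on the clipped loss
    """
    for iteration in range(num_iterations):
        batch = []
        for actor in range(N):
            rollout = collect_rollout(actor, T)          # run pi_theta_old for T steps
            rollout["advantages"] = compute_advantages(rollout)
            batch.append(rollout)
        update_policy(batch, epochs=K)                   # theta_old <- theta afterwards
```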

Experiments are then run with different versions of the loss function: without clipping or penalty, \(L_t(\theta) = r_t(\theta) \hat{A}_t\); with clipping, \(L_t(\theta) = min(r_t(\theta) \hat{A}_t, clip(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)\); and with a KL penalty, \(L_t(\theta) = r_t(\theta) \hat{A}_t - \beta KL {[\pi_{\theta_{old}}, \pi_{\theta}]}\).

In some tasks, PPO almost matches ACER (actor critic with experience replay), a far more complex method. It also retains some of the benefits of TRPO (trust region policy optimization) while being much simpler to implement.

In conclusion, PPO offers a balance between sample complexity, ease of implementation, computational cost, and performance, which makes it a popular choice for many reinforcement learning tasks. Like all reinforcement learning algorithms, it is not a silver bullet and may not work well for all types of problems, but it has been demonstrated to be very effective in a wide range of applications.

Model-based RL

So far we have looked at model-free RL, an approach in which the agent learns to make decisions purely from its experiences, without relying on an explicit model of the environment’s dynamics. Instead, it learns either a value function or a policy directly from the experience samples.

There are two main types of model-free reinforcement learning methods that we have already explored:

  • Value-based methods, such as Q-learning or Deep Q-Networks (DQN), involve learning a value function, which estimates how good a particular state or action is in terms of expected future rewards. The policy is then derived from the value function, usually by choosing the action that maximizes the value in each state.
  • Policy-based methods, such as REINFORCE or Proximal Policy Optimization (PPO), directly parameterize and optimize the policy without using a value function as an intermediary. The policy is a direct mapping from states to actions or to a distribution over actions.

Model-free methods are typically more straightforward to implement; however, they often require many more interactions with the environment to learn an effective policy. This is because they cannot use a model to “think ahead” and must instead learn from actual experiences.

Now let’s consider model-based RL. Model-based RL is a type of RL in which the agent learns a model of the environment’s dynamics and uses this model to decide what action to take. The model is a structured learning tool through which the agent sees the world. The model can predict the future, and the agent collects data to improve the model and, consequently, to improve future actions. In other words, given a prediction model, the agent can predict the next state and reward from the current state of the environment and an action in that state, since the model captures the dynamics of the environment. This amounts to planning: simulating actions with the model to evaluate their outcomes.

Here’s how a model-based reinforcement learning process might work. The first phase is the model learning phase: the agent interacts with the environment and collects data about states, actions, and rewards. This data is used to update the agent’s model of the world.

Technically speaking, at time t we denote the state \(s_t\) and action \(a_t\), with reward \(r(s_t, a_t)\). The agent is in a world with state transition distribution \(p(s_{t+1} \mid s_t, a_t)\). The parametric model \(f_{\theta}\) approximates this distribution with \(p_{\theta}(s_{t+1} \mid s_t, a_t)\). To leverage the model, the agent collects a dataset of states, actions, and next states \(D = \{(s_n, a_n, s'_n)\}\). With this dataset, the agent learns the environment by fitting a neural network that represents \(f_{\theta}\).
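
As a rough sketch of this model-fitting step (a deterministic next-state predictor in PyTorch for simplicity; the dimensions and dataset tensors are assumptions for illustration):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """f_theta: predicts the next state from the current state and action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def fit_model(model, states, actions, next_states, epochs=100, lr=1e-3):
    """Fit f_theta on the collected dataset D = {(s_n, a_n, s'_n)} by regression."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(states, actions)
        loss = ((pred - next_states) ** 2).mean()   # mean squared error
        opt.zero_grad()
        loss.backward()
        opt.step()
```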

This is essentially what the sketch above does: a neural network serves as the model and learns to predict the next state (and, if desired, the reward) given the current state and action. The second phase is the planning phase: the agent simulates different action trajectories with the model, starting from the current state. This is called a plan; it is like a search in which we look ahead into the possibilities.


There are two ways of looking ahead. One is the one-step dynamics model, in which the agent looks ahead a single step; the other is the recurrent dynamics model, in which the agent looks ahead several steps. In the one-step dynamics model, given the state \(s_t\) and the action \(a_t\), the next state is \(s_{t+1} = f_{\theta}(s_t, a_t)\). To predict a trajectory, the dynamics model is applied recursively into the future. However, this leads to the issue of compounding error: each predicted state carries an error \(\epsilon_t = \mid\mid \hat{s}_t - s_t \mid\mid\), and these errors accumulate through the recursion, which renders long-term predictions inaccurate. In the recurrent dynamics model, the model predicts the next state \(s_{t+1}\) from the current hidden state of the neural network. The n-step-ahead prediction is \(s_{t+n} = f_{\theta} (s_t, n, \theta_{\pi})\) with control parameters \(\theta_{\pi}\). This long-term prediction ability is data efficient and computationally efficient for the planning phase, and it has other desirable properties: it can capture the uncertainty of trajectories correctly and can be used for both discrete and continuous time.
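
A sketch of the one-step model applied recursively to predict a trajectory, which is exactly where the compounding error appears, since each predicted state is fed back in as the next input:

```python
import torch

def rollout_model(model, s0, actions):
    """Recursively apply a one-step dynamics model: s_{t+1} = f_theta(s_t, a_t).

    Each prediction is fed back in as the next input, so any per-step
    error accumulates over the length of the predicted trajectory.
    """
    states = [s0]
    s = s0
    for a in actions:
        s = model(s, a)        # predicted state replaces the true state
        states.append(s)
    return torch.stack(states)
```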

The agent can then use different algorithms to search for the optimal action sequence, such as Monte Carlo tree search or something simple like random shooting. The third phase is action selection: the agent selects the first action from the plan that maximizes expected future reward. Dynamic programming techniques such as value iteration or policy iteration can also be used to compute the optimal value function or policy. The last phase is to use the outcomes of those actions to improve the overall policy: if an action results in a better-than-expected outcome, the propensity to choose that action in the future increases.
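
As an illustration of the simplest planner mentioned above, a random-shooting sketch might look like this (the `reward_fn`, the action range, and the shapes are assumptions for illustration):

```python
import torch

def random_shooting(model, reward_fn, s0, horizon, num_candidates, action_dim):
    """Pick the first action of the best of `num_candidates` random plans."""
    best_return, best_first_action = -float("inf"), None
    for _ in range(num_candidates):
        actions = torch.rand(horizon, action_dim) * 2 - 1   # random actions in [-1, 1]
        s, total = s0, 0.0
        for a in actions:
            total += reward_fn(s, a)      # score each step under the learned model
            s = model(s, a)               # simulate forward with f_theta
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action              # execute only the first action, then replan
```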

Conclusion

In this post, we have finished looking at a model-free method (PPO) and taken a brief look at model-based methods in RL. We can see that in model-based methods the planning and thinking process is quite logical. It is also efficient, since the model is used to think ahead and evaluate the consequences of different actions before taking them. This process often requires fewer interactions with the environment than model-free methods, which makes model-based RL attractive when interacting with the environment is costly. On the other hand, if the learned model is inaccurate or incomplete, the resulting decisions might be suboptimal or even harmful, and handling this is tricky.

In conclusion, model-based RL is a potent approach in the field of RL, allowing agents to construct internal models of the world in order to make more informed decisions. It thinks through the possibilities before acting, but it needs the model to be reasonably accurate, and planning in a complex environment with a large state-action space can be computationally demanding. Despite all of this, model-based RL has applications in robotics, autonomous vehicles, game playing, and other domains.