Notes on Trust Region Policy Optimization

Introduction

Reinforcement learning has caught some attention in both academia and industry recently. Trust Region Policy Optimization (TRPO) is one of the notable fancy RL algorithms, developed by Schulman et al, that has nice theoretical monotonic improvement guarantee. The goal of this post is to give a brief and intuitive summary of the TRPO algorithm. This post assumes that readers have a basic understanding of reinforcement learning algorithms and policy gradient methods.

Ideas

The two main ideas behind TRPO, from my personal view, are MM algorithms and the Trust Region.

MM algorithms

The main idea of MM (minorization-maximization) algorithms is that, intuitively, for a maximization problem, we first find a approximated lower bound of the original objective as the surrogate objective and then maximize the approximated lower bound so as to optimize the original objective. (And vice versa for minimization problems) Widely known Expectation-Maximization (EM) algorithm is a subclass of MM algorithms.

In TRPO, Schulman et al developed a surrogate loss based on Kakade et al 2001 and Kakade & Langford 2002. The surrogate loss in TRPO is a lower bound of the original objective – the expected cumulative return of the policy.

Trust Region Methods

As described in Nocedal & Wright’s Numerical Optimization, “Trust-region methods define a region around the current iterative within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this region”. Intuitively, during our optimization procedure, after we decided the gradient direction, when doing line search we want to constrain our step length to be within a “trust region” so that the local estimation of the gradient/curvature remains to be “trusted”.

In TRPO, Schulman et al used KL divergence between the old policy and updated policy as a measurement for trust region.

TRPO

After introducing the two main ideas, let’s dive into the algorithm itself. More math coming, keep on your seat belt!

Notations

An MDP is a tuple $(S, A, \{P_{sa}\}, \gamma, R, \rho_0)$

$S$ is a finite set of $N$ states.
$A = \{a_1, .., a_k\}$ is a set of $k$ actions.
$P_{sa}(s')$ is the state transition probability of landing at state $s'$ : $P(s, a, s')$ upon taking the action $a$ at state $s$ .
$\gamma \in [0, 1)$ is the discount factor.
$R: S \rightarrow \mathbf{R}$ is the reward function.
$\rho_0: S \rightarrow \mathbf{R}$ is the state distribution of the initial state $s_0$
$\rho_{\pi}: S \rightarrow \mathbf{R}$ is the discounted visitation frequences,

$\rho_\pi (s) = Pr[s_0 = s] + \gamma Pr[s_1 = s] + \gamma^2 Pr[s_2 = s] + ...$

$\eta (\pi) = \mathbf{E}_{s_0, a_0, ..} [\sum_{t=0}^\infty \gamma^t r (s_t)]$ is the expected discounted cumulative reward of policy $\pi$ . Where

$s_0 \sim \rho_0(s_0), a_t \sim \pi(a_t | s_t), s_{t+1} \sim P(s_{t+1}| s_t, a_t)$

$Q_{\pi} (s_t, a_t) = \mathbf{E}_{s_{t+1}, a_{t+1}, ...}[\sum_{l=0}^\infty \gamma ^l r(s_{t+l})]$ is the action-value function
$V_{\pi} (s_t) = \mathbf{E}_{a_t, s_{t+1}, ...} [\sum_{l=0}^\infty \gamma^l r(s_{t+l})]$ is the value function
$A_{\pi}(s,a) = Q_{\pi}(s, a) - V_{\pi}(s)$ is the advantage function. Where

$a_t \sim \pi (a_t | s_t), s_{t+1}\sim P(s_{t+1} | s_t, a_t), \text{ for } t \geq 0$

Derivations

Hooo .., the notations seem long but if you look into them, you would be able to recognize that they are quite basic in RL. There comes the main part of the algorithm, time to pay attention!

Here is the important identity proved by Kakade & Langford 2002:

$\begin{align} \eta(\pi) =& \eta(\pi_{0}) + \mathbf{E}_{s_0, a_0, ... \sim \pi}[\sum_{t=0}^{\infty} A_{\pi_0} (s_t, a_t)] \\ =& \eta(\pi_{0}) + \sum_{t=0}^\infty \sum_{s} P(s_t = s | \pi) \sum_a \pi (a|s) \gamma ^t A_{\pi_0}(s,a) \\ =& \eta(\pi_{0}) + \sum_{s} \sum_{t=0}^\infty P(s_t = s | \pi) \sum_a \pi (a|s) \gamma ^t A_{\pi_0}(s,a) \\ =& \eta(\pi_{0}) + \sum_{s} \rho_{\pi}(s) \sum_a \pi (a|s) A_{\pi_0}(s,a) \\ \eta(\pi) =& \eta(\pi_{0}) + \mathbf{E}_{\rho_{\pi}} \mathbf{E}_{a\sim \pi(s)}[A_{\pi_0}(s,a)] \end{align}$

Where $\pi_0$ is the old policy and $\pi$ is the new policy. Note that we have the current policy $\pi_0$ but we don’t have $\pi$ yet, therefore, $\rho_{\pi}$ is hard to obtain. Instead, Schulman et al used $\rho_{\pi_0}$ as an approximation to $\rho_{\pi}$ :

$\eta(\pi) \approx \eta(\pi_{0}) + \mathbf{E}_{\rho_{\pi_0}} \mathbf{E}_{a\sim \pi(s)}[A_{\pi_0}(s,a)]$

We then define the following as the objective function,

$L_{\pi_0}(\pi) = \eta(\pi_{0}) + \mathbf{E}_{\rho_{\pi_0}} \mathbf{E}_{a\sim \pi(s)}[A_{\pi_0}(s,a)]$

Now is the time when the MM algorithm and trust region come in. Let $\pi' = argmax_{\pi'} L_{\pi_0} (\pi')$ . If we define the new policy as the following mixture:

$\pi(s) = (1-\alpha) \pi_0(s) + \alpha \pi'(s)$

Kakade & Langford 2002 proved that,

$\eta(\pi) \geq L_{\pi_0}(\pi) - \frac{2 \epsilon \gamma}{(1-\gamma(1-\alpha)) (1-\gamma)}\alpha^2$

Where,

$\epsilon = max_{s} |\mathbf{E}_{a\sim \pi' (s) [A_{\pi_0} (s,a)]}|$

With this bound (r.h.s. of the inequality), we can constraint the update to be within some trust region.

Based on this bound, Schulman et al proved the following simpler bound involving KL-divergence between the new policy and the old policy:

$\eta(\pi) \geq L_{\pi_0}(\pi) - C max_{s} D_{KL}(\pi_0(s) \| \pi(s))$

Where $C = \frac{2\epsilon \gamma}{(1-\gamma)^2}$

Unfortunately, computing the maximum-KL divergence term over the whole state space is intractable. Schulman et al proposed to use mean-KL divergence over state space as an approximation so that we can estimation it by

$\bar {D}_{KL} (\pi_0 \| \pi) = \mathbf{E}_{s \sim \rho_{\pi_0}}[D_{KL}(\pi_0(s) \| \pi(s))]$

Then, we’ve arrived at the TRPO optimization problem

$\text{maximize}_{\theta} [L_{\theta_0}(\theta) - C \bar{D}_{KL}(\pi_0 \| \pi)]$

In Practice

Finally, in practice, Schulman suggests that we can choose one of the following variants of the algorithm:

directly use first order optimization methods to optimize the objective. (Known as Proximate Policy Optimization )
At each iteration, approximate the objective by first order approximation to $L$ and second order approximation to $\bar{D}_{KL}(\pi_0 \| \pi)$ and then use second order methods like conjugate gradient to approximate the gradient direction $F^{-1}g$ , where, $F$ is the second order derivative of the KL-divergence or known as the Fisher Information Matrix (FIM).
Place hard constraint on the KL-divergence (trust region). We can still use conjugate gradient to solve the following formulation

$\begin{align} & \text{maximize}_{\theta}& L_{\theta_0} (\theta) \\ & \text{subject to}& \bar{D}_{KL} (\pi_{\theta_0} \| \pi_\theta) \leq \delta \end{align}$

In terms of conjugate gradient, here are two introductory articles about second order methods. For an implementation of the conjugate gradient, l-bfgs, etc, please see my github repo.