Inverse Reinforcement Learning

My notes on inverse reinforcement learning, with a collection of resources.

January 15, 2017 - 5 minute read -

The objective of this post is to give a concise summary of IRL problems and algorithms and to collect a set of resources in this field. This post assumes basic knowledge of Markov Decision Processes (MDPs) and Reinforcement Learning (RL). I will keep updating it as I learn. I hope this post helps those who are also interested in IRL.

Intro

Good introductory slides: Pieter Abbeel IRL slides (UCB cs287 Advanced Robotics)

Motivation

• Modeling animal and human behavior
• Inverse Reinforcement Learning (IRL) is useful for apprenticeship learning
• Modeling of other agents, both adversarial and cooperative

Reinforcement Learning (RL)

RL optimizes the overall cumulative reward of the agent given a reward function:

$$\pi^* = \arg\max_{\pi} \, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi\right]$$

where $\pi$ is the policy to be learned, $R(s_t)$ is the reward at state $s_t$, and $\gamma$ is the discount factor.

Inverse Reinforcement Learning (IRL)

However, the reward function is not always explicitly given. The goal of IRL is to learn a reward function $R^*$ that explains the expert behaviour, i.e., one under which the expert policy is optimal:

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\middle|\, \pi^*\right] \geq \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\middle|\, \pi\right] \quad \forall \pi$$

where $\pi^*$ is the optimal expert policy, $R^*(s_t)$ is the reward at state $s_t$, and $\gamma$ is the discount factor.

The problem is to infer the reward function $R^*$ given the dynamics model and the optimal policy $\pi^*$, or, more generally, given only the expert trajectories.

IRL vs Supervised Behavioral Cloning

Behavioral cloning follows the standard supervised learning approach: it learns a mapping from states to actions, i.e., it tries to learn the optimal policy $\pi^*$ directly. IRL, in contrast, learns the reward function $R^*$.

The reward function $R^*$ is often much more succinct than the optimal policy, especially in planning-oriented tasks.
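As a concrete contrast, behavioral cloning reduces to ordinary supervised learning on expert state-action pairs. A minimal tabular sketch (the states, actions, and majority-vote rule here are illustrative assumptions, not taken from any of the cited papers):

```python
from collections import Counter, defaultdict

def behavioral_cloning(demos):
    """Tabular behavioral cloning: for each state, predict the action
    the expert took most often (supervised learning of s -> a)."""
    counts = defaultdict(Counter)
    for s, a in demos:
        counts[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Hypothetical expert demonstrations as (state, action) pairs.
demos = [(0, "right"), (0, "right"), (0, "left"), (1, "up")]
policy = behavioral_cloning(demos)  # {0: "right", 1: "up"}
```

Note that such a policy says nothing about states never visited by the expert, which is one reason a succinct reward function can generalize better.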

IRL Challenges

• $R=0$ (or any constant reward) is a degenerate solution under which every policy is optimal.
• We only have access to expert traces rather than the expert policy $\pi^*$.
• The expert is not always optimal.
• It is computationally challenging: naive formulations enumerate all policies.

Feature-based Reward Function and Max Margin Methods

Abbeel and Ng, 2004, “Apprenticeship learning via inverse reinforcement learning.”

Let $R(s) = w^T \phi(s)$, where $w \in \mathbf{R}^n$ and $\phi: S \rightarrow \mathbf{R}^n$ is a feature map. The value of a policy $\pi$ is then linear in the feature expectations:

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi\right] = w^T \, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\middle|\, \pi\right] = w^T \mu(\pi)$$

where $\mu(\pi)$ is the expected cumulative discounted sum of feature values, or “feature expectations”.
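The feature expectations $\mu(\pi)$ can be estimated by Monte Carlo from sampled trajectories. A minimal sketch (the one-hot feature map and toy integer-state trajectories are illustrative assumptions):

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t)]
    from sampled trajectories (each a list of visited states)."""
    mu = np.zeros_like(phi(trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

# Toy example: 3 states, phi is a one-hot state indicator.
phi = lambda s: np.eye(3)[s]
trajs = [[0, 1, 2], [0, 2, 2]]
mu = feature_expectations(trajs, phi, gamma=0.5)  # -> [1.0, 0.25, 0.5]
```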

This can be formulated as a max margin optimization problem:

$$\max_{m, \, w:\ \|w\|_2 \leq 1} \; m \quad \text{s.t.} \quad w^T \mu(\pi^*) \geq w^T \mu(\pi) + m \quad \forall \pi$$

where $\pi^*$ is the optimal expert policy. The computational burden of the large number of constraints can be addressed by constraint generation.
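A simplified numerical sketch of the max margin idea: instead of the full QP with constraint generation, the snippet below runs projected subgradient ascent on the worst-case margin $\min_i w^T(\mu(\pi^*) - \mu(\pi_i))$ over a fixed candidate set (the candidate feature expectations and step size are illustrative assumptions):

```python
import numpy as np

def max_margin_weights(mu_expert, mu_candidates, iters=200, lr=0.1):
    """Projected subgradient ascent on min_i w^T (mu_expert - mu_i),
    keeping ||w||_2 <= 1. A toy stand-in for the max-margin QP of
    Abbeel and Ng (2004)."""
    diffs = np.array([mu_expert - mu for mu in mu_candidates])
    w = np.ones(len(mu_expert)) / np.sqrt(len(mu_expert))
    for _ in range(iters):
        i = np.argmin(diffs @ w)          # most violated constraint
        w = w + lr * diffs[i]             # push its margin up
        w /= max(1.0, np.linalg.norm(w))  # project onto the unit ball
    return w

mu_e = np.array([1.0, 0.0])
mu_cands = [np.array([0.0, 1.0]), np.array([0.5, 0.5])]
w = max_margin_weights(mu_e, mu_cands)  # roughly [0.71, -0.71]
```

The learned $w$ rewards the feature the expert visits and penalizes the one it avoids, which is exactly the separating direction the max margin program seeks.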

See also structured max margin with slack variables (Ratliff, Zinkevich, and Bagnell, 2006):

$$\min_w \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad w^T \mu(\pi_i^*) \geq w^T \mu(\pi) + m(\pi_i^*, \pi) - \xi_i \quad \forall i, \forall \pi$$

Maximum Entropy Inverse Reinforcement Learning

Max-Ent IRL addresses the suboptimality of expert trajectories by employing “the principle of maximum entropy” (Jaynes, 1957):

Subject to precisely stated prior data (such as a proposition that expresses testable information), the probability distribution which best represents the current state of knowledge is the one with largest entropy.

I will follow the notation in this paper and express feature expectation matching as

$$\sum_{i} P(\zeta_i) f_{\zeta_i} = \tilde{f}$$

where $\{\zeta_i\}$ are trajectories, $P(\zeta_i)$ is the probability of trajectory $\zeta_i$, $\tilde{f}$ is the empirical feature expectation of the expert demonstrations, and $f_{\zeta} = \sum_{s_j \in \zeta} f_{s_j}$. With a discount factor $\gamma$, $f_{\zeta} = \sum_{s_j \in \zeta} \gamma^j f_{s_j}$.

The reward of a single path is $\theta^T f_{\zeta}$, where $\theta$ are the reward weights. The maximum-entropy distribution over trajectories is then

$$P(\zeta \mid \theta) = \frac{1}{Z(\theta)} \, e^{\theta^T f_{\zeta}}$$

Under the Max-Ent model, paths with equivalent rewards have equal probabilities, and paths with higher rewards are exponentially more preferred.
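On a small enumerable set of trajectories this distribution can be computed directly. A sketch (real implementations compute the partition function by dynamic programming over the MDP rather than by enumerating trajectories; the feature counts below are illustrative assumptions):

```python
import numpy as np

def maxent_trajectory_probs(theta, features):
    """P(zeta_i) proportional to exp(theta^T f_{zeta_i}) over an
    enumerated set of trajectories, as in Ziebart et al. (2008).
    `features` has one row of feature counts f_zeta per trajectory."""
    rewards = features @ theta      # reward theta^T f_zeta per trajectory
    rewards -= rewards.max()        # shift for numerical stability
    p = np.exp(rewards)
    return p / p.sum()

# Trajectories 0 and 1 have identical feature counts -> equal probability;
# trajectory 2 has reward higher by 1 -> exactly e times more likely.
f = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
theta = np.array([1.0, 2.0])
p = maxent_trajectory_probs(theta, f)
```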

Continuous Inverse Optimal Control

Levine et al. 2012, CIOC paper

From Max-Ent IRL we can model the probability of an action sequence $\vec{a} = [a_1, ..., a_H]$ from trajectory $\zeta = [(s_1, a_1), ..., (s_H, a_H)]$ as:

$$P(\vec{a} \mid \vec{s}) = \frac{e^{R(\vec{s}, \vec{a})}}{\int e^{R(\vec{s}, \vec{\tilde{a}})} \, d\vec{\tilde{a}}}$$

Approximate $R(\vec{s}, \vec{a}) = \sum_t R(s_t, a_t)$ using a second-order Taylor expansion in the actions around the demonstrated trajectory $\zeta$:

$$R(\vec{s}, \vec{\tilde{a}}) \approx R(\vec{s}, \vec{a}) + (\vec{\tilde{a}} - \vec{a})^T g + \frac{1}{2} (\vec{\tilde{a}} - \vec{a})^T H \, (\vec{\tilde{a}} - \vec{a})$$

where $g$ and $H$ are the gradient and Hessian of $R$ with respect to the actions. The normalizing integral then becomes Gaussian, giving the approximate probability:

$$P(\vec{a} \mid \vec{s}) \approx e^{\frac{1}{2} g^T H^{-1} g} \, |-H|^{\frac{1}{2}} \, (2\pi)^{-\frac{d}{2}}$$

and the approximate log likelihood:

$$\mathcal{L} = \frac{1}{2} g^T H^{-1} g + \frac{1}{2} \log |-H| - \frac{d}{2} \log 2\pi$$

Finally, use a numerical optimization algorithm such as L-BFGS to find a local optimum of $\mathcal{L}$ with respect to the reward parameters.
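The approximate log likelihood is easy to evaluate once $g$ and $H$ are available. A sketch under the assumption that $H$ is negative definite (in the paper, $g$ and $H$ come from differentiating the parameterized reward along the demonstrated actions; here they are simply inputs):

```python
import numpy as np

def cioc_log_likelihood(g, H):
    """Gaussian (Laplace) approximation of the demonstration log
    likelihood: 0.5 g^T H^-1 g + 0.5 log|-H| - (d/2) log(2 pi),
    where g, H are the gradient and Hessian of the total reward with
    respect to the action sequence; H must be negative definite."""
    d = g.shape[0]
    sign, logdet = np.linalg.slogdet(-H)
    assert sign > 0, "H must be negative definite"
    return (0.5 * g @ np.linalg.solve(H, g)
            + 0.5 * logdet
            - 0.5 * d * np.log(2.0 * np.pi))
```

With $g = 0$ (a locally optimal demonstration) and $H = -I$, this reduces to the log density of a standard Gaussian at its mean; a nonzero $g$ only lowers the likelihood, which is how the model tolerates suboptimal demonstrations.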

Papers (Chronological Order)

• Wulfmeier, Markus, Peter Ondruska, and Ingmar Posner. “Maximum entropy deep inverse reinforcement learning.” arXiv preprint arXiv:1507.04888 (2015).
• Levine, Sergey, and Vladlen Koltun. “Continuous inverse optimal control with locally optimal examples.” arXiv preprint arXiv:1206.4617 (2012).
• Ziebart, Brian D., et al. “Maximum Entropy Inverse Reinforcement Learning.” AAAI. 2008.
• Abbeel, Pieter, et al. “Apprenticeship learning for motion planning with application to parking lot navigation.” 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2008.
• Ratliff, Nathan D., J. Andrew Bagnell, and Martin A. Zinkevich. “Maximum margin planning.” Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
• Abbeel, Pieter, and Andrew Y. Ng. “Apprenticeship learning via inverse reinforcement learning.” Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
• Ng, Andrew Y., and Stuart J. Russell. “Algorithms for inverse reinforcement learning.” ICML. 2000.
• Pieter Abbeel IRL slides (UCB cs287 Advanced Robotics)
• CMU 15889e Real Life Reinforcement Learning
• RL Course by David Silver on Youtube
• UCB CS294: Deep Reinforcement Learning