The objective of this post is to give a concise summary of IRL problems and algorithms, and to collect a set of resources in this field. This post assumes basic knowledge of Markov Decision Processes (MDPs) and Reinforcement Learning (RL). I will keep updating this as I learn. I hope this post can help those who are also interested in IRL.

## Table of Contents

- Intro
- Feature-based Reward Function and Max Margin Methods
- Maximum Entropy Inverse Reinforcement Learning
- Continuous Inverse Optimal Control
- Papers (Chronological Order)
- Useful Links

## Intro

Good introductory slides: Pieter Abbeel IRL slides (UCB cs287 Advanced Robotics)

#### Motivation

- Modeling animal and human behavior
- Inverse Reinforcement Learning (IRL) is useful for apprenticeship learning
- Modeling of other agents, both adversarial and cooperative

#### Reinforcement Learning (RL)

RL optimizes the overall cumulative reward of the agent given a reward function:

$$\max_\pi \; E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi \right]$$

where $\pi$ is the policy to be learned, $R(s_t)$ is the reward on state $s_t$, and $\gamma \in [0, 1)$ is the discount factor.
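As a minimal illustration of the objective, here is a short Python sketch (function name hypothetical) that evaluates the discounted return of a single trajectory of state rewards:

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_t for one trajectory of per-step rewards."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))  # [1, gamma, gamma^2, ...]
    return float(np.dot(discounts, rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

RL searches for the policy whose trajectories maximize the expectation of this quantity.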

#### Inverse Reinforcement Learning (IRL)

However, the reward function is not always explicitly given. The goal of IRL is to learn a reward function that explains the expert behaviour:

$$\text{find } R \text{ such that } \pi^* = \arg\max_\pi \; E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi \right]$$

where $\pi^*$ is the optimal expert policy, $R(s_t)$ is the reward on state $s_t$, and $\gamma$ is the discount factor.

The problem is inferring the reward function $R$ given the dynamics model $T(s' \mid s, a)$ and the optimal policy $\pi^*$, or, more generally, given only the expert trajectories.

#### IRL vs Supervised Behavioral Cloning

Behavioral cloning follows the standard supervised learning approach, mapping from states to actions. Behavioral cloning tries to learn the optimal policy $\pi^*$ directly. IRL, however, learns the reward function $R$.

The reward function is often much more succinct than the optimal policy, especially in planning-oriented tasks.
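A toy sketch of behavioral cloning under this view (all names and data hypothetical): the policy is just a supervised map from states to demonstrated actions, here a 1-nearest-neighbor classifier over 1-D states:

```python
import numpy as np

# Hypothetical expert demonstrations: states and the actions taken in them.
expert_states = np.array([[0.0], [1.0], [2.0]])
expert_actions = np.array([0, 1, 1])

def cloned_policy(state):
    # 1-nearest-neighbor: copy the action of the closest demonstrated state.
    idx = np.argmin(np.abs(expert_states[:, 0] - state))
    return int(expert_actions[idx])

print(cloned_policy(0.2))  # closest demo state is 0.0, so action 0
```

Note that this policy says nothing about *why* the expert acted this way; it generalizes poorly off the demonstrated states, which is precisely the gap IRL's recovered reward is meant to fill.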

#### IRL Challenges

- $R = 0$ (and any constant reward) is a degenerate solution: it makes every policy optimal.
- We only have access to expert traces $\{\tau_i\}$ rather than the expert policy $\pi^*$.
- The expert is not always optimal.
- Computationally challenging — naive formulations enumerate all policies.

## Feature-based Reward Function and Max Margin Methods

Abbeel and Ng, 2004, “Apprenticeship learning via inverse reinforcement learning.”

Let $R(s) = w \cdot \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$ is a feature map.

Then the expected return of a policy $\pi$ is linear in the features:

$$E\left[\sum_t \gamma^t R(s_t) \,\middle|\, \pi\right] = w \cdot \mu(\pi), \qquad \mu(\pi) = E\left[\sum_t \gamma^t \phi(s_t) \,\middle|\, \pi\right]$$

where $\mu(\pi)$ is the expected cumulative discounted sum of feature values, or “feature expectations”.
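Feature expectations can be estimated by Monte Carlo from sampled trajectories. A sketch (function name and the indicator features are hypothetical, not from the paper):

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t)]
    from sampled trajectories (each a list of states)."""
    mu = None
    for traj in trajectories:
        feats = np.array([phi(s) for s in traj])   # shape (T, n)
        discounts = gamma ** np.arange(len(traj))  # shape (T,)
        total = discounts @ feats                  # discounted feature sum, (n,)
        mu = total if mu is None else mu + total
    return mu / len(trajectories)

# Hypothetical one-hot indicator features over 2 discrete states.
phi = lambda s: np.eye(2)[s]
trajs = [[0, 1], [0, 0]]
print(feature_expectations(trajs, phi, gamma=0.5))  # [1.25, 0.25]
```

The expert's $\mu(\pi^*)$ is estimated the same way from the demonstration traces, which is all the max-margin formulation below needs.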

This problem can be formulated as a max-margin optimization:

$$\max_{w : \|w\|_2 \le 1} \; m \quad \text{s.t.} \quad w \cdot \mu(\pi^*) \ge w \cdot \mu(\pi) + m \;\; \forall \pi$$

where $\pi^*$ is the optimal expert policy. The computational problem caused by the large number of constraints can be solved by **constraint generation**: start with a small constraint set and iteratively add the constraint for the current best-responding policy.

See also structured max margin with slack variables: Ratliff, Bagnell, and Zinkevich, 2006, “Maximum margin planning.”

## Maximum Entropy Inverse Reinforcement Learning

Max-entropy IRL helps address the suboptimality of expert trajectories by employing “the principle of maximum entropy” (Jaynes, 1957):

Subject to precisely stated prior data (such as a proposition that expresses testable information), the probability distribution which best represents the current state of knowledge is the one with largest entropy.

I will follow the notation in this paper (Ziebart et al., 2008) and express the feature expectation as

$$\tilde{f} = \sum_{\tau} P(\tau) \, f_\tau$$

where $\tau$ are trajectories, $P(\tau)$ is the probability of trajectory $\tau$, and $f_\tau = \sum_{s \in \tau} f_s$. Or, if we use a discount factor $\gamma$, $f_\tau = \sum_t \gamma^t f_{s_t}$.

The reward for a single path is $\text{reward}(\tau) = \theta^\top f_\tau$, where $\theta$ are the weights.

Under the Max-Ent model, plans with equivalent rewards have equal probabilities, and plans with higher rewards are exponentially more preferred.
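This exponential preference can be sketched numerically by normalizing exponentiated path rewards over a small, enumerable set of candidate paths (function name hypothetical):

```python
import numpy as np

def maxent_path_probs(feature_counts, theta):
    """P(tau) proportional to exp(theta . f_tau), normalized over a
    small, enumerable set of candidate paths (rows of feature_counts)."""
    scores = feature_counts @ theta
    scores = scores - scores.max()  # subtract max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# Two hypothetical paths whose rewards differ by exactly 1 under theta:
f = np.array([[1.0, 0.0], [0.0, 0.0]])
theta = np.array([1.0, 0.0])
p = maxent_path_probs(f, theta)
print(p[0] / p[1])  # the higher-reward path is e^1 times more likely
```

A reward difference of $\Delta$ yields a probability ratio of $e^\Delta$ — equal rewards give equal probabilities, exactly as stated above.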

#### Deterministic Path Distributions

Under deterministic dynamics, the Max-Ent distribution over paths is

$$P(\tau \mid \theta) = \frac{e^{\theta^\top f_\tau}}{Z(\theta)}$$

where $Z(\theta) = \sum_{\tau} e^{\theta^\top f_\tau}$ is the partition function.

#### Non-Deterministic Path Distributions

With stochastic dynamics $T$, each path is additionally weighted by the probability of its transitions:

$$P(\tau \mid \theta, T) \approx \frac{e^{\theta^\top f_\tau}}{Z(\theta, T)} \prod_{s_{t+1}, a_t, s_t \in \tau} P(s_{t+1} \mid s_t, a_t)$$

## Continuous Inverse Optimal Control

Levine and Koltun, 2012, “Continuous inverse optimal control with locally optimal examples.”

From Max-Ent IRL we can model the probability of an action sequence $u$ given states $x$ as:

$$P(u \mid x) = \frac{e^{r(u)}}{\int e^{r(\tilde{u})} \, d\tilde{u}}$$

Approximate $r$ using a second-order Taylor expansion around the observed action sequence $u$:

$$r(\tilde{u}) \approx r(u) + (\tilde{u} - u)^\top g + \frac{1}{2} (\tilde{u} - u)^\top H \, (\tilde{u} - u)$$

with gradient and Hessian:

$$g = \frac{\partial r}{\partial u}, \qquad H = \frac{\partial^2 r}{\partial u^2}$$

Then the probability is approximated by a Gaussian integral (assuming $H$ is negative definite):

$$P(u \mid x) \approx e^{\frac{1}{2} g^\top H^{-1} g} \, |-H|^{\frac{1}{2}} \, (2\pi)^{-\frac{d_u}{2}}$$

The approximate log-likelihood:

$$\log P(u \mid x) \approx \frac{1}{2} g^\top H^{-1} g + \frac{1}{2} \log |-H| - \frac{d_u}{2} \log 2\pi$$

Finally, use a numerical optimization algorithm such as L-BFGS to find a local optimum of this likelihood with respect to the reward parameters.
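Under the same Laplace-style approximation (second-order expansion of the reward, $H$ negative definite), the per-example log-likelihood can be sketched as follows (function name hypothetical):

```python
import numpy as np

def cioc_approx_loglik(g, H):
    """Second-order approximation of log P(u | x):
    0.5 * g^T H^{-1} g + 0.5 * log|-H| - (d_u / 2) * log(2*pi),
    assuming H (the reward Hessian w.r.t. the actions) is negative definite."""
    d = len(g)
    Hinv_g = np.linalg.solve(H, g)             # H^{-1} g without forming H^{-1}
    sign, logdet = np.linalg.slogdet(-H)       # log|-H|, valid when -H is PD
    assert sign > 0, "H must be negative definite"
    return 0.5 * g @ Hinv_g + 0.5 * logdet - 0.5 * d * np.log(2 * np.pi)

# Sanity check: g = 0, H = -I gives the log-density of a standard normal
# evaluated at its mean, -0.5 * log(2*pi) per dimension.
print(cioc_approx_loglik(np.zeros(1), -np.eye(1)))
```

In practice one would sum this quantity over the demonstrations and hand its negation (plus gradients) to an optimizer such as L-BFGS.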

## Papers (Chronological Order)

- Wulfmeier, Markus, Peter Ondruska, and Ingmar Posner. “Maximum entropy deep inverse reinforcement learning.” arXiv preprint arXiv:1507.04888 (2015).
- Levine, Sergey, and Vladlen Koltun. “Continuous inverse optimal control with locally optimal examples.” arXiv preprint arXiv:1206.4617 (2012).
- Ziebart, Brian D., et al. “Maximum Entropy Inverse Reinforcement Learning.” AAAI. 2008.
- Abbeel, Pieter, et al. “Apprenticeship learning for motion planning with application to parking lot navigation.” 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2008.
- Ratliff, Nathan D., J. Andrew Bagnell, and Martin A. Zinkevich. “Maximum margin planning.” Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
- Abbeel, Pieter, and Andrew Y. Ng. “Apprenticeship learning via inverse reinforcement learning.” Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
- Ng, Andrew Y., and Stuart J. Russell. “Algorithms for inverse reinforcement learning.” ICML. 2000.