The objective of this post is to give a concise summary of IRL problems and algorithms and to collect a set of resources in this field. This post assumes basic knowledge of Markov Decision Processes (MDP) and Reinforcement Learning (RL). I will keep updating it as I learn. I hope this post helps those who are also interested in IRL.
Table of Contents
- Feature-based Reward Function and Max Margin Methods
- Maximum Entropy Inverse Reinforcement Learning
- Continuous Inverse Optimal Control
- Papers (Chronological Order)
- Useful Links
Motivation
- Modeling animal and human behavior
- Inverse Reinforcement Learning (IRL) is useful for apprenticeship learning
- Modeling of other agents, both adversarial and cooperative
Reinforcement Learning (RL)
RL optimizes the expected cumulative discounted reward of the agent given a reward function:

$$\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi\right]$$

where $\pi$ is the policy to be learned, $R(s_t)$ is the reward on state $s_t$, and $\gamma$ is the discount factor.
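As a concrete illustration of this objective, here is a minimal sketch that computes the discounted return of a single trajectory (the reward values and discount factor are made up for the example):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * R(s_t) over one trajectory of per-state rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: three states each with reward 1 and discount 0.5
# gives 1 + 0.5 + 0.25 = 1.75
value = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

RL algorithms search for the policy whose induced trajectories maximize the expectation of this quantity.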
Inverse Reinforcement Learning (IRL)
However, the reward function is not always explicitly given. The goal of IRL is to learn a reward function that explains the expert behaviour, i.e. one under which the expert policy is optimal:

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi^*\right] \ge \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi\right] \quad \forall \pi$$

where $\pi^*$ is the optimal expert policy, $R(s_t)$ is the reward on state $s_t$, and $\gamma$ is the discount factor.

The problem is inferring the reward function $R$ given the dynamics model $T$ and the optimal policy $\pi^*$, or, more generally, given only the expert trajectories.
IRL vs Supervised Behavioral Cloning
Behavioral cloning follows the standard supervised learning approach, mapping directly from states to actions: it tries to learn the optimal policy $\pi^*$ itself. IRL, in contrast, learns the reward function $R$.
The reward function is often much more succinct than the optimal policy, especially in planning-oriented tasks. However, IRL comes with its own challenges:
- $R = 0$ is a degenerate solution (every policy is optimal under it).
- We only have access to the expert traces rather than the expert policy $\pi^*$.
- The expert is not always optimal.
- Computationally challenging: enumerating all policies is infeasible.
Feature-based Reward Function and Max Margin Methods
Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi(s) \in \mathbb{R}^n$ is a feature vector. Then

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi\right] = w^\top \, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\middle|\, \pi\right] = w^\top \mu(\pi)$$

where $\mu(\pi)$ is the expected cumulative discounted sum of feature values, or "feature expectations".
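In practice, the expert's feature expectations are estimated by averaging discounted feature sums over demonstration trajectories. A minimal sketch, assuming each trajectory is given as a list of feature vectors $\phi(s_t)$ (the toy trajectories below are made up):

```python
import numpy as np

def feature_expectations(trajectories, gamma):
    """Estimate mu = (1/m) * sum_i sum_t gamma^t * phi(s_t) from m demos."""
    mu = np.zeros(len(trajectories[0][0]))
    for traj in trajectories:
        for t, phi in enumerate(traj):
            mu += gamma ** t * np.asarray(phi, dtype=float)
    return mu / len(trajectories)

# Two toy demonstrations with 2-dimensional features
demos = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [1.0, 0.0]]]
mu = feature_expectations(demos, gamma=0.9)
```

With a linear reward $w^\top \phi(s)$, matching these feature expectations is sufficient for matching expected reward, which is what the max-margin and Max-Ent methods below exploit.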
This problem can be formulated as a max-margin optimization problem:

$$\max_{m,\; \|w\|_2 \le 1} m \quad \text{s.t.} \quad w^\top \mu(\pi^*) \ge w^\top \mu(\pi) + m \quad \forall \pi$$

where $\pi^*$ is the optimal expert policy. The computational difficulty caused by the large number of constraints can be addressed by constraint generation.
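To make the optimization concrete, here is a toy sketch of one max-margin step over a small enumerated set of candidate policies. It relaxes the 2-norm bound on $w$ to an $\infty$-norm bound so the problem becomes a linear program; the feature expectations are made-up numbers, not from any real task:

```python
import numpy as np
from scipy.optimize import linprog

# Feature expectations: expert vs. two candidate policies (toy numbers)
mu_star = np.array([1.0, 0.0])
mu_candidates = [np.array([0.0, 1.0]), np.array([0.5, 0.5])]

# Variables x = [w_0, w_1, m]; maximize m  <=>  minimize -m
c = [0.0, 0.0, -1.0]
# Constraints: m - w^T (mu_star - mu_i) <= 0 for each candidate policy
A_ub = [[-(mu_star - mu)[0], -(mu_star - mu)[1], 1.0] for mu in mu_candidates]
b_ub = [0.0] * len(mu_candidates)
bounds = [(-1.0, 1.0), (-1.0, 1.0), (None, None)]  # ||w||_inf <= 1

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, margin = res.x[:2], res.x[2]
```

Constraint generation would alternate this step with an RL solver that finds the best-response policy under the current $w$ and adds its feature expectations to `mu_candidates`.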
See also structured max margin with slack variables (Ratliff, Zinkevich and Bagnell, 2006).
Maximum Entropy Inverse Reinforcement Learning
> Subject to precisely stated prior data (such as a proposition that expresses testable information), the probability distribution which best represents the current state of knowledge is the one with largest entropy.
I will follow the notation of the Max-Ent IRL paper (Ziebart et al., 2008) and express the feature-matching constraint as

$$\sum_{\tau} P(\tau) f_\tau = \tilde{f}$$

where $\tau$ are trajectories, $P(\tau)$ is the probability of trajectory $\tau$, and $f_\tau = \sum_{s_j \in \tau} f_{s_j}$. Or, if we use a discount factor $\gamma$, $f_\tau = \sum_{j} \gamma^j f_{s_j}$.

The reward for a single path is $r(\tau) = \theta^\top f_\tau$, where $\theta$ are the weights.
Under the Max-Ent model, plans with equivalent rewards have equal probabilities, and plans with higher rewards are exponentially more preferred: $P(\tau) \propto e^{\theta^\top f_\tau}$.
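For a toy problem where the trajectories can be enumerated, this distribution is just a softmax over path rewards. A sketch with made-up weights and feature counts:

```python
import numpy as np

theta = np.array([1.0, -1.0])    # reward weights (made up)
f_taus = np.array([[2.0, 0.0],   # feature counts f_tau for three trajectories
                   [1.0, 1.0],
                   [0.0, 2.0]])

rewards = f_taus @ theta                # r(tau) = theta^T f_tau
p = np.exp(rewards - rewards.max())     # subtract max for numerical stability
p /= p.sum()                            # P(tau) = exp(r(tau)) / Z
```

Here the first trajectory has reward 2 and the second reward 0, so the first is $e^{2}$ times more probable, illustrating the "exponentially preferred" property. In real problems the partition function cannot be computed by enumeration; Ziebart et al. use a dynamic-programming forward/backward pass instead.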
Deterministic Path Distributions

$$P(\tau \mid \theta) = \frac{e^{\theta^\top f_\tau}}{Z(\theta)}$$

Non-Deterministic Path Distributions

$$P(\tau \mid \theta, T) \approx \frac{e^{\theta^\top f_\tau}}{Z(\theta, T)} \prod_{s_{t+1}, a_t, s_t \in \tau} P_T(s_{t+1} \mid s_t, a_t)$$
Continuous Inverse Optimal Control
From Max-Ent IRL we can model the probability of an expert action sequence $u$ given the initial state $x$ of the trajectory as:

$$P(u \mid x) = \frac{e^{r(x, u)}}{\int e^{r(x, \tilde{u})} \, d\tilde{u}}$$

Approximate $r$ using a second-order Taylor expansion on the actions $\tilde{u}$ around the expert actions $u$:

$$r(x, \tilde{u}) \approx r(x, u) + (\tilde{u} - u)^\top g + \frac{1}{2} (\tilde{u} - u)^\top H (\tilde{u} - u)$$

with gradient and Hessian:

$$g = \frac{\partial r}{\partial u}, \qquad H = \frac{\partial^2 r}{\partial u^2}$$

Then the Gaussian integral gives the approximate probability:

$$P(u \mid x) \approx e^{\frac{1}{2} g^\top H^{-1} g} \, \lvert -H \rvert^{\frac{1}{2}} \, (2\pi)^{-\frac{d_u}{2}}$$

The approximated log likelihood:

$$\mathcal{L} \approx \frac{1}{2} g^\top H^{-1} g + \frac{1}{2} \log \lvert -H \rvert - \frac{d_u}{2} \log 2\pi$$
And finally, use a numerical optimization algorithm such as L-BFGS to find a local optimum.
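As a sanity check on this Laplace approximation: when the reward is exactly quadratic in the actions, the approximate log likelihood matches the exact Gaussian log-density. A numpy sketch with a made-up quadratic reward (the matrix and action values are arbitrary):

```python
import numpy as np

# Made-up quadratic reward r(u) = -0.5 * u^T A u, so g = -A u and H = -A
A = np.diag([2.0, 3.0])
u = np.array([0.5, -0.2])
d = len(u)

g = -A @ u    # gradient of r at the expert actions
H = -A        # Hessian of r (negative definite)

# Approximate log P(u|x) = 0.5 g^T H^{-1} g + 0.5 log|-H| - (d/2) log(2*pi)
approx_ll = (0.5 * g @ np.linalg.solve(H, g)
             + 0.5 * np.log(np.linalg.det(-H))
             - 0.5 * d * np.log(2 * np.pi))

# Exact log-density of the Gaussian N(0, A^{-1}) at u; should agree here
exact_ll = (-0.5 * u @ A @ u
            + 0.5 * np.log(np.linalg.det(A))
            - 0.5 * d * np.log(2 * np.pi))
```

For non-quadratic rewards the two differ, but the approximation only requires the expert demonstrations to be locally optimal, which is the key point of the Levine and Koltun (2012) paper.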
Papers (Reverse Chronological Order)
- Wulfmeier, Markus, Peter Ondruska, and Ingmar Posner. “Maximum entropy deep inverse reinforcement learning.” arXiv preprint arXiv:1507.04888 (2015).
- Levine, Sergey, and Vladlen Koltun. “Continuous inverse optimal control with locally optimal examples.” arXiv preprint arXiv:1206.4617 (2012).
- Ziebart, Brian D., et al. “Maximum Entropy Inverse Reinforcement Learning.” AAAI. 2008.
- Abbeel, Pieter, et al. “Apprenticeship learning for motion planning with application to parking lot navigation.” 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2008.
- Ratliff, Nathan D., J. Andrew Bagnell, and Martin A. Zinkevich. “Maximum margin planning.” Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
- Abbeel, Pieter, and Andrew Y. Ng. “Apprenticeship learning via inverse reinforcement learning.” Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
- Ng, Andrew Y., and Stuart J. Russell. “Algorithms for inverse reinforcement learning.” ICML. 2000.