toyCarIRL

Reinforcement learning (RL) is is the very basic and most intuitive form of trial and error learning, it is the way by which most of the living organisms with some form of thinking capabilities learn. Often referred to as learning by exploration, it is the way by which a new born human baby learns to take its first steps, that is by taking random actions initially and then slowly figuring out the actions which lead to the forward walking motion.

Note, this post assumes a good understanding of the Reinforcement learning framework, please make yourself familiar with RL through week 5 and 6 of this awesome online course AI_Berkeley.

Now the question that I kept asking myself is, what is the driving force for this kind of learning, what forces the agent to learn a particular behavior in the way it is doing it. Upon learning more about RL I came across the idea of rewards, basically the agent tries to choose its actions in such a way that the rewards that is gets from that particular behavior are maximized. Now to make the agent perform different behaviors, it is the reward structure that one must modify/exploit. But assume we only have the knowledge of the behavior of the expert with us, then how do we estimate the reward structure given a particular behavior in the environment? Well, this is the very problem of Inverse Reinforcement Learning (IRL), where given the optimal expert policy (actually assumed to be optimal), we wish to determine the underlying reward structure.

The reinforcement learning framework.
The Inverse reinforcement learning framework.

Again, this is not an Intro to Inverse Reinforcement Learning post, rather it is a tutorial on how to use/code Inverse reinforcement learning framework for your own problem, but IRL lies at the very core of it, and it is quintessential to know about it first. IRL has been extensively studied in the past and algorithms have been developed for it, please go through the papers Ng and Russell,2000, and Abbeel and Ng, 2004 for more information.

This posts adapts the algorithm from Abbeel and Ng, 2004 for solving the IRL problem.

Problem to be solved

The idea here is to program a simple agent in a 2D world full of obstacles to copy/clone different behaviors in the environment, the behaviors are input with the help of expert trajectories given manually by a human/computer expert. This form of learning from expert demonstrations is called Apprenticeship Learning in the scientific literature, at the core of it lies inverse Reinforcement Learning, and we are just trying to figure out the different reward functions for these different behaviors.

Apprenticeship vs. imitation learning - what is the difference?

In general, yes, they are the same thing, which means to learn from demonstration (LfD). Both methods learn from demonstration, but they learn different things:

Working Environment

The white dots represent the extent to which the agent's sensors extend.

Note, the normalization is done to ensure that every observable feature value is in the range [0,1] which is a necessary condition on the rewards for the IRL algorithm to converge.

Important modifications over the RL algorithm in Matt’s code

Note, that the RL algorithm is completely adopted from this post by Matt Harvey with minor changes, thus it makes perfect sense to talk about the changes that I have made, also even if the reader is comfortable with RL, I highly recommend a glance over that post in order to get an understanding of how the reinforcement learning is taking place.

The environment is significantly changed, with the agent getting abilites to not only sense the distance from the 3 sensors but also sense the color of the obstacles, enabling it to dishtinguish between the obstacles. Also, the agent is now smaller smaller in size and its sensing dots are now closer in order to get more resolution and better performance. Obstacles had to be made static for now, in order to simplify the process of testing the IRL algorithm, this may very well lead to overfitting of data, but I am not concerned about that at the moment. As discussed above the observation set or the state of the agent has been increased from 3 to 8, with the inclusion of the crash feature in the agent’s state. The reward structure is completely changed, the reward is now a weighted linear combination of these 8 features, the agent no more receives a -500 reward on bumping into obstacles, rather, the feature value for bumping is +1 and not bumping is 0 and it is on the algorithm to decide what weight should be assigned to this feature based on the expert behavior.

As stated in Matt’s blog, the aim here is to not just teach the RL agent to avoid obstacles, I mean why to assume anything about the reward structure, let the reward structure be completely decided by the algorithm from the expert demonstrations and see what behavior a particular setting of rewards achieves!

Inverse Reinforcement Learning

Important definitions -

  1. The features or basis functions phi_i which are basically observables in the state. The features in the current problem are discussed above in the state space section. We define phi(s_t) to be the sum of all the feature expectations phi_i such that:

  2. Rewards r_t - linear combination of these feature values observed at each state s_t.

  3. Feature expectations mu(pi) of a policy pi is the sum of discounted feature values phi(s_t).

    The feature expectations of a policy are independent of the weights, they only dependent on the states visited during the run (according to the policy) and on the discount factor gamma a number between 0 and 1 (e.g. 0.9 in our case). To obtain the feature expectations of a policy we have to execute the policy in real time with the agent and record the states visited and the feature values obtained.

Initializations

  1. Expert policy feature expectations or the expert’s feature expectations mu(pi_E) are obtained by the actions that are taken according to the expert behavior. We basically execute this policy and get the feature expectations as we do with any other policy. The expert feature expectations are given to the IRL algorithm to find the weights such that the reward funciton corresponding to the weights resemebles the underlying reward function that the expert is trying to maximize (in usual RL language).

  2. Random policy feature expectations - execute a random policy and use the feature expectations obtained to initialize IRL.

The algorithm

  1. Maintain a list of the policy feature expectations that we obtain after every iteration.
  2. At the very start we have only pi^1 -> the random policy feature expectations.
  3. Find the first set of weights of w^1 by convex optimization, the problem is similar to an SVM classifier which tries to give a +1 label to the expert feature expec. and -1 label to all the other policy feature expecs.-

    such that,

    Termination condition:

  4. Now, once we get the weights after one iteration of optimization, that is once we get a new reward function, we have to learn the policy which this reward function gives rise to. This is same as saying, find a policy that tries to maximize this obtained reward function. To find this new policy we have to train the Reinforcement Learning algorithm with this new reward function, and train it until the Q-values converge, to get a proper estimate of the policy.

  5. After we have learned a new policy we have to test this policy online, in order to get the feature expectations corresponding to this new policy. Then we add these new feature expectaitons to our list of feature expectations and carry on with out IRL algorithm’s next iteration until convergence.

The ones and the zeros

Let us now try to get a hang of the code. Please Find the complete code in this git repo. There are mainly 3 files that you have to worry about -

Import dependencies and define the important parameters, change the BEHAVIOR as required. FRAMES are the number of frames you want the RL algorithm to run. 100K is okay and takes around 2 hours.

Create the easy to use irlAgent class, which takes in the random and expert behaviors, and the other important parameters as shown.

The getRLAgentFE function uses the IRL_helper from the reinforcement learner to train a new model and get feature expectations by playing that model for 2000 iterations. It basically returns the feature expectations for every set of weights (W) it gets.

To update the dictionary in which we keep our obtained policies and their respective t values. Where t = (weights.tanspose)x(expert-newPolicy).

The implementation of the main IRL algorithm, that is discussed above.

The convex optimization to update the weights upon receving a new policy, basically assign +1 label to expert policy and -1 label to all the other policies and optimize for the weights under the mentioned contraints. To know more about this optimization visit site

Create an irlAgent and pass the desired parameters, select between the type of expert behavior you wish to learn the weights for and then run the optimalWeightFinder() function. Note that I have already obtained the feature expectations for red, yellow and brown behaviors. After the algorithm terminates, you will obtain a list of weights in ‘weights-red/yellow/brown.txt’, with the respective selected BEHAVIOR. Now, to select the best possible behavior from all the obtained weights, play the saved models in the saved-models_BEHAVIOR/evaluatedPolicies/ directory, the models are saved in the following format ‘saved-models_’+ BEHAVIOR +’/evaluatedPolicies/’+iteration number+ ‘-164-150-100-50000-100000’ + ‘.h5’. Basically you will get different weights for different iterations, first play the models to find out the model which performs best, then note the iteration number of that model, the weights obtained corresponding to this iteration number are the weights that get you closest to the expert behavior.

And then there are files which you probably don’t need to update/modify, atleast for the content in this post -

The Results, the behaviors and the weights

After around 10-15 iterations the algorithm converges in all the 4 different chosen behaviors, I obtained the following results:

Yellow loving agent: the objective is to go around yellow obstacles without bumping.
Brown loving agent: the objective is to go around brown obstacles without bumping.
Red loving agent: the objective is to go around red obstacles without bumping. The boundary is red colored.
Bumping loving agent: the objective is to bump into obstacles continuously.
Weights     I love yellow          I love Brown         I love Red          I love Bumping     
w1 (Left sensor dist.) -0.0880 -0.2627 0.2816 -0.5892
w2 (middle sensor dist.) -0.0624 0.0363 -0.5547 -0.3672
w3 (right sensor dist.) 0.0914 0.0931 -0.2297 -0.4660
w4 (Black color) -0.0114 0.0046 0.6824 -0.0299
w5 (yellow color) 0.6690 -0.1829 -0.3025 -0.1528
w6 (brown color) -0.0771 0.6987 0.0004 -0.0368
w7 (red color) -0.6650 -0.5922 0.0525 -0.5239
w8 (crash) -0.2897 -0.2201 -0.0075 0.0256

Possible explanation of the weights -

Note, it is very important to first think of whether you as a human will be able to dishtinguish between the given behaviors with the availability of the current state set (the observations) while designing the problem structure. Otherwise you might just be forcing the algorithm to find different weights without providing it the necessary information completely.

Train for a different behavior

If you really want to get into IRL, I would recommend that you actually try to teach the agent a new behavior (you might have to modify the environment for that, as the possible distinct behaviors for the current state set have already been exploited, well atleast according to me).

Install Pygame

Install Pygame’s dependencies with:

sudo apt install mercurial libfreetype6-dev libsdl-dev libsdl-image1.2-dev libsdl-ttf2.0-dev libsmpeg-dev libportmidi-dev libavformat-dev libsdl-mixer1.2-dev libswscale-dev libjpeg-dev

Then install Pygame itself:

pip3 install hg+http://bitbucket.org/pygame/pygame

Install Pymunk

This is the physics engine used by the simulation. It just went through a pretty significant rewrite (v5) so you need to grab the older v4 version. v4 is written for Python 2 so there are a couple extra steps.

Go back to your home or downloads and get Pymunk 4:

wget https://github.com/viblo/pymunk/archive/pymunk-4.0.0.tar.gz

Unpack it:

tar zxvf pymunk-4.0.0.tar.gz

Update from Python 2 to 3:

cd pymunk-pymukn-4.0.0/pymunk

2to3 -w *.py

Install it:

cd .. python3 setup.py install

Now go back to where you cloned reinforcement-learning-car and make sure everything worked with a quick python3 learning.py. If you see a screen come up with a little dot flying around the screen, you’re ready to go!

Training

First, you need to train a model. This will save weights to the saved-models folder. You may need to create this folder before running. You can train the model by running:

python3 learning.py

It can take anywhere from an hour to 36 hours to train a model, depending on the complexity of the network and the size of your sample. However, it will spit out weights every 25,000 frames, so you can move on to the next step in much less time.

Playing

Edit the playing.py file to change the path name for the model you want to load. Sorry about this, I know it should be a command line argument.

Then, watch the car drive itself around the obstacles!

python3 playing.py

That’s all there is to it.

Resources and credits -

  1. Thank you Matt Harvey for the game and the RL framework, basically for the father blog to this blog - Using RL to teach a virtual car to avoid obstacles
  2. Andrew Ng and Stuart Russell, 2000 - Algorithms for Inverse Reinforcement Learning
  3. Pieter Abbeel and Andrew Ng, 2004 - Apprenticeship Learning via Inverse Reinforcement Learning