Description of this tutorial

This is the first part of a four-part series.

This series mainly consists of my study notes and solutions for the projects in the Udacity Deep Reinforcement Learning Nanodegree.

You can find more information in Deep Reinforcement Learning Part 0.
GitHub Repository For This Part
Contents Overview
Part 1. Reinforcement Learning
1.1 What is Reinforcement Learning?
Reinforcement Learning (RL), like supervised learning and unsupervised learning, is a branch of machine learning. Unlike the other two, however, RL is not about making predictions or clustering: the RL agent learns autonomously from the rewards and punishments it receives while interacting with the environment. This process is well illustrated in the figure below.
The agent-environment interaction in reinforcement learning. (Source: Sutton and Barto, 2017)
1.2 The Setting
 The reinforcement learning (RL) framework is characterized by an agent learning to interact with its environment.
 At each time step, the agent receives the environment’s state (the environment presents a situation to the agent), and the agent must choose an appropriate action in response. One time step later, the agent receives a reward (the environment indicates whether the agent has responded appropriately to the state) and a new state.
 All agents have the goal of maximizing expected cumulative reward, i.e., the expected sum of rewards attained over all time steps.
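A minimal sketch of this interaction loop, using the classic (pre-0.26) OpenAI Gym API and a random agent in place of a learned policy:

```python
import gym

env = gym.make('Taxi-v3')
state = env.reset()                        # the environment presents the first state

for t in range(200):
    action = env.action_space.sample()     # a real agent would choose its action here
    next_state, reward, done, info = env.step(action)   # reward and new state arrive one step later
    state = next_state
    if done:                               # terminal state reached: the episode ends
        break
```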
1.3 Episodic vs. Continuing Tasks
 A task is an instance of the reinforcement learning (RL) problem.
 Continuing tasks are tasks that continue forever, without end.
 Episodic tasks are tasks with a well-defined starting and ending point.
 In this case, we refer to a complete sequence of interaction, from start to finish, as an episode.
 Episodic tasks come to an end whenever the agent reaches a terminal state.
1.4 Markov Decision Processes (MDPs) and One-Step Dynamics
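For reference, the one-step dynamics of a finite MDP specify, for each state-action pair, the probability of every possible next state and reward (following Sutton and Barto):

p(s′, r | s, a) ≐ P(St+1 = s′, Rt+1 = r | St = s, At = a), for all states s, s′, rewards r, and actions a.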
1.5 Q-Table
When working with finite MDPs, we can estimate the action-value function qπ corresponding to a policy π in a table known as a Q-table.
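For example, a Q-table can be represented as a dictionary that maps each state to a NumPy array of action values (a minimal sketch for a discrete state and action space; the example state and values are arbitrary):

```python
import numpy as np
from collections import defaultdict

nA = 6                                   # number of actions (e.g., Taxi-v3)
Q = defaultdict(lambda: np.zeros(nA))    # Q[state][action], initialized to 0.0

Q[462][4] = 6.0                          # estimated value of action 4 in state 462
print(Q[462])                            # [0. 0. 0. 0. 6. 0.]
```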
1.6 Epsilon-Greedy Policies
 A policy is ϵ-greedy with respect to an action-value function estimate Q if, for every state s ∈ S,
 with probability 1 − ϵ, the agent selects the greedy action, and
 with probability ϵ, the agent selects an action uniformly at random from the set of available (non-greedy and greedy) actions.
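A minimal sketch of ϵ-greedy action selection from such a Q-table (function and argument names are illustrative):

```python
import numpy as np

def epsilon_greedy_action(Q, state, nA, eps):
    """With probability eps explore uniformly at random; otherwise exploit the greedy action."""
    if np.random.random() < eps:
        return np.random.choice(np.arange(nA))   # explore: any of the nA actions
    return int(np.argmax(Q[state]))              # exploit: the greedy action
```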
1.7 Monte Carlo Methods (MC)
 With Monte Carlo (MC) methods, even though the underlying problem involves a great degree of randomness, we can infer useful information that we can trust simply by collecting a large number of samples.
 The equiprobable random policy is the stochastic policy where from each state the agent randomly selects from the set of available actions, and each action is selected with equal probability.
 The biggest drawback of MC methods is that they are not applicable to continuing tasks: the agent must wait until an episode ends before it can use the feedback, which is only possible in episodic tasks with a clear terminal state.
 MC methods update equation (constant-α MC): Q(St, At) ← Q(St, At) + α(Gt − Q(St, At))
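A minimal sketch of this update applied at the end of an episode (function and argument names are illustrative; states, actions, and rewards are the lists collected during the episode, and gamma is the discount factor):

```python
import numpy as np

def mc_control_update(Q, states, actions, rewards, alpha=0.1, gamma=1.0):
    """Constant-alpha MC: move every visited Q(S_t, A_t) toward its sampled return G_t."""
    discounts = np.array([gamma ** i for i in range(len(rewards))])
    for t, (s, a) in enumerate(zip(states, actions)):
        G_t = np.sum(rewards[t:] * discounts[:len(rewards) - t])   # return from time step t
        Q[s][a] += alpha * (G_t - Q[s][a])
    return Q
```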
1.8 Temporal-Difference Methods (TD)
Whereas Monte Carlo (MC) prediction methods must wait until the end of an episode to update the value function estimate, temporal-difference (TD) methods update the value function after every time step.
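For example, TD(0) prediction updates the state-value estimate immediately after observing (St, Rt+1, St+1), using the current estimate V(St+1) in place of the rest of the return (a minimal sketch; names are illustrative):

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    """One-step TD: move V(s) toward the TD target R_{t+1} + gamma * V(s')."""
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
```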
1.9 Sarsa
TD control uses only a small time window of information. Instead of using the return Gt as the alternative estimate for updating the Q-table (which requires waiting until an episode has completed), Sarsa uses the sum of the immediate reward Rt+1 and the (discounted) existing estimate Q(St+1, At+1). We do not want to update the Q-table only after an episode; we want to update Q(St, At) at every time step t. Therefore, we replace the return Gt in the MC update with Rt+1 + γQ(St+1, At+1), which allows us to update Q(St, At) at each time step t.
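A minimal sketch of the resulting Sarsa update (function and argument names are illustrative; Q is assumed to be a Q-table like the one sketched in 1.5):

```python
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=1.0):
    """Sarsa: the TD target uses the action actually selected (eps-greedily) in the next state."""
    td_target = reward + gamma * Q[next_state][next_action]
    Q[state][action] += alpha * (td_target - Q[state][action])
```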
1.10 Sarsamax (Q-Learning)
1.11 Difference between Sarsa and Sarsamax
Sarsa updates the Q-table only after A0, R1, S1, and A1 have all been obtained, where both A0 and A1 are chosen ϵ-greedily with respect to Q.
Sarsamax first obtains A0, R1, and S1 (with A0 chosen ϵ-greedily with respect to Q), then uses the purely greedy action a that maximizes Q(S1, a) in the Q-table to update the Q-table, and only afterwards selects A1. The same process then repeats at time step t + 1.
Q: What is the new value in the Q-table corresponding to Q(s0, a0)?
A: Q(s0, a0) ← 6 + 0.1(−1 + 9 − 6) = 6.2.
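A minimal sketch of the Sarsamax (Q-Learning) update (names are illustrative); the only change from the Sarsa sketch is that the TD target uses the maximum action value in the next state:

```python
import numpy as np

def sarsamax_update(Q, state, action, reward, next_state, alpha=0.1, gamma=1.0):
    """Q-Learning: the TD target uses max_a Q(S_{t+1}, a), regardless of the next action taken."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state][action] += alpha * (td_target - Q[state][action])
```

With Q(s0, a0) = 6, reward −1, max_a Q(s1, a) = 9, α = 0.1, and γ = 1, this reproduces the 6.2 above.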
1.12 Expected Sarsa
Q: What is the new value in the Q-table corresponding to Q(s0, a0)?
A: Q(s0, a0) ← 6 + 0.1(−1 + [0.1 × 8] + [0.1 × 7] + [0.7 × 9] + [0.1 × 8] − 6) = 6.16.
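A minimal sketch of the Expected Sarsa update (names are illustrative), where the TD target uses the expected value of Q(St+1, ·) under the ϵ-greedy policy; with four actions, the probabilities 0.1, 0.1, 0.7, 0.1 in the example above correspond to ϵ = 0.4:

```python
import numpy as np

def expected_sarsa_update(Q, state, action, reward, next_state, nA, eps,
                          alpha=0.1, gamma=1.0):
    """Expected Sarsa: the TD target uses the eps-greedy expectation over next-state action values."""
    probs = np.ones(nA) * eps / nA                 # each action gets probability eps / nA
    probs[np.argmax(Q[next_state])] += 1 - eps     # the greedy action gets the extra 1 - eps
    expected_q = np.dot(probs, Q[next_state])
    td_target = reward + gamma * expected_q
    Q[state][action] += alpha * (td_target - Q[state][action])
```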
1.14 Analyzing Performance
1.14.1 Similarities
All of the TD control methods we have examined (Sarsa, Sarsamax, Expected Sarsa) converge to the optimal action-value function q∗ (and so yield the optimal policy π∗) if: 1. the value of ϵ decays in accordance with the GLIE conditions, and 2. the step-size parameter α is sufficiently small.
1.14.2 Differences
The differences between these algorithms are summarized below:
* Sarsa and Expected Sarsa are both on-policy TD control algorithms. In this case, the same (ϵ-greedy) policy that is evaluated and improved is also used to select actions.
* Sarsamax is an off-policy method, where the (greedy) policy that is evaluated and improved is different from the (ϵ-greedy) policy that is used to select actions.
* On-policy TD control methods (like Expected Sarsa and Sarsa) have better online performance than off-policy TD control methods (like Sarsamax).
* Expected Sarsa generally achieves better performance than Sarsa.
1.15 Greedy in the Limit with Infinite Exploration (GLIE)
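The GLIE conditions require that (1) every state-action pair is visited infinitely often, and (2) the policy converges to a policy that is greedy with respect to the action-value estimate Q. Both conditions are satisfied, for example, when ϵ is decayed as ϵ_i = 1/i at episode i. A minimal sketch of such a schedule (the optional floor keeps some exploration in practice, though strictly speaking it breaks condition (2)):

```python
def epsilon_schedule(i_episode, eps_min=0.0):
    """GLIE-style decay: eps_i = 1 / i, optionally floored at eps_min."""
    return max(1.0 / i_episode, eps_min)
```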
Part 2. OpenAI Gym’s Taxi-v3 Task
“The Taxi Problem” originates from subsection 3.1 of the paper “Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition” by Tom Dietterich.
 The description of the problem from OpenAI Gym is referenced here: taxi.py
Brief Description:
There are four designated locations in the grid world, indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger’s location, picks up the passenger, drives to the passenger’s destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.
Observations:
There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations.
Also, there are 6 possible actions, corresponding to moving North, East, South, or West, picking up the passenger, and dropping off the passenger.
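These numbers can be verified directly (a minimal sketch, assuming the classic Gym API):

```python
import gym

env = gym.make('Taxi-v3')
print(env.observation_space.n)   # 500 discrete states
print(env.action_space.n)        # 6 discrete actions
```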
2.1 Files:
agent_Sarsa.py: The class of the RL agent developed with the Sarsa method.
agent_Sarsamax.py: The class of the RL agent developed with the Sarsamax (Q-Learning) method.
agent_ExpectedSarsa.py: The class of the RL agent developed with the Expected Sarsa method.
monitor_Sarsa.py: The interact function tests how well our Sarsa agent learns from interaction with the environment.
monitor_Sarsamax.py: The interact function tests how well our Sarsamax (Q-Learning) agent learns from interaction with the environment.
monitor_ExpectedSarsa.py: The interact function tests how well our Expected Sarsa agent learns from interaction with the environment.
main.py: Run this file in the terminal to check the performance of our agents.
I set all seeds to 505 for reproducible results.
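For reference, a minimal sketch of how this seeding might look (the exact environment call depends on the Gym version; older releases expose env.seed, newer ones take the seed in env.reset):

```python
import random

import gym
import numpy as np

SEED = 505
random.seed(SEED)
np.random.seed(SEED)
env = gym.make('Taxi-v3')
env.seed(SEED)   # with newer Gym/Gymnasium versions, use env.reset(seed=SEED) instead
```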
2.2 Sarsa Result
2.3 Sarsamax (Q-Learning) Result
2.4 Expected Sarsa Result
2.5 Comparison of Results
References
[1] Reinforcement Learning: An Introduction - Chapter 1: Introduction
[2] Reinforcement Learning: An Introduction - Chapter 3: Finite Markov Decision Processes
[3] Reinforcement Learning: An Introduction - Chapter 5: Monte Carlo Methods
[4] Reinforcement Learning: An Introduction - Chapter 6: Temporal-Difference Learning
[5] Udacity Deep Reinforcement Learning Nanodegree Lectures
[6] David Silver’s RL Course Lecture