Temporal-Difference learning and the taxi-v2 environment


This blog post documents my attempt at solving the taxi-v2 environment from the OpenAI gym collection. To benchmark the used temporal difference learning algorithms, I calculate the theoretical expected episode reward for an optimal policy, derive an optimal policy using a shortest path algorithm, and show that there are an non-optimal policies that nevertheless always generate optimal trajectories.

The taxi-v2 environment

The description of the taxi-v2 environment can be found here:


There are 6 possible actions “south”, “north”, “east”, “west”, “pickup”, and “dropoff”, labeled with integers 0 to 5.

The state is characterized by the following 4 integer parameters:

(taxi_row, taxi_col, passenger_location, destination) 

The taxi drives on a 5x5 matrix. There are 5 possible passenger locations (R, G, Y, B, and Taxi), and 4 possible destinations:

|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |

The pipe characters ‘|’ indicate obstacles: For example, the taxi cannot drive from the position (0, 1) to (0, 2) is one move.

The state observations are encoded as Discrete(500), which is a state space consisting of integers 0..499, using the formula

((taxi_row*5 + taxi_col)*5 + passenger_location)*4 + destination.

Each action results in a reward of -1, except in the following two cases: 1. The “dropoff” action that ends the episode is rewarded with 20. (Not with -1 + 20, as stated in the source code documentation) 2. The misuse of “pickup” and “dropoff” actions is rewarded with -10.

The initial state is chosen at random (uniformly) such that the passenger is not in the taxi, and the pickup and dropoff positions are not equal.

Given this description, we can show that the expected cumulative reward is 8.46333…: In each episode, the pickup and dropoff actions give us -1 + 20 reward points. To calculate the expected travel path length, we can iterate over all possible initial states and calculate shortest travel path for each. I’ve done this using the NetworkX1 Python package containing numerous graph algorithms.

def expected_end_reward(self):
    expectation = -1.0 + 20.0
    path_lengths = 0.0
    path_no = 0.0
    for r0, c0, pickup_index, dropoff_index in itertools.product(
            range(5), range(5), range(4), range(4)):
        if pickup_index == dropoff_index:
        path_lengths += networkx.shortest_path_length(self.graph, (r0, c0), self.locs[pickup_index])
        path_lengths += networkx.shortest_path_length(self.graph, self.locs[pickup_index], self.locs[dropoff_index])
        path_no += 1.0
    expectation -= path_lengths/path_no
    return expectation


The leaderboard page in the gym wiki2 uses a running cumulative reward average calculated over 100 episodes which is a strange and unstable metric, since it is influenced significantly by the randomness of the initial states. It makes much more sense to calculate the average reward over a large number of episodes (e.g. 25000) and compare that statistics with with the theoretical expected reward calculated above.

Also, we can analyze at the learned policy itself, and, for each state s determine whether the proposed action is optimal or not.

Temporal-Difference learning

This task of driving a taxi around a 5x5 matrix might appear very straightforward at first, but in fact the interaction with the environment from an agents perspective can appear quite puzzling. Here is an small part of a trajectory generated by a random agent:

action, reward, next_state
2, -1, 393
5, -10, 393
2, -1, 393
3, -1, 373
4, -10, 373
2, -1, 393

The class Agent1 implements3 standard temporal-difference methods sarsa, sarsamax, and expected sarsa.

I ended up using the expected sarsa method with parameter values alpha=0.05, gamma=0.9, and epsilon=0.1 over the first 11000 episodes and epsilon=0.0 for the rest of the total number of 20000 episodes. Random initialization of the Q table seem to work better than using zero or constant initialization. My guess is that random Q table yields a “noisy” initial policy that encourages exploratory behavior at the beginning of the training.

The expected return of the policy estimated using this method matches the theoretical optimal expected value of ~8.4 in most cases. Somehow surprisingly, the estimated policy is not optimal and prescribes non-optimal actions for 8 states. It turns out that this actually doesn’t matter, because the states of the non-optimal state-action pairs are never attained.

In the following listing, the states are expanded to 4-tuples (“row”, “column”, “passenger location”, “dropoff location”).

state 436: 4 1 T R  action 0: south
state 437: 4 1 T G  action 0: south
state 438: 4 1 T Y  action 0: south
state 439: 4 1 T B  action 0: south
state 456: 4 2 T R  action 5: dropoff
state 457: 4 2 T G  action 2: east
state 458: 4 2 T Y  action 0: south
state 459: 4 2 T B  action 3: west
state 498: 4 4 T Y  action 2: east

|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |

For example, the position (4, 1) will never be visited while transporting a passenger, if the actions are dictated by an optimal policy. This example, points to a limitation of the policy optimality definition in case the set of allowed initial states is not the whole state space.


I use a slightly modified version of the render() function to record the episodes for this blog article. The destination is marked as red, the passenger location with magenta, and a yellow taxi turns white when it picks up the passenger.

Random agent
Trained agent


The Python module containing the agent, as well as functions and classes used to derive the auxiliary results cited in this blog post, is available on GitHub:


Similar write-ups and references

  1. https://networkx.github.io/documentation/stable/index.html↩︎

  2. https://github.com/openai/gym/wiki/Leaderboard#taxi-v2↩︎

  3. https://gitlab.com/jwergieluk/rl/blob/master/rl01.py↩︎