Repository: ShangtongZhang/reinforcement-learning-an-introduction
Branch: master
Commit: 96bc203617a7
Files: 31
Total size: 260.1 KB
Directory structure:
gitextract_m4ci91yn/
├── .gitignore
├── .travis.yml
├── LICENSE
├── README.md
├── chapter01/
│   └── tic_tac_toe.py
├── chapter02/
│   └── ten_armed_testbed.py
├── chapter03/
│   └── grid_world.py
├── chapter04/
│   ├── car_rental.py
│   ├── car_rental_synchronous.py
│   ├── gamblers_problem.py
│   └── grid_world.py
├── chapter05/
│   ├── blackjack.py
│   └── infinite_variance.py
├── chapter06/
│   ├── cliff_walking.py
│   ├── maximization_bias.py
│   ├── random_walk.py
│   └── windy_grid_world.py
├── chapter07/
│   └── random_walk.py
├── chapter08/
│   ├── expectation_vs_sample.py
│   ├── maze.py
│   └── trajectory_sampling.py
├── chapter09/
│   ├── random_walk.py
│   └── square_wave.py
├── chapter10/
│   ├── access_control.py
│   └── mountain_car.py
├── chapter11/
│   └── counterexample.py
├── chapter12/
│   ├── lambda_effect.py
│   ├── mountain_car.py
│   └── random_walk.py
├── chapter13/
│   └── short_corridor.py
└── requirements.txt
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.idea
*.pyc
latex
*.bin
extra
.DS_Store
.vscode/
================================================
FILE: .travis.yml
================================================
language: python
python:
- "3.6"
install:
- pip install -r requirements.txt
script:
- ls chapter*/*.py | xargs -n 1 -P 1 python -m py_compile
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2019 Shangtong Zhang
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Reinforcement Learning: An Introduction
[Build Status](https://travis-ci.org/ShangtongZhang/reinforcement-learning-an-introduction)
Python replication for Sutton & Barto's book [*Reinforcement Learning: An Introduction (2nd Edition)*](http://incompleteideas.net/book/the-book-2nd.html)
> If you are confused by any of the code or want to report a bug, please open an issue instead of emailing me directly. Unfortunately, I do not have exercise answers for the book.
# Contents
### Chapter 1
1. Tic-Tac-Toe
### Chapter 2
1. [Figure 2.1: An example bandit problem from the 10-armed testbed](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_1.png)
2. [Figure 2.2: Average performance of epsilon-greedy action-value methods on the 10-armed testbed](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_2.png)
3. [Figure 2.3: Optimistic initial action-value estimates](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_3.png)
4. [Figure 2.4: Average performance of UCB action selection on the 10-armed testbed](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_4.png)
5. [Figure 2.5: Average performance of the gradient bandit algorithm](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_5.png)
6. [Figure 2.6: A parameter study of the various bandit algorithms](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_6.png)
### Chapter 3
1. [Figure 3.2: Grid example with random policy](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_3_2.png)
2. [Figure 3.5: Optimal solutions to the gridworld example](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_3_5.png)
### Chapter 4
1. [Figure 4.1: Convergence of iterative policy evaluation on a small gridworld](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_4_1.png)
2. [Figure 4.2: Jack’s car rental problem](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_4_2.png)
3. [Figure 4.3: The solution to the gambler’s problem](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_4_3.png)
### Chapter 5
1. [Figure 5.1: Approximate state-value functions for the blackjack policy](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_1.png)
2. [Figure 5.2: The optimal policy and state-value function for blackjack found by Monte Carlo ES](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_2.png)
3. [Figure 5.3: Weighted importance sampling](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_3.png)
4. [Figure 5.4: Ordinary importance sampling with surprisingly unstable estimates](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_4.png)
### Chapter 6
1. [Example 6.2: Random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/example_6_2.png)
2. [Figure 6.2: Batch updating](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_2.png)
3. [Figure 6.3: Sarsa applied to windy grid world](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_3.png)
4. [Figure 6.4: The cliff-walking task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_4.png)
5. [Figure 6.6: Interim and asymptotic performance of TD control methods](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_6.png)
6. [Figure 6.7: Comparison of Q-learning and Double Q-learning](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_7.png)
### Chapter 7
1. [Figure 7.2: Performance of n-step TD methods on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_7_2.png)
### Chapter 8
1. [Figure 8.2: Average learning curves for Dyna-Q agents varying in their number of planning steps](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_2.png)
2. [Figure 8.4: Average performance of Dyna agents on a blocking task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_4.png)
3. [Figure 8.5: Average performance of Dyna agents on a shortcut task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_5.png)
4. [Example 8.4: Prioritized sweeping significantly shortens learning time on the Dyna maze task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/example_8_4.png)
5. [Figure 8.7: Comparison of efficiency of expected and sample updates](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_7.png)
6. [Figure 8.8: Relative efficiency of different update distributions](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_8.png)
### Chapter 9
1. [Figure 9.1: Gradient Monte Carlo algorithm on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_1.png)
2. [Figure 9.2: Semi-gradient n-step TD algorithm on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_2.png)
3. [Figure 9.5: Fourier basis vs polynomials on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_5.png)
4. [Figure 9.8: Example of feature width’s effect on initial generalization and asymptotic accuracy](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_8.png)
5. [Figure 9.10: Single tiling and multiple tilings on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_10.png)
### Chapter 10
1. [Figure 10.1: The cost-to-go function for Mountain Car task in one run](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_1.png)
2. [Figure 10.2: Learning curves for semi-gradient Sarsa on Mountain Car task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_2.png)
3. [Figure 10.3: One-step vs multi-step performance of semi-gradient Sarsa on the Mountain Car task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_3.png)
4. [Figure 10.4: Effect of the alpha and n on early performance of n-step semi-gradient Sarsa](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_4.png)
5. [Figure 10.5: Differential semi-gradient Sarsa on the access-control queuing task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_5.png)
### Chapter 11
1. [Figure 11.2: Baird's Counterexample](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_11_2.png)
2. [Figure 11.6: The behavior of the TDC algorithm on Baird’s counterexample](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_11_6.png)
3. [Figure 11.7: The behavior of the ETD algorithm in expectation on Baird’s counterexample](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_11_7.png)
### Chapter 12
1. [Figure 12.3: Off-line λ-return algorithm on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_3.png)
2. [Figure 12.6: TD(λ) algorithm on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_6.png)
3. [Figure 12.8: True online TD(λ) algorithm on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_8.png)
4. [Figure 12.10: Sarsa(λ) with replacing traces on Mountain Car](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_10.png)
5. [Figure 12.11: Summary comparison of Sarsa(λ) algorithms on Mountain Car](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_11.png)
### Chapter 13
1. [Example 13.1: Short corridor with switched actions](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/example_13_1.png)
2. [Figure 13.1: REINFORCE on the short-corridor grid world](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_13_1.png)
3. [Figure 13.2: REINFORCE with baseline on the short-corridor grid-world](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_13_2.png)
# Environment
* Python 3.6
* numpy
* matplotlib
* [seaborn](https://seaborn.pydata.org/index.html)
* [tqdm](https://pypi.org/project/tqdm/)
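
Before running any script, it can help to confirm that the packages listed above are importable. The snippet below is a minimal sketch, not part of the repository: the helper name `missing_packages` and the `REQUIRED` list are hypothetical, and the import names are assumed to match the package names above.

```python
import importlib.util

# Import names assumed to correspond to the packages listed above.
REQUIRED = ["numpy", "matplotlib", "seaborn", "tqdm"]


def missing_packages(names=REQUIRED):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with `pip install -r requirements.txt`, the same command the Travis configuration uses.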
# Usage
> All files are self-contained
```commandline
python any_file_you_want.py
```
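
The plotting scripts save their figures to relative paths such as `../images/figure_2_1.png`, so they appear to expect being launched from inside their chapter directory, with an `images/` directory present at the repository root. The following is a minimal sketch under that assumption; the helper `run_chapter_script` is hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def run_chapter_script(repo_root, relative_script):
    """Run one chapter script from inside its own directory (hypothetical helper)."""
    repo_root = Path(repo_root).resolve()
    script = repo_root / relative_script
    # The scripts write to '../images/...', so make sure that directory exists.
    (repo_root / "images").mkdir(exist_ok=True)
    # Use the chapter directory as cwd so '../images' resolves to <repo_root>/images.
    return subprocess.run(
        [sys.executable, script.name],
        cwd=str(script.parent),
        check=True,
    )
```

For example, `run_chapter_script('.', 'chapter03/grid_world.py')` called from the repository root would run the Chapter 3 script with `chapter03/` as its working directory.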
# Contribution
If you want to contribute some missing examples or fix some bugs, feel free to open an issue or make a pull request.
================================================
FILE: chapter01/tic_tac_toe.py
================================================
#######################################################################
# Copyright (C) #
# 2016 - 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Jan Hakenberg(jan.hakenberg@gmail.com) #
# 2016 Tian Jun(tianjun.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import pickle

BOARD_ROWS = 3
BOARD_COLS = 3
BOARD_SIZE = BOARD_ROWS * BOARD_COLS


class State:
    def __init__(self):
        # the board is represented by an n * n array,
        # 1 represents a chessman of the player who moves first,
        # -1 represents a chessman of another player
        # 0 represents an empty position
        self.data = np.zeros((BOARD_ROWS, BOARD_COLS))
        self.winner = None
        self.hash_val = None
        self.end = None

    # compute the hash value for one state, it's unique
    def hash(self):
        if self.hash_val is None:
            self.hash_val = 0
            for i in np.nditer(self.data):
                self.hash_val = self.hash_val * 3 + i + 1
        return self.hash_val

    # check whether a player has won the game, or it's a tie
    def is_end(self):
        if self.end is not None:
            return self.end
        results = []
        # check rows
        for i in range(BOARD_ROWS):
            results.append(np.sum(self.data[i, :]))
        # check columns
        for i in range(BOARD_COLS):
            results.append(np.sum(self.data[:, i]))

        # check diagonals
        trace = 0
        reverse_trace = 0
        for i in range(BOARD_ROWS):
            trace += self.data[i, i]
            reverse_trace += self.data[i, BOARD_ROWS - 1 - i]
        results.append(trace)
        results.append(reverse_trace)

        for result in results:
            if result == 3:
                self.winner = 1
                self.end = True
                return self.end
            if result == -3:
                self.winner = -1
                self.end = True
                return self.end

        # whether it's a tie
        sum_values = np.sum(np.abs(self.data))
        if sum_values == BOARD_SIZE:
            self.winner = 0
            self.end = True
            return self.end

        # game is still going on
        self.end = False
        return self.end

    # @symbol: 1 or -1
    # put chessman symbol in position (i, j)
    def next_state(self, i, j, symbol):
        new_state = State()
        new_state.data = np.copy(self.data)
        new_state.data[i, j] = symbol
        return new_state

    # print the board
    def print_state(self):
        for i in range(BOARD_ROWS):
            print('-------------')
            out = '| '
            for j in range(BOARD_COLS):
                if self.data[i, j] == 1:
                    token = '*'
                elif self.data[i, j] == -1:
                    token = 'x'
                else:
                    token = '0'
                out += token + ' | '
            print(out)
        print('-------------')


def get_all_states_impl(current_state, current_symbol, all_states):
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if current_state.data[i][j] == 0:
                new_state = current_state.next_state(i, j, current_symbol)
                new_hash = new_state.hash()
                if new_hash not in all_states:
                    is_end = new_state.is_end()
                    all_states[new_hash] = (new_state, is_end)
                    if not is_end:
                        get_all_states_impl(new_state, -current_symbol, all_states)


def get_all_states():
    current_symbol = 1
    current_state = State()
    all_states = dict()
    all_states[current_state.hash()] = (current_state, current_state.is_end())
    get_all_states_impl(current_state, current_symbol, all_states)
    return all_states


# all possible board configurations
all_states = get_all_states()


class Judger:
    # @player1: the player who will move first, its chessman will be 1
    # @player2: another player with a chessman -1
    def __init__(self, player1, player2):
        self.p1 = player1
        self.p2 = player2
        self.current_player = None
        self.p1_symbol = 1
        self.p2_symbol = -1
        self.p1.set_symbol(self.p1_symbol)
        self.p2.set_symbol(self.p2_symbol)
        self.current_state = State()

    def reset(self):
        self.p1.reset()
        self.p2.reset()

    def alternate(self):
        while True:
            yield self.p1
            yield self.p2

    # @print_state: if True, print each board during the game
    def play(self, print_state=False):
        alternator = self.alternate()
        self.reset()
        current_state = State()
        self.p1.set_state(current_state)
        self.p2.set_state(current_state)
        if print_state:
            current_state.print_state()
        while True:
            player = next(alternator)
            i, j, symbol = player.act()
            next_state_hash = current_state.next_state(i, j, symbol).hash()
            current_state, is_end = all_states[next_state_hash]
            self.p1.set_state(current_state)
            self.p2.set_state(current_state)
            if print_state:
                current_state.print_state()
            if is_end:
                return current_state.winner


# AI player
class Player:
    # @step_size: the step size to update estimations
    # @epsilon: the probability to explore
    def __init__(self, step_size=0.1, epsilon=0.1):
        self.estimations = dict()
        self.step_size = step_size
        self.epsilon = epsilon
        self.states = []
        self.greedy = []
        self.symbol = 0

    def reset(self):
        self.states = []
        self.greedy = []

    def set_state(self, state):
        self.states.append(state)
        self.greedy.append(True)

    def set_symbol(self, symbol):
        self.symbol = symbol
        for hash_val in all_states:
            state, is_end = all_states[hash_val]
            if is_end:
                if state.winner == self.symbol:
                    self.estimations[hash_val] = 1.0
                elif state.winner == 0:
                    # we need to distinguish between a tie and a loss
                    self.estimations[hash_val] = 0.5
                else:
                    self.estimations[hash_val] = 0
            else:
                self.estimations[hash_val] = 0.5

    # update value estimation
    def backup(self):
        states = [state.hash() for state in self.states]

        # sweep backwards; exploratory (non-greedy) moves contribute no update
        for i in reversed(range(len(states) - 1)):
            state = states[i]
            td_error = self.greedy[i] * (
                self.estimations[states[i + 1]] - self.estimations[state]
            )
            self.estimations[state] += self.step_size * td_error

    # choose an action based on the state
    def act(self):
        state = self.states[-1]
        next_states = []
        next_positions = []
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    next_positions.append([i, j])
                    next_states.append(state.next_state(
                        i, j, self.symbol).hash())

        if np.random.rand() < self.epsilon:
            action = next_positions[np.random.randint(len(next_positions))]
            action.append(self.symbol)
            self.greedy[-1] = False
            return action

        values = []
        for hash_val, pos in zip(next_states, next_positions):
            values.append((self.estimations[hash_val], pos))
        # shuffle first so that ties are broken at random: Python's sort is
        # stable, so actions of equal value keep their shuffled order
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        action = values[0][1]
        action.append(self.symbol)
        return action

    def save_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:
            pickle.dump(self.estimations, f)

    def load_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:
            self.estimations = pickle.load(f)


# human interface
# input a character to place a chessman
# | q | w | e |
# | a | s | d |
# | z | x | c |
class HumanPlayer:
    def __init__(self, **kwargs):
        self.symbol = None
        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']
        self.state = None

    def reset(self):
        pass

    def set_state(self, state):
        self.state = state

    def set_symbol(self, symbol):
        self.symbol = symbol

    def act(self):
        self.state.print_state()
        key = input("Input your position:")
        data = self.keys.index(key)
        i = data // BOARD_COLS
        j = data % BOARD_COLS
        return i, j, self.symbol


def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        player1.backup()
        player2.backup()
        judger.reset()
    player1.save_policy()
    player2.save_policy()


def compete(turns):
    player1 = Player(epsilon=0)
    player2 = Player(epsilon=0)
    judger = Judger(player1, player2)
    player1.load_policy()
    player2.load_policy()
    player1_win = 0.0
    player2_win = 0.0
    for _ in range(turns):
        winner = judger.play()
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        judger.reset()
    print('%d turns, player 1 win %.02f, player 2 win %.02f' % (turns, player1_win / turns, player2_win / turns))


# The game is a zero-sum game. If both players play an optimal strategy, every game ends in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")


if __name__ == '__main__':
    train(int(1e5))
    compete(int(1e3))
    play()
================================================
FILE: chapter02/ten_armed_testbed.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Tian Jun(tianjun.cpp@gmail.com) #
# 2016 Artem Oboturov(oboturov@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from tqdm import trange

matplotlib.use('Agg')


class Bandit:
    # @k_arm: # of arms
    # @epsilon: probability for exploration in epsilon-greedy algorithm
    # @initial: initial estimation for each action
    # @step_size: constant step size for updating estimations
    # @sample_averages: if True, use sample averages to update estimations instead of constant step size
    # @UCB_param: if not None, use UCB algorithm to select action
    # @gradient: if True, use gradient based bandit algorithm
    # @gradient_baseline: if True, use average reward as baseline for gradient based bandit algorithm
    # @true_reward: mean around which the true action values are drawn
    def __init__(self, k_arm=10, epsilon=0., initial=0., step_size=0.1, sample_averages=False, UCB_param=None,
                 gradient=False, gradient_baseline=False, true_reward=0.):
        self.k = k_arm
        self.step_size = step_size
        self.sample_averages = sample_averages
        self.indices = np.arange(self.k)
        self.time = 0
        self.UCB_param = UCB_param
        self.gradient = gradient
        self.gradient_baseline = gradient_baseline
        self.average_reward = 0
        self.true_reward = true_reward
        self.epsilon = epsilon
        self.initial = initial

    def reset(self):
        # real reward for each action
        self.q_true = np.random.randn(self.k) + self.true_reward

        # estimation for each action
        self.q_estimation = np.zeros(self.k) + self.initial

        # # of chosen times for each action
        self.action_count = np.zeros(self.k)

        self.best_action = np.argmax(self.q_true)

        self.time = 0

    # get an action for this bandit
    def act(self):
        if np.random.rand() < self.epsilon:
            return np.random.choice(self.indices)

        if self.UCB_param is not None:
            UCB_estimation = self.q_estimation + \
                self.UCB_param * np.sqrt(np.log(self.time + 1) / (self.action_count + 1e-5))
            q_best = np.max(UCB_estimation)
            return np.random.choice(np.where(UCB_estimation == q_best)[0])

        if self.gradient:
            exp_est = np.exp(self.q_estimation)
            self.action_prob = exp_est / np.sum(exp_est)
            return np.random.choice(self.indices, p=self.action_prob)

        q_best = np.max(self.q_estimation)
        return np.random.choice(np.where(self.q_estimation == q_best)[0])

    # take an action, update estimation for this action
    def step(self, action):
        # generate the reward under N(real reward, 1)
        reward = np.random.randn() + self.q_true[action]
        self.time += 1
        self.action_count[action] += 1
        self.average_reward += (reward - self.average_reward) / self.time

        if self.sample_averages:
            # update estimation using sample averages
            self.q_estimation[action] += (reward - self.q_estimation[action]) / self.action_count[action]
        elif self.gradient:
            one_hot = np.zeros(self.k)
            one_hot[action] = 1
            if self.gradient_baseline:
                baseline = self.average_reward
            else:
                baseline = 0
            self.q_estimation += self.step_size * (reward - baseline) * (one_hot - self.action_prob)
        else:
            # update estimation with constant step size
            self.q_estimation[action] += self.step_size * (reward - self.q_estimation[action])
        return reward


def simulate(runs, time, bandits):
    rewards = np.zeros((len(bandits), runs, time))
    best_action_counts = np.zeros(rewards.shape)
    for i, bandit in enumerate(bandits):
        for r in trange(runs):
            bandit.reset()
            for t in range(time):
                action = bandit.act()
                reward = bandit.step(action)
                rewards[i, r, t] = reward
                if action == bandit.best_action:
                    best_action_counts[i, r, t] = 1
    # average over runs
    mean_best_action_counts = best_action_counts.mean(axis=1)
    mean_rewards = rewards.mean(axis=1)
    return mean_best_action_counts, mean_rewards


def figure_2_1():
    plt.violinplot(dataset=np.random.randn(200, 10) + np.random.randn(10))
    plt.xlabel("Action")
    plt.ylabel("Reward distribution")
    plt.savefig('../images/figure_2_1.png')
    plt.close()


def figure_2_2(runs=2000, time=1000):
    epsilons = [0, 0.1, 0.01]
    bandits = [Bandit(epsilon=eps, sample_averages=True) for eps in epsilons]
    best_action_counts, rewards = simulate(runs, time, bandits)

    plt.figure(figsize=(10, 20))

    plt.subplot(2, 1, 1)
    for eps, reward_curve in zip(epsilons, rewards):
        # raw strings keep the LaTeX backslash from being read as an escape
        plt.plot(reward_curve, label=r'$\epsilon = %.02f$' % (eps))
    plt.xlabel('steps')
    plt.ylabel('average reward')
    plt.legend()

    plt.subplot(2, 1, 2)
    for eps, counts in zip(epsilons, best_action_counts):
        plt.plot(counts, label=r'$\epsilon = %.02f$' % (eps))
    plt.xlabel('steps')
    plt.ylabel('% optimal action')
    plt.legend()

    plt.savefig('../images/figure_2_2.png')
    plt.close()


def figure_2_3(runs=2000, time=1000):
    bandits = []
    bandits.append(Bandit(epsilon=0, initial=5, step_size=0.1))
    bandits.append(Bandit(epsilon=0.1, initial=0, step_size=0.1))
    best_action_counts, _ = simulate(runs, time, bandits)

    plt.plot(best_action_counts[0], label=r'$\epsilon = 0, q = 5$')
    plt.plot(best_action_counts[1], label=r'$\epsilon = 0.1, q = 0$')
    plt.xlabel('Steps')
    plt.ylabel('% optimal action')
    plt.legend()

    plt.savefig('../images/figure_2_3.png')
    plt.close()


def figure_2_4(runs=2000, time=1000):
    bandits = []
    bandits.append(Bandit(epsilon=0, UCB_param=2, sample_averages=True))
    bandits.append(Bandit(epsilon=0.1, sample_averages=True))
    _, average_rewards = simulate(runs, time, bandits)

    plt.plot(average_rewards[0], label='UCB $c = 2$')
    plt.plot(average_rewards[1], label=r'epsilon greedy $\epsilon = 0.1$')
    plt.xlabel('Steps')
    plt.ylabel('Average reward')
    plt.legend()

    plt.savefig('../images/figure_2_4.png')
    plt.close()


def figure_2_5(runs=2000, time=1000):
    bandits = []
    bandits.append(Bandit(gradient=True, step_size=0.1, gradient_baseline=True, true_reward=4))
    bandits.append(Bandit(gradient=True, step_size=0.1, gradient_baseline=False, true_reward=4))
    bandits.append(Bandit(gradient=True, step_size=0.4, gradient_baseline=True, true_reward=4))
    bandits.append(Bandit(gradient=True, step_size=0.4, gradient_baseline=False, true_reward=4))
    best_action_counts, _ = simulate(runs, time, bandits)
    labels = [r'$\alpha = 0.1$, with baseline',
              r'$\alpha = 0.1$, without baseline',
              r'$\alpha = 0.4$, with baseline',
              r'$\alpha = 0.4$, without baseline']

    for i in range(len(bandits)):
        plt.plot(best_action_counts[i], label=labels[i])
    plt.xlabel('Steps')
    plt.ylabel('% Optimal action')
    plt.legend()

    plt.savefig('../images/figure_2_5.png')
    plt.close()


def figure_2_6(runs=2000, time=1000):
    labels = ['epsilon-greedy', 'gradient bandit',
              'UCB', 'optimistic initialization']
    generators = [lambda epsilon: Bandit(epsilon=epsilon, sample_averages=True),
                  lambda alpha: Bandit(gradient=True, step_size=alpha, gradient_baseline=True),
                  lambda coef: Bandit(epsilon=0, UCB_param=coef, sample_averages=True),
                  lambda initial: Bandit(epsilon=0, initial=initial, step_size=0.1)]
    # exponents of 2 for each method's parameter; the builtin float is used
    # because the np.float alias was removed in NumPy 1.24
    parameters = [np.arange(-7, -1, dtype=float),
                  np.arange(-5, 2, dtype=float),
                  np.arange(-4, 3, dtype=float),
                  np.arange(-2, 3, dtype=float)]

    bandits = []
    for generator, parameter in zip(generators, parameters):
        for param in parameter:
            bandits.append(generator(pow(2, param)))

    _, average_rewards = simulate(runs, time, bandits)
    rewards = np.mean(average_rewards, axis=1)

    i = 0
    for label, parameter in zip(labels, parameters):
        l = len(parameter)
        plt.plot(parameter, rewards[i:i+l], label=label)
        i += l
    plt.xlabel('Parameter($2^x$)')
    plt.ylabel('Average reward')
    plt.legend()

    plt.savefig('../images/figure_2_6.png')
    plt.close()


if __name__ == '__main__':
    figure_2_1()
    figure_2_2()
    figure_2_3()
    figure_2_4()
    figure_2_5()
    figure_2_6()
================================================
FILE: chapter03/grid_world.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.table import Table
matplotlib.use('Agg')
WORLD_SIZE = 5
A_POS = [0, 1]
A_PRIME_POS = [4, 1]
B_POS = [0, 3]
B_PRIME_POS = [2, 3]
DISCOUNT = 0.9
# left, up, right, down
ACTIONS = [np.array([0, -1]),
np.array([-1, 0]),
np.array([0, 1]),
np.array([1, 0])]
ACTIONS_FIGS=[ '←', '↑', '→', '↓']
ACTION_PROB = 0.25
def step(state, action):
if state == A_POS:
return A_PRIME_POS, 10
if state == B_POS:
return B_PRIME_POS, 5
next_state = (np.array(state) + action).tolist()
x, y = next_state
if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
reward = -1.0
next_state = state
else:
reward = 0
return next_state, reward
def draw_image(image):
fig, ax = plt.subplots()
ax.set_axis_off()
tb = Table(ax, bbox=[0, 0, 1, 1])
nrows, ncols = image.shape
width, height = 1.0 / ncols, 1.0 / nrows
# Add cells
for (i, j), val in np.ndenumerate(image):
# add state labels
if [i, j] == A_POS:
val = str(val) + " (A)"
if [i, j] == A_PRIME_POS:
val = str(val) + " (A')"
if [i, j] == B_POS:
val = str(val) + " (B)"
if [i, j] == B_PRIME_POS:
val = str(val) + " (B')"
tb.add_cell(i, j, width, height, text=val,
loc='center', facecolor='white')
# Row and column labels...
for i in range(len(image)):
tb.add_cell(i, -1, width, height, text=i+1, loc='right',
edgecolor='none', facecolor='none')
tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',
edgecolor='none', facecolor='none')
ax.add_table(tb)
def draw_policy(optimal_values):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0, 0, 1, 1])
    nrows, ncols = optimal_values.shape
    width, height = 1.0 / ncols, 1.0 / nrows
    # Add cells
    for (i, j), val in np.ndenumerate(optimal_values):
        next_vals = []
        for action in ACTIONS:
            next_state, _ = step([i, j], action)
            next_vals.append(optimal_values[next_state[0], next_state[1]])
        best_actions = np.where(next_vals == np.max(next_vals))[0]
        val = ''
        for ba in best_actions:
            val += ACTIONS_FIGS[ba]
        # add state labels
        if [i, j] == A_POS:
            val = str(val) + " (A)"
        if [i, j] == A_PRIME_POS:
            val = str(val) + " (A')"
        if [i, j] == B_POS:
            val = str(val) + " (B)"
        if [i, j] == B_PRIME_POS:
            val = str(val) + " (B')"
        tb.add_cell(i, j, width, height, text=val,
                    loc='center', facecolor='white')
    # Row and column labels...
    for i in range(len(optimal_values)):
        tb.add_cell(i, -1, width, height, text=i+1, loc='right',
                    edgecolor='none', facecolor='none')
        tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',
                    edgecolor='none', facecolor='none')
    ax.add_table(tb)
def figure_3_2():
    value = np.zeros((WORLD_SIZE, WORLD_SIZE))
    while True:
        # keep iterating until convergence
        new_value = np.zeros_like(value)
        for i in range(WORLD_SIZE):
            for j in range(WORLD_SIZE):
                for action in ACTIONS:
                    (next_i, next_j), reward = step([i, j], action)
                    # Bellman equation for the equiprobable random policy
                    new_value[i, j] += ACTION_PROB * (reward + DISCOUNT * value[next_i, next_j])
        if np.sum(np.abs(value - new_value)) < 1e-4:
            draw_image(np.round(new_value, decimals=2))
            plt.savefig('../images/figure_3_2.png')
            plt.close()
            break
        value = new_value
def figure_3_2_linear_system():
    '''
    Solve the linear system of Bellman equations to find the exact state values.
    Each state contributes one row of coefficients to A and one right-hand-side constant to b.
    '''
    A = -1 * np.eye(WORLD_SIZE * WORLD_SIZE)
    b = np.zeros(WORLD_SIZE * WORLD_SIZE)
    for i in range(WORLD_SIZE):
        for j in range(WORLD_SIZE):
            s = [i, j]  # current state
            index_s = np.ravel_multi_index(s, (WORLD_SIZE, WORLD_SIZE))
            for a in ACTIONS:
                s_, r = step(s, a)
                index_s_ = np.ravel_multi_index(s_, (WORLD_SIZE, WORLD_SIZE))
                A[index_s, index_s_] += ACTION_PROB * DISCOUNT
                b[index_s] -= ACTION_PROB * r
    x = np.linalg.solve(A, b)
    draw_image(np.round(x.reshape(WORLD_SIZE, WORLD_SIZE), decimals=2))
    plt.savefig('../images/figure_3_2_linear_system.png')
    plt.close()
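# Illustrative sketch (not part of the original script): the same exact-solution idea in
# generic form. For a fixed policy with transition matrix P and expected reward vector r,
# the Bellman equations v = r + discount * P v rearrange to (I - discount * P) v = r.
# This is the A = -I + discount * P, b = -r system built above, multiplied by -1.
# The name solve_state_values and its arguments are assumptions made for illustration.
import numpy as np


def solve_state_values(transition, reward, discount):
    """Return exact state values for a fixed policy by solving a linear system."""
    transition = np.asarray(transition, dtype=float)
    reward = np.asarray(reward, dtype=float)
    n_states = transition.shape[0]
    return np.linalg.solve(np.eye(n_states) - discount * transition, reward)


# Usage on a hypothetical two-state chain that alternates between its states,
# paying 1 when leaving state 0 and 0 when leaving state 1.
example_values = solve_state_values([[0.0, 1.0], [1.0, 0.0]], [1.0, 0.0], 0.5)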
def figure_3_5():
    value = np.zeros((WORLD_SIZE, WORLD_SIZE))
    while True:
        # keep iterating until convergence
        new_value = np.zeros_like(value)
        for i in range(WORLD_SIZE):
            for j in range(WORLD_SIZE):
                values = []
                for action in ACTIONS:
                    (next_i, next_j), reward = step([i, j], action)
                    # value iteration
                    values.append(reward + DISCOUNT * value[next_i, next_j])
                new_value[i, j] = np.max(values)
        if np.sum(np.abs(new_value - value)) < 1e-4:
            draw_image(np.round(new_value, decimals=2))
            plt.savefig('../images/figure_3_5.png')
            plt.close()
            draw_policy(new_value)
            plt.savefig('../images/figure_3_5_policy.png')
            plt.close()
            break
        value = new_value
if __name__ == '__main__':
    figure_3_2_linear_system()
    figure_3_2()
    figure_3_5()
================================================
FILE: chapter04/car_rental.py
================================================
#######################################################################
# Copyright (C) #
# 2016 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# 2017 Aja Rangaswamy (aja004@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.stats import poisson
matplotlib.use('Agg')
# maximum # of cars in each location
MAX_CARS = 20
# maximum # of cars to move during night
MAX_MOVE_OF_CARS = 5
# expectation for rental requests in first location
RENTAL_REQUEST_FIRST_LOC = 3
# expectation for rental requests in second location
RENTAL_REQUEST_SECOND_LOC = 4
# expectation for # of cars returned in first location
RETURNS_FIRST_LOC = 3
# expectation for # of cars returned in second location
RETURNS_SECOND_LOC = 2
DISCOUNT = 0.9
# credit earned by a car
RENTAL_CREDIT = 10
# cost of moving a car
MOVE_CAR_COST = 2
# all possible actions
actions = np.arange(-MAX_MOVE_OF_CARS, MAX_MOVE_OF_CARS + 1)
# An upper bound for the Poisson distribution
# If n is greater than or equal to this value, the probability of getting n is truncated to 0
POISSON_UPPER_BOUND = 11
# Probability mass function of the Poisson distribution, cached per (n, lam) pair
# @lam: the expectation; it should be less than 10 because the cache key is n * 10 + lam
poisson_cache = dict()
def poisson_probability(n, lam):
    global poisson_cache
    key = n * 10 + lam
    if key not in poisson_cache:
        poisson_cache[key] = poisson.pmf(n, lam)
    return poisson_cache[key]
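# Illustrative alternative (an assumption, not the original author's approach): cache on the
# (n, lam) pair itself with functools.lru_cache, so the "lam less than 10" restriction of the
# n * 10 + lam key is not needed. The helper name poisson_pmf_cached is hypothetical; it
# evaluates the Poisson pmf exp(-lam) * lam**n / n! directly instead of calling scipy.
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def poisson_pmf_cached(n, lam):
    """Poisson probability of observing n events when the expectation is lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)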
def expected_return(state, action, state_value, constant_returned_cars):
    """
    @state: [# of cars in first location, # of cars in second location]
    @action: positive if moving cars from first location to second location,
            negative if moving cars from second location to first location
    @state_value: state value matrix
    @constant_returned_cars: if set True, the model is simplified such that
    the # of cars returned in daytime becomes constant
    rather than a random value from the Poisson distribution, which reduces calculation time
    and leaves the optimal policy/value state matrix almost the same
    """
    # initialize total return
    returns = 0.0
    # cost for moving cars
    returns -= MOVE_CAR_COST * abs(action)
    # moving cars
    NUM_OF_CARS_FIRST_LOC = min(state[0] - action, MAX_CARS)
    NUM_OF_CARS_SECOND_LOC = min(state[1] + action, MAX_CARS)
    # go through all possible rental requests
    for rental_request_first_loc in range(POISSON_UPPER_BOUND):
        for rental_request_second_loc in range(POISSON_UPPER_BOUND):
            # probability for current combination of rental requests
            prob = poisson_probability(rental_request_first_loc, RENTAL_REQUEST_FIRST_LOC) * \
                poisson_probability(rental_request_second_loc, RENTAL_REQUEST_SECOND_LOC)
            num_of_cars_first_loc = NUM_OF_CARS_FIRST_LOC
            num_of_cars_second_loc = NUM_OF_CARS_SECOND_LOC
            # valid rentals cannot exceed the actual # of cars
            valid_rental_first_loc = min(num_of_cars_first_loc, rental_request_first_loc)
            valid_rental_second_loc = min(num_of_cars_second_loc, rental_request_second_loc)
            # get credits for renting
            reward = (valid_rental_first_loc + valid_rental_second_loc) * RENTAL_CREDIT
            num_of_cars_first_loc -= valid_rental_first_loc
            num_of_cars_second_loc -= valid_rental_second_loc
            if constant_returned_cars:
                # get returned cars, those cars can be used for renting tomorrow
                returned_cars_first_loc = RETURNS_FIRST_LOC
                returned_cars_second_loc = RETURNS_SECOND_LOC
                num_of_cars_first_loc = min(num_of_cars_first_loc + returned_cars_first_loc, MAX_CARS)
                num_of_cars_second_loc = min(num_of_cars_second_loc + returned_cars_second_loc, MAX_CARS)
                returns += prob * (reward + DISCOUNT * state_value[num_of_cars_first_loc, num_of_cars_second_loc])
            else:
                for returned_cars_first_loc in range(POISSON_UPPER_BOUND):
                    for returned_cars_second_loc in range(POISSON_UPPER_BOUND):
                        prob_return = poisson_probability(
                            returned_cars_first_loc, RETURNS_FIRST_LOC) * poisson_probability(returned_cars_second_loc, RETURNS_SECOND_LOC)
                        num_of_cars_first_loc_ = min(num_of_cars_first_loc + returned_cars_first_loc, MAX_CARS)
                        num_of_cars_second_loc_ = min(num_of_cars_second_loc + returned_cars_second_loc, MAX_CARS)
                        prob_ = prob_return * prob
                        returns += prob_ * (reward + DISCOUNT *
                                            state_value[num_of_cars_first_loc_, num_of_cars_second_loc_])
    return returns
def figure_4_2(constant_returned_cars=True):
    value = np.zeros((MAX_CARS + 1, MAX_CARS + 1))
    # the np.int alias was removed in NumPy 1.24; the builtin int is equivalent
    policy = np.zeros(value.shape, dtype=int)
    iterations = 0
    _, axes = plt.subplots(2, 3, figsize=(40, 20))
    plt.subplots_adjust(wspace=0.1, hspace=0.2)
    axes = axes.flatten()
    while True:
        fig = sns.heatmap(np.flipud(policy), cmap="YlGnBu", ax=axes[iterations])
        fig.set_ylabel('# cars at first location', fontsize=30)
        fig.set_yticks(list(reversed(range(MAX_CARS + 1))))
        fig.set_xlabel('# cars at second location', fontsize=30)
        fig.set_title('policy {}'.format(iterations), fontsize=30)
        # policy evaluation (in-place)
        while True:
            old_value = value.copy()
            for i in range(MAX_CARS + 1):
                for j in range(MAX_CARS + 1):
                    new_state_value = expected_return([i, j], policy[i, j], value, constant_returned_cars)
                    value[i, j] = new_state_value
            max_value_change = abs(old_value - value).max()
            print('max value change {}'.format(max_value_change))
            if max_value_change < 1e-4:
                break
        # policy improvement
        policy_stable = True
        for i in range(MAX_CARS + 1):
            for j in range(MAX_CARS + 1):
                old_action = policy[i, j]
                action_returns = []
                for action in actions:
                    if (0 <= action <= i) or (-j <= action <= 0):
                        action_returns.append(expected_return([i, j], action, value, constant_returned_cars))
                    else:
                        action_returns.append(-np.inf)
                new_action = actions[np.argmax(action_returns)]
                policy[i, j] = new_action
                if policy_stable and old_action != new_action:
                    policy_stable = False
        print('policy stable {}'.format(policy_stable))
        if policy_stable:
            fig = sns.heatmap(np.flipud(value), cmap="YlGnBu", ax=axes[-1])
            fig.set_ylabel('# cars at first location', fontsize=30)
            fig.set_yticks(list(reversed(range(MAX_CARS + 1))))
            fig.set_xlabel('# cars at second location', fontsize=30)
            fig.set_title('optimal value', fontsize=30)
            break
        iterations += 1
    plt.savefig('../images/figure_4_2.png')
    plt.close()
if __name__ == '__main__':
    figure_4_2()
================================================
FILE: chapter04/car_rental_synchronous.py
================================================
#######################################################################
# Copyright (C) #
# 2016 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# 2017 Aja Rangaswamy (aja004@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
# This file was contributed by Tahsincan Köse. It implements synchronous policy evaluation, whereas car_rental.py
# implements asynchronous (in-place) policy evaluation. It also uses multiprocessing for acceleration and contains
# an answer to Exercise 4.5.
import numpy as np
import matplotlib.pyplot as plt
import math
import tqdm
import multiprocessing as mp
from functools import partial
import time
import itertools
############# PROBLEM SPECIFIC CONSTANTS #######################
MAX_CARS = 20
MAX_MOVE = 5
MOVE_COST = -2
ADDITIONAL_PARK_COST = -4
RENT_REWARD = 10
# expectation for rental requests in first location
RENTAL_REQUEST_FIRST_LOC = 3
# expectation for rental requests in second location
RENTAL_REQUEST_SECOND_LOC = 4
# expectation for # of cars returned in first location
RETURNS_FIRST_LOC = 3
# expectation for # of cars returned in second location
RETURNS_SECOND_LOC = 2
################################################################
poisson_cache = dict()
def poisson(n, lam):
    global poisson_cache
    key = n * 10 + lam
    if key not in poisson_cache:
        poisson_cache[key] = math.exp(-lam) * math.pow(lam, n) / math.factorial(n)
    return poisson_cache[key]
class PolicyIteration:
    def __init__(self, truncate, parallel_processes, delta=1e-2, gamma=0.9, solve_4_5=False):
        self.TRUNCATE = truncate
        self.NR_PARALLEL_PROCESSES = parallel_processes
        self.actions = np.arange(-MAX_MOVE, MAX_MOVE + 1)
        self.inverse_actions = {el: ind[0] for ind, el in np.ndenumerate(self.actions)}
        self.values = np.zeros((MAX_CARS + 1, MAX_CARS + 1))
        # the np.int alias was removed in NumPy 1.24; the builtin int is equivalent
        self.policy = np.zeros(self.values.shape, dtype=int)
        self.delta = delta
        self.gamma = gamma
        self.solve_extension = solve_4_5
    def solve(self):
        iterations = 0
        total_start_time = time.time()
        while True:
            start_time = time.time()
            self.values = self.policy_evaluation(self.values, self.policy)
            elapsed_time = time.time() - start_time
            print(f'PE => Elapsed time {elapsed_time} seconds')
            start_time = time.time()
            policy_change, self.policy = self.policy_improvement(self.actions, self.values, self.policy)
            elapsed_time = time.time() - start_time
            print(f'PI => Elapsed time {elapsed_time} seconds')
            if policy_change == 0:
                break
            iterations += 1
        total_elapsed_time = time.time() - total_start_time
        print(f'Optimal policy is reached after {iterations} iterations in {total_elapsed_time} seconds')
    # out-of-place (synchronous) evaluation: every sweep reads only the previous sweep's values
    def policy_evaluation(self, values, policy):
        while True:
            new_values = np.copy(values)
            k = np.arange(MAX_CARS + 1)
            # cartesian product
            all_states = ((i, j) for i, j in itertools.product(k, k))
            results = []
            with mp.Pool(processes=self.NR_PARALLEL_PROCESSES) as p:
                cook = partial(self.expected_return_pe, policy, values)
                results = p.map(cook, all_states)
            for v, i, j in results:
                new_values[i, j] = v
            difference = np.abs(new_values - values).sum()
            print(f'Difference: {difference}')
            values = new_values
            if difference < self.delta:
                print('Values have converged!')
                return values
    def policy_improvement(self, actions, values, policy):
        new_policy = np.copy(policy)
        expected_action_returns = np.zeros((MAX_CARS + 1, MAX_CARS + 1, np.size(actions)))
        cooks = dict()
        # use the configured number of worker processes instead of a hard-coded 8
        with mp.Pool(processes=self.NR_PARALLEL_PROCESSES) as p:
            for action in actions:
                k = np.arange(MAX_CARS + 1)
                all_states = ((i, j) for i, j in itertools.product(k, k))
                cooks[action] = partial(self.expected_return_pi, values, action)
                results = p.map(cooks[action], all_states)
                for v, i, j, a in results:
                    expected_action_returns[i, j, self.inverse_actions[a]] = v
        for i in range(expected_action_returns.shape[0]):
            for j in range(expected_action_returns.shape[1]):
                new_policy[i, j] = actions[np.argmax(expected_action_returns[i, j])]
        policy_change = (new_policy != policy).sum()
        print(f'Policy changed in {policy_change} states')
        return policy_change, new_policy
    # O(n^4) computation for all possible requests and returns
    def bellman(self, values, action, state):
        expected_return = 0
        if self.solve_extension:
            if action > 0:
                # Free shuttle to the second location
                expected_return += MOVE_COST * (action - 1)
            else:
                expected_return += MOVE_COST * abs(action)
        else:
            expected_return += MOVE_COST * abs(action)
        for req1 in range(0, self.TRUNCATE):
            for req2 in range(0, self.TRUNCATE):
                # moving cars
                num_of_cars_first_loc = int(min(state[0] - action, MAX_CARS))
                num_of_cars_second_loc = int(min(state[1] + action, MAX_CARS))
                # valid rentals cannot exceed the actual # of cars
                real_rental_first_loc = min(num_of_cars_first_loc, req1)
                real_rental_second_loc = min(num_of_cars_second_loc, req2)
                # get credits for renting
                reward = (real_rental_first_loc + real_rental_second_loc) * RENT_REWARD
                if self.solve_extension:
                    # Exercise 4.5: extra parking is charged when more than 10 cars
                    # are kept overnight at a location (after any moving of cars)
                    if num_of_cars_first_loc > 10:
                        reward += ADDITIONAL_PARK_COST
                    if num_of_cars_second_loc > 10:
                        reward += ADDITIONAL_PARK_COST
                num_of_cars_first_loc -= real_rental_first_loc
                num_of_cars_second_loc -= real_rental_second_loc
                # probability for current combination of rental requests
                prob = poisson(req1, RENTAL_REQUEST_FIRST_LOC) * \
                    poisson(req2, RENTAL_REQUEST_SECOND_LOC)
                for ret1 in range(0, self.TRUNCATE):
                    for ret2 in range(0, self.TRUNCATE):
                        num_of_cars_first_loc_ = min(num_of_cars_first_loc + ret1, MAX_CARS)
                        num_of_cars_second_loc_ = min(num_of_cars_second_loc + ret2, MAX_CARS)
                        prob_ = poisson(ret1, RETURNS_FIRST_LOC) * \
                            poisson(ret2, RETURNS_SECOND_LOC) * prob
                        # Classic Bellman equation for state-value
                        # prob_ corresponds to p(s'|s,a) for each possible s' -> (num_of_cars_first_loc_,num_of_cars_second_loc_)
                        expected_return += prob_ * (
                            reward + self.gamma * values[num_of_cars_first_loc_, num_of_cars_second_loc_])
        return expected_return
    # Parallelization requires separate helper functions that can be passed to Pool.map
    # Expected return calculator for Policy Evaluation
    def expected_return_pe(self, policy, values, state):
        action = policy[state[0], state[1]]
        expected_return = self.bellman(values, action, state)
        return expected_return, state[0], state[1]
    # Expected return calculator for Policy Improvement
    def expected_return_pi(self, values, action, state):
        # infeasible actions (moving more cars than are available) get a return of -inf
        if not ((action >= 0 and state[0] >= action) or (action < 0 and state[1] >= abs(action))):
            return -float('inf'), state[0], state[1], action
        expected_return = self.bellman(values, action, state)
        return expected_return, state[0], state[1], action
    def plot(self):
        print(self.policy)
        plt.figure()
        plt.xlim(0, MAX_CARS + 1)
        plt.ylim(0, MAX_CARS + 1)
        plt.table(cellText=np.flipud(self.policy), loc=(0, 0), cellLoc='center')
        plt.show()
if __name__ == '__main__':
    TRUNCATE = 9
    solver = PolicyIteration(TRUNCATE, parallel_processes=4, delta=1e-1, gamma=0.9, solve_4_5=True)
    solver.solve()
    solver.plot()
================================================
FILE: chapter04/gamblers_problem.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
matplotlib.use('Agg')
# goal
GOAL = 100
# all states, including state 0 and state 100
STATES = np.arange(GOAL + 1)
# probability of the coin coming up heads
HEAD_PROB = 0.4
def figure_4_3():
    # state value
    state_value = np.zeros(GOAL + 1)
    state_value[GOAL] = 1.0
    sweeps_history = []
    # value iteration
    while True:
        old_state_value = state_value.copy()
        sweeps_history.append(old_state_value)
        for state in STATES[1:GOAL]:
            # get possible actions for current state
            actions = np.arange(min(state, GOAL - state) + 1)
            action_returns = []
            for action in actions:
                action_returns.append(
                    HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action])
            new_value = np.max(action_returns)
            state_value[state] = new_value
        delta = abs(state_value - old_state_value).max()
        if delta < 1e-9:
            sweeps_history.append(state_value)
            break
    # compute the optimal policy
    policy = np.zeros(GOAL + 1)
    for state in STATES[1:GOAL]:
        actions = np.arange(min(state, GOAL - state) + 1)
        action_returns = []
        for action in actions:
            action_returns.append(
                HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action])
        # round to resemble the figure in the book, see
        # https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/issues/83
        policy[state] = actions[np.argmax(np.round(action_returns[1:], 5)) + 1]
    plt.figure(figsize=(10, 20))
    plt.subplot(2, 1, 1)
    for sweep, state_value in enumerate(sweeps_history):
        plt.plot(state_value, label='sweep {}'.format(sweep))
    plt.xlabel('Capital')
    plt.ylabel('Value estimates')
    plt.legend(loc='best')
    plt.subplot(2, 1, 2)
    plt.scatter(STATES, policy)
    plt.xlabel('Capital')
    plt.ylabel('Final policy (stake)')
    plt.savefig('../images/figure_4_3.png')
    plt.close()
if __name__ == '__main__':
    figure_4_3()
================================================
FILE: chapter04/grid_world.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.table import Table
matplotlib.use('Agg')
WORLD_SIZE = 4
# left, up, right, down
ACTIONS = [np.array([0, -1]),
           np.array([-1, 0]),
           np.array([0, 1]),
           np.array([1, 0])]
ACTION_PROB = 0.25
def is_terminal(state):
    x, y = state
    return (x == 0 and y == 0) or (x == WORLD_SIZE - 1 and y == WORLD_SIZE - 1)
def step(state, action):
    if is_terminal(state):
        return state, 0
    next_state = (np.array(state) + action).tolist()
    x, y = next_state
    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
        next_state = state
    reward = -1
    return next_state, reward
def draw_image(image):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0, 0, 1, 1])
    nrows, ncols = image.shape
    width, height = 1.0 / ncols, 1.0 / nrows
    # Add cells
    for (i, j), val in np.ndenumerate(image):
        tb.add_cell(i, j, width, height, text=val,
                    loc='center', facecolor='white')
    # Row and column labels...
    for i in range(len(image)):
        tb.add_cell(i, -1, width, height, text=i+1, loc='right',
                    edgecolor='none', facecolor='none')
        tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',
                    edgecolor='none', facecolor='none')
    ax.add_table(tb)
def compute_state_value(in_place=True, discount=1.0):
    new_state_values = np.zeros((WORLD_SIZE, WORLD_SIZE))
    iteration = 0
    while True:
        if in_place:
            state_values = new_state_values
        else:
            state_values = new_state_values.copy()
        old_state_values = state_values.copy()
        for i in range(WORLD_SIZE):
            for j in range(WORLD_SIZE):
                value = 0
                for action in ACTIONS:
                    (next_i, next_j), reward = step([i, j], action)
                    value += ACTION_PROB * (reward + discount * state_values[next_i, next_j])
                new_state_values[i, j] = value
        max_delta_value = abs(old_state_values - new_state_values).max()
        if max_delta_value < 1e-4:
            break
        iteration += 1
    return new_state_values, iteration
def figure_4_1():
    # While the author suggests using in-place iterative policy evaluation,
    # Figure 4.1 actually uses the out-of-place version.
    _, async_iteration = compute_state_value(in_place=True)
    values, sync_iteration = compute_state_value(in_place=False)
    draw_image(np.round(values, decimals=2))
    print('In-place: {} iterations'.format(async_iteration))
    print('Synchronous: {} iterations'.format(sync_iteration))
    plt.savefig('../images/figure_4_1.png')
    plt.close()
if __name__ == '__main__':
    figure_4_1()
================================================
FILE: chapter05/blackjack.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# 2017 Nicky van Foreest(vanforeest@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
# actions: hit or stand
ACTION_HIT = 0
ACTION_STAND = 1  # "stick" in the book
ACTIONS = [ACTION_HIT, ACTION_STAND]
# policy for player
# the np.int alias was removed in NumPy 1.24; the builtin int is equivalent
POLICY_PLAYER = np.zeros(22, dtype=int)
for i in range(12, 20):
    POLICY_PLAYER[i] = ACTION_HIT
POLICY_PLAYER[20] = ACTION_STAND
POLICY_PLAYER[21] = ACTION_STAND
# function form of target policy of player
def target_policy_player(usable_ace_player, player_sum, dealer_card):
    return POLICY_PLAYER[player_sum]
# function form of behavior policy of player
def behavior_policy_player(usable_ace_player, player_sum, dealer_card):
    if np.random.binomial(1, 0.5) == 1:
        return ACTION_STAND
    return ACTION_HIT
# policy for dealer
POLICY_DEALER = np.zeros(22)
for i in range(12, 17):
    POLICY_DEALER[i] = ACTION_HIT
for i in range(17, 22):
    POLICY_DEALER[i] = ACTION_STAND
# get a new card
def get_card():
    card = np.random.randint(1, 14)
    card = min(card, 10)
    return card
# get the value of a card (11 for ace).
def card_value(card_id):
    return 11 if card_id == 1 else card_id
# play a game
# @policy_player: specify policy for player
# @initial_state: [whether player has a usable Ace, sum of player's cards, one card of dealer]
# @initial_action: the initial action
def play(policy_player, initial_state=None, initial_action=None):
    # player status
    # sum of player
    player_sum = 0
    # trajectory of player
    player_trajectory = []
    # whether player uses Ace as 11
    usable_ace_player = False
    # dealer status
    dealer_card1 = 0
    dealer_card2 = 0
    usable_ace_dealer = False
    if initial_state is None:
        # generate a random initial state
        while player_sum < 12:
            # if sum of player is less than 12, always hit
            card = get_card()
            player_sum += card_value(card)
            # If the player's sum is larger than 21, he may hold one or two aces.
            if player_sum > 21:
                assert player_sum == 22
                # last card must be ace
                player_sum -= 10
            else:
                usable_ace_player |= (1 == card)
        # initialize cards of dealer, suppose dealer will show the first card he gets
        dealer_card1 = get_card()
        dealer_card2 = get_card()
    else:
        # use specified initial state
        usable_ace_player, player_sum, dealer_card1 = initial_state
        dealer_card2 = get_card()
    # initial state of the game
    state = [usable_ace_player, player_sum, dealer_card1]
    # initialize dealer's sum
    dealer_sum = card_value(dealer_card1) + card_value(dealer_card2)
    usable_ace_dealer = 1 in (dealer_card1, dealer_card2)
    # if the dealer's sum is larger than 21, he must hold two aces.
    if dealer_sum > 21:
        assert dealer_sum == 22
        # use one Ace as 1 rather than 11
        dealer_sum -= 10
    assert dealer_sum <= 21
    assert player_sum <= 21
    # game starts!
    # player's turn
    while True:
        if initial_action is not None:
            action = initial_action
            initial_action = None
        else:
            # get action based on current sum
            action = policy_player(usable_ace_player, player_sum, dealer_card1)
        # track player's trajectory for importance sampling
        player_trajectory.append([(usable_ace_player, player_sum, dealer_card1), action])
        if action == ACTION_STAND:
            break
        # if hit, get new card
        card = get_card()
        # Keep track of the ace count. the usable_ace_player flag is insufficient alone as it cannot
        # distinguish between having one ace or two.
        ace_count = int(usable_ace_player)
        if card == 1:
            ace_count += 1
        player_sum += card_value(card)
        # If the player has a usable ace, use it as 1 to avoid busting and continue.
        while player_sum > 21 and ace_count:
            player_sum -= 10
            ace_count -= 1
        # player busts
        if player_sum > 21:
            return state, -1, player_trajectory
        assert player_sum <= 21
        usable_ace_player = (ace_count == 1)
    # dealer's turn
    while True:
        # get action based on current sum
        action = POLICY_DEALER[dealer_sum]
        if action == ACTION_STAND:
            break
        # if hit, get a new card
        new_card = get_card()
        ace_count = int(usable_ace_dealer)
        if new_card == 1:
            ace_count += 1
        dealer_sum += card_value(new_card)
        # If the dealer has a usable ace, use it as 1 to avoid busting and continue.
        while dealer_sum > 21 and ace_count:
            dealer_sum -= 10
            ace_count -= 1
        # dealer busts
        if dealer_sum > 21:
            return state, 1, player_trajectory
        usable_ace_dealer = (ace_count == 1)
    # compare the sum between player and dealer
    assert player_sum <= 21 and dealer_sum <= 21
    if player_sum > dealer_sum:
        return state, 1, player_trajectory
    elif player_sum == dealer_sum:
        return state, 0, player_trajectory
    else:
        return state, -1, player_trajectory
# Monte Carlo sampling, on-policy
def monte_carlo_on_policy(episodes):
    states_usable_ace = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by 0
    states_usable_ace_count = np.ones((10, 10))
    states_no_usable_ace = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by 0
    states_no_usable_ace_count = np.ones((10, 10))
    for i in tqdm(range(0, episodes)):
        _, reward, player_trajectory = play(target_policy_player)
        for (usable_ace, player_sum, dealer_card), _ in player_trajectory:
            player_sum -= 12
            dealer_card -= 1
            if usable_ace:
                states_usable_ace_count[player_sum, dealer_card] += 1
                states_usable_ace[player_sum, dealer_card] += reward
            else:
                states_no_usable_ace_count[player_sum, dealer_card] += 1
                states_no_usable_ace[player_sum, dealer_card] += reward
    return states_usable_ace / states_usable_ace_count, states_no_usable_ace / states_no_usable_ace_count
# Monte Carlo with Exploring Starts
def monte_carlo_es(episodes):
    # array axes: (player_sum, dealer_card, usable_ace, action)
    state_action_values = np.zeros((10, 10, 2, 2))
    # initialize counts to 1 to avoid division by 0
    state_action_pair_count = np.ones((10, 10, 2, 2))
    # behavior policy is greedy
    def behavior_policy(usable_ace, player_sum, dealer_card):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        # get argmax of the average returns(s, a)
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :] / \
                  state_action_pair_count[player_sum, dealer_card, usable_ace, :]
        return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
    # play for several episodes
    for episode in tqdm(range(episodes)):
        # for each episode, use a randomly initialized state and action
        initial_state = [bool(np.random.choice([0, 1])),
                         np.random.choice(range(12, 22)),
                         np.random.choice(range(1, 11))]
        initial_action = np.random.choice(ACTIONS)
        current_policy = behavior_policy if episode else target_policy_player
        _, reward, trajectory = play(current_policy, initial_state, initial_action)
        first_visit_check = set()
        for (usable_ace, player_sum, dealer_card), action in trajectory:
            usable_ace = int(usable_ace)
            player_sum -= 12
            dealer_card -= 1
            state_action = (usable_ace, player_sum, dealer_card, action)
            if state_action in first_visit_check:
                continue
            first_visit_check.add(state_action)
            # update values of state-action pairs
            state_action_values[player_sum, dealer_card, usable_ace, action] += reward
            state_action_pair_count[player_sum, dealer_card, usable_ace, action] += 1
    return state_action_values / state_action_pair_count
# Monte Carlo sampling, off-policy
def monte_carlo_off_policy(episodes):
    initial_state = [True, 13, 2]
    rhos = []
    returns = []
    for i in range(0, episodes):
        _, reward, player_trajectory = play(behavior_policy_player, initial_state=initial_state)
        # get the importance ratio
        numerator = 1.0
        denominator = 1.0
        for (usable_ace, player_sum, dealer_card), action in player_trajectory:
            if action == target_policy_player(usable_ace, player_sum, dealer_card):
                denominator *= 0.5
            else:
                numerator = 0.0
                break
        rho = numerator / denominator
        rhos.append(rho)
        returns.append(reward)
    rhos = np.asarray(rhos)
    returns = np.asarray(returns)
    weighted_returns = rhos * returns
    weighted_returns = np.add.accumulate(weighted_returns)
    rhos = np.add.accumulate(rhos)
    ordinary_sampling = weighted_returns / np.arange(1, episodes + 1)
    with np.errstate(divide='ignore', invalid='ignore'):
        weighted_sampling = np.where(rhos != 0, weighted_returns / rhos, 0)
    return ordinary_sampling, weighted_sampling
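# Illustrative sketch (not part of the original script): the two estimators computed above,
# isolated on plain arrays. Ordinary importance sampling divides the running sum of
# rho * return by the number of episodes; weighted importance sampling divides it by the
# running sum of rho, and is taken to be 0 while that sum is still 0.
# The function name importance_sampling_estimates is an assumption made for illustration.
import numpy as np


def importance_sampling_estimates(rhos, returns):
    """Return running (ordinary, weighted) importance-sampling estimates."""
    rhos = np.asarray(rhos, dtype=float)
    returns = np.asarray(returns, dtype=float)
    weighted_sum = np.cumsum(rhos * returns)
    rho_sum = np.cumsum(rhos)
    ordinary = weighted_sum / np.arange(1, len(rhos) + 1)
    # divide only where the accumulated ratio is non-zero; leave 0 elsewhere
    weighted = np.divide(weighted_sum, rho_sum,
                         out=np.zeros_like(weighted_sum), where=rho_sum != 0)
    return ordinary, weighted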
def figure_5_1():
    states_usable_ace_1, states_no_usable_ace_1 = monte_carlo_on_policy(10000)
    states_usable_ace_2, states_no_usable_ace_2 = monte_carlo_on_policy(500000)
    states = [states_usable_ace_1,
              states_usable_ace_2,
              states_no_usable_ace_1,
              states_no_usable_ace_2]
    titles = ['Usable Ace, 10000 Episodes',
              'Usable Ace, 500000 Episodes',
              'No Usable Ace, 10000 Episodes',
              'No Usable Ace, 500000 Episodes']
    _, axes = plt.subplots(2, 2, figsize=(40, 30))
    plt.subplots_adjust(wspace=0.1, hspace=0.2)
    axes = axes.flatten()
    for state, title, axis in zip(states, titles, axes):
        fig = sns.heatmap(np.flipud(state), cmap="YlGnBu", ax=axis, xticklabels=range(1, 11),
                          yticklabels=list(reversed(range(12, 22))))
        fig.set_ylabel('player sum', fontsize=30)
        fig.set_xlabel('dealer showing', fontsize=30)
        fig.set_title(title, fontsize=30)
    plt.savefig('../images/figure_5_1.png')
    plt.close()
def figure_5_2():
    state_action_values = monte_carlo_es(500000)
    state_value_no_usable_ace = np.max(state_action_values[:, :, 0, :], axis=-1)
    state_value_usable_ace = np.max(state_action_values[:, :, 1, :], axis=-1)
    # get the optimal policy
    action_no_usable_ace = np.argmax(state_action_values[:, :, 0, :], axis=-1)
    action_usable_ace = np.argmax(state_action_values[:, :, 1, :], axis=-1)
    images = [action_usable_ace,
              state_value_usable_ace,
              action_no_usable_ace,
              state_value_no_usable_ace]
    titles = ['Optimal policy with usable Ace',
              'Optimal value with usable Ace',
              'Optimal policy without usable Ace',
              'Optimal value without usable Ace']
    _, axes = plt.subplots(2, 2, figsize=(40, 30))
    plt.subplots_adjust(wspace=0.1, hspace=0.2)
    axes = axes.flatten()
    for image, title, axis in zip(images, titles, axes):
        fig = sns.heatmap(np.flipud(image), cmap="YlGnBu", ax=axis, xticklabels=range(1, 11),
                          yticklabels=list(reversed(range(12, 22))))
        fig.set_ylabel('player sum', fontsize=30)
        fig.set_xlabel('dealer showing', fontsize=30)
        fig.set_title(title, fontsize=30)
    plt.savefig('../images/figure_5_2.png')
    plt.close()
def figure_5_3():
    true_value = -0.27726
    episodes = 10000
    runs = 100
    error_ordinary = np.zeros(episodes)
    error_weighted = np.zeros(episodes)
    for i in tqdm(range(0, runs)):
        ordinary_sampling_, weighted_sampling_ = monte_carlo_off_policy(episodes)
        # get the squared error
        error_ordinary += np.power(ordinary_sampling_ - true_value, 2)
        error_weighted += np.power(weighted_sampling_ - true_value, 2)
    error_ordinary /= runs
    error_weighted /= runs
    plt.plot(np.arange(1, episodes + 1), error_ordinary, color='green', label='Ordinary Importance Sampling')
    plt.plot(np.arange(1, episodes + 1), error_weighted, color='red', label='Weighted Importance Sampling')
    plt.ylim(-0.1, 5)
    plt.xlabel('Episodes (log scale)')
    plt.ylabel(f'Mean square error\n(average over {runs} runs)')
    plt.xscale('log')
    plt.legend()
    plt.savefig('../images/figure_5_3.png')
    plt.close()
if __name__ == '__main__':
    figure_5_1()
    figure_5_2()
    figure_5_3()
================================================
FILE: chapter05/infinite_variance.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
ACTION_BACK = 0
ACTION_END = 1
# behavior policy
def behavior_policy():
return np.random.binomial(1, 0.5)
# target policy
def target_policy():
return ACTION_BACK
# play one episode under the behavior policy; returns (reward, list of actions taken)
def play():
# track the action for importance ratio
trajectory = []
while True:
action = behavior_policy()
trajectory.append(action)
if action == ACTION_END:
return 0, trajectory
if np.random.binomial(1, 0.9) == 0:
return 1, trajectory
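# A minimal sketch (not part of the original script) of the importance-sampling
# ratio that figure_5_4 computes inline. The helper name `importance_ratio` and
# its parameters are assumptions made for illustration. The target policy always
# picks ACTION_BACK (probability 1) and the behavior policy picks each action
# with probability 0.5, so the ratio is 0 when the trajectory contains
# ACTION_END and 2 ** len(trajectory) otherwise.

```python
def importance_ratio(trajectory, action_back=0, behavior_prob=0.5):
    """Product of pi(a|s) / b(a|s) over the actions of one episode."""
    rho = 1.0
    for action in trajectory:
        # the target policy is deterministic: it only ever takes action_back
        target_prob = 1.0 if action == action_back else 0.0
        rho *= target_prob / behavior_prob
    return rho
```

# For example, importance_ratio([0, 0, 0]) gives 1 / 0.5 ** 3, and any
# trajectory ending with ACTION_END (1) gives a ratio of 0.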
def figure_5_4():
runs = 10
episodes = 100000
for run in range(runs):
rewards = []
for episode in range(0, episodes):
reward, trajectory = play()
if trajectory[-1] == ACTION_END:
rho = 0
else:
rho = 1.0 / pow(0.5, len(trajectory))
rewards.append(rho * reward)
rewards = np.add.accumulate(rewards)
estimations = np.asarray(rewards) / np.arange(1, episodes + 1)
plt.plot(estimations)
plt.xlabel('Episodes (log scale)')
plt.ylabel('Ordinary Importance Sampling')
plt.xscale('log')
plt.savefig('../images/figure_5_4.png')
plt.close()
if __name__ == '__main__':
figure_5_4()
================================================
FILE: chapter06/cliff_walking.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
# world height
WORLD_HEIGHT = 4
# world width
WORLD_WIDTH = 12
# probability for exploration
EPSILON = 0.1
# step size
ALPHA = 0.5
# gamma for Q-Learning and Expected Sarsa
GAMMA = 1
# all possible actions
ACTION_UP = 0
ACTION_DOWN = 1
ACTION_LEFT = 2
ACTION_RIGHT = 3
ACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]
# start and goal states
START = [3, 0]
GOAL = [3, 11]
def step(state, action):
i, j = state
if action == ACTION_UP:
next_state = [max(i - 1, 0), j]
elif action == ACTION_LEFT:
next_state = [i, max(j - 1, 0)]
elif action == ACTION_RIGHT:
next_state = [i, min(j + 1, WORLD_WIDTH - 1)]
elif action == ACTION_DOWN:
next_state = [min(i + 1, WORLD_HEIGHT - 1), j]
else:
assert False
reward = -1
if (action == ACTION_DOWN and i == 2 and 1 <= j <= 10) or (
action == ACTION_RIGHT and state == START):
reward = -100
next_state = START
return next_state, reward
# reward for each action in each state
# actionRewards = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))
# actionRewards[:, :, :] = -1.0
# actionRewards[2, 1:11, ACTION_DOWN] = -100.0
# actionRewards[3, 0, ACTION_RIGHT] = -100.0
# set up destinations for each action in each state
# actionDestination = []
# for i in range(0, WORLD_HEIGHT):
# actionDestination.append([])
# for j in range(0, WORLD_WIDTH):
# destination = dict()
# destination[ACTION_UP] = [max(i - 1, 0), j]
# destination[ACTION_LEFT] = [i, max(j - 1, 0)]
# destination[ACTION_RIGHT] = [i, min(j + 1, WORLD_WIDTH - 1)]
# if i == 2 and 1 <= j <= 10:
# destination[ACTION_DOWN] = START
# else:
# destination[ACTION_DOWN] = [min(i + 1, WORLD_HEIGHT - 1), j]
# actionDestination[-1].append(destination)
# actionDestination[3][0][ACTION_RIGHT] = START
# choose an action based on epsilon greedy algorithm
def choose_action(state, q_value):
if np.random.binomial(1, EPSILON) == 1:
return np.random.choice(ACTIONS)
else:
values_ = q_value[state[0], state[1], :]
return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
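# A minimal, self-contained sketch of the epsilon-greedy rule used by
# choose_action above, with ties among the best actions broken at random.
# The function name `epsilon_greedy` and the explicit `rng` argument (a
# numpy.random.Generator) are assumptions for illustration; the original
# code uses the global np.random state instead.

```python
import numpy as np


def epsilon_greedy(values, epsilon, rng):
    """Pick an action index from a 1-D array of action values."""
    values = np.asarray(values, dtype=float)
    # explore: with probability epsilon pick uniformly among all actions
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))
    # exploit: pick uniformly among all actions that attain the maximum
    best = np.flatnonzero(values == values.max())
    return int(rng.choice(best))
```

# Usage: epsilon_greedy(q_value[i, j, :], 0.1, np.random.default_rng())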
# an episode with Sarsa
# @q_value: values for state action pair, will be updated
# @expected: if True, will use expected Sarsa algorithm
# @step_size: step size for updating
# @return: total rewards within this episode
def sarsa(q_value, expected=False, step_size=ALPHA):
state = START
action = choose_action(state, q_value)
rewards = 0.0
while state != GOAL:
next_state, reward = step(state, action)
next_action = choose_action(next_state, q_value)
rewards += reward
if not expected:
target = q_value[next_state[0], next_state[1], next_action]
else:
# calculate the expected value of new state
target = 0.0
q_next = q_value[next_state[0], next_state[1], :]
best_actions = np.argwhere(q_next == np.max(q_next))
for action_ in ACTIONS:
if action_ in best_actions:
target += ((1.0 - EPSILON) / len(best_actions) + EPSILON / len(ACTIONS)) * q_value[next_state[0], next_state[1], action_]
else:
target += EPSILON / len(ACTIONS) * q_value[next_state[0], next_state[1], action_]
target *= GAMMA
q_value[state[0], state[1], action] += step_size * (
reward + target - q_value[state[0], state[1], action])
state = next_state
action = next_action
return rewards
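# A minimal sketch of the Expected Sarsa target computed in the loop above,
# written in vectorized form. The function name is hypothetical. Under an
# epsilon-greedy policy every action gets probability epsilon / |A|, and the
# remaining 1 - epsilon is shared equally among the greedy actions.

```python
import numpy as np


def expected_q_under_epsilon_greedy(q_next, epsilon):
    """Expected action value of the next state under an epsilon-greedy policy."""
    q_next = np.asarray(q_next, dtype=float)
    n_actions = len(q_next)
    # boolean mask of the greedy (maximal) actions, ties included
    greedy = q_next == q_next.max()
    # exploration mass spread over all actions
    probs = np.full(n_actions, epsilon / n_actions)
    # greedy mass shared among the tied best actions
    probs[greedy] += (1.0 - epsilon) / greedy.sum()
    return float(probs @ q_next)
```

# With q_next = [1, 2, 3, 3] and epsilon = 0.1 the probabilities are
# [0.025, 0.025, 0.475, 0.475], so the expected value is 2.925.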
# an episode with Q-Learning
# @q_value: values for state action pair, will be updated
# @step_size: step size for updating
# @return: total rewards within this episode
def q_learning(q_value, step_size=ALPHA):
state = START
rewards = 0.0
while state != GOAL:
action = choose_action(state, q_value)
next_state, reward = step(state, action)
rewards += reward
# Q-Learning update
q_value[state[0], state[1], action] += step_size * (
reward + GAMMA * np.max(q_value[next_state[0], next_state[1], :]) -
q_value[state[0], state[1], action])
state = next_state
return rewards
# print optimal policy
def print_optimal_policy(q_value):
optimal_policy = []
for i in range(0, WORLD_HEIGHT):
optimal_policy.append([])
for j in range(0, WORLD_WIDTH):
if [i, j] == GOAL:
optimal_policy[-1].append('G')
continue
bestAction = np.argmax(q_value[i, j, :])
if bestAction == ACTION_UP:
optimal_policy[-1].append('U')
elif bestAction == ACTION_DOWN:
optimal_policy[-1].append('D')
elif bestAction == ACTION_LEFT:
optimal_policy[-1].append('L')
elif bestAction == ACTION_RIGHT:
optimal_policy[-1].append('R')
for row in optimal_policy:
print(row)
# Average over multiple independent runs instead of smoothing a single run with a sliding window:
# a single run did not give a smooth curve,
# although the learned policy converges well within a single run.
# Sarsa converges to the safe path, while Q-Learning converges to the optimal path
def figure_6_4():
# episodes of each run
episodes = 500
# perform 50 independent runs
runs = 50
rewards_sarsa = np.zeros(episodes)
rewards_q_learning = np.zeros(episodes)
for r in tqdm(range(runs)):
q_sarsa = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))
q_q_learning = np.copy(q_sarsa)
for i in range(0, episodes):
# cut off the value by -100 to draw the figure more elegantly
# rewards_sarsa[i] += max(sarsa(q_sarsa), -100)
# rewards_q_learning[i] += max(q_learning(q_q_learning), -100)
rewards_sarsa[i] += sarsa(q_sarsa)
rewards_q_learning[i] += q_learning(q_q_learning)
# average over independent runs
rewards_sarsa /= runs
rewards_q_learning /= runs
# draw reward curves
plt.plot(rewards_sarsa, label='Sarsa')
plt.plot(rewards_q_learning, label='Q-Learning')
plt.xlabel('Episodes')
plt.ylabel('Sum of rewards during episode')
plt.ylim([-100, 0])
plt.legend()
plt.savefig('../images/figure_6_4.png')
plt.close()
# display optimal policy
print('Sarsa Optimal Policy:')
print_optimal_policy(q_sarsa)
print('Q-Learning Optimal Policy:')
print_optimal_policy(q_q_learning)
# Due to the limited computing capacity of my machine, I could not complete this experiment
# with 100,000 episodes and 50,000 runs to get the fully averaged performance.
# However, even with only 1,000 episodes and 10 runs, the curves still look good.
def figure_6_6():
step_sizes = np.arange(0.1, 1.1, 0.1)
episodes = 1000
runs = 10
ASY_SARSA = 0
ASY_EXPECTED_SARSA = 1
ASY_QLEARNING = 2
INT_SARSA = 3
INT_EXPECTED_SARSA = 4
INT_QLEARNING = 5
methods = range(0, 6)
performance = np.zeros((6, len(step_sizes)))
for run in range(runs):
for ind, step_size in tqdm(list(zip(range(0, len(step_sizes)), step_sizes))):
q_sarsa = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))
q_expected_sarsa = np.copy(q_sarsa)
q_q_learning = np.copy(q_sarsa)
for ep in range(episodes):
sarsa_reward = sarsa(q_sarsa, expected=False, step_size=step_size)
expected_sarsa_reward = sarsa(q_expected_sarsa, expected=True, step_size=step_size)
q_learning_reward = q_learning(q_q_learning, step_size=step_size)
performance[ASY_SARSA, ind] += sarsa_reward
performance[ASY_EXPECTED_SARSA, ind] += expected_sarsa_reward
performance[ASY_QLEARNING, ind] += q_learning_reward
if ep < 100:
performance[INT_SARSA, ind] += sarsa_reward
performance[INT_EXPECTED_SARSA, ind] += expected_sarsa_reward
performance[INT_QLEARNING, ind] += q_learning_reward
performance[:3, :] /= episodes * runs
performance[3:, :] /= 100 * runs
labels = ['Asymptotic Sarsa', 'Asymptotic Expected Sarsa', 'Asymptotic Q-Learning',
'Interim Sarsa', 'Interim Expected Sarsa', 'Interim Q-Learning']
for method, label in zip(methods, labels):
plt.plot(step_sizes, performance[method, :], label=label)
plt.xlabel('alpha')
plt.ylabel('reward per episode')
plt.legend()
plt.savefig('../images/figure_6_6.png')
plt.close()
if __name__ == '__main__':
figure_6_4()
figure_6_6()
================================================
FILE: chapter06/maximization_bias.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
import copy
# state A
STATE_A = 0
# state B
STATE_B = 1
# use one terminal state
STATE_TERMINAL = 2
# starts from state A
STATE_START = STATE_A
# possible actions in A
ACTION_A_RIGHT = 0
ACTION_A_LEFT = 1
# probability for exploration
EPSILON = 0.1
# step size
ALPHA = 0.1
# discount for max value
GAMMA = 1.0
# possible actions in B; 10 actions are used here
ACTIONS_B = range(0, 10)
# all possible actions
STATE_ACTIONS = [[ACTION_A_RIGHT, ACTION_A_LEFT], ACTIONS_B]
# state action pair values, if a state is a terminal state, then the value is always 0
INITIAL_Q = [np.zeros(2), np.zeros(len(ACTIONS_B)), np.zeros(1)]
# set up destination for each state and each action
TRANSITION = [[STATE_TERMINAL, STATE_B], [STATE_TERMINAL] * len(ACTIONS_B)]
# choose an action based on epsilon greedy algorithm
def choose_action(state, q_value):
if np.random.binomial(1, EPSILON) == 1:
return np.random.choice(STATE_ACTIONS[state])
else:
values_ = q_value[state]
return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
# take @action in @state, return the reward
def take_action(state, action):
if state == STATE_A:
return 0
return np.random.normal(-0.1, 1)
# if two state-action value arrays are given, use Double Q-Learning;
# otherwise use ordinary Q-Learning
def q_learning(q1, q2=None):
state = STATE_START
# count how many times the left action is taken in state A
left_count = 0
while state != STATE_TERMINAL:
if q2 is None:
action = choose_action(state, q1)
else:
# derive an action from the sum of Q1 and Q2
action = choose_action(state, [item1 + item2 for item1, item2 in zip(q1, q2)])
if state == STATE_A and action == ACTION_A_LEFT:
left_count += 1
reward = take_action(state, action)
next_state = TRANSITION[state][action]
if q2 is None:
active_q = q1
target = np.max(active_q[next_state])
else:
if np.random.binomial(1, 0.5) == 1:
active_q = q1
target_q = q2
else:
active_q = q2
target_q = q1
best_action = np.random.choice([action_ for action_, value_ in enumerate(active_q[next_state]) if value_ == np.max(active_q[next_state])])
target = target_q[next_state][best_action]
# Q-Learning update
active_q[state][action] += ALPHA * (
reward + GAMMA * target - active_q[state][action])
state = next_state
return left_count
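# A minimal sketch of a single Double Q-Learning update, isolating the idea
# in q_learning above: one table selects the best next action and the other
# table evaluates it. The function name and the `update_first` flag (which
# replaces the coin flip so the example is deterministic) are assumptions.
# Ties are broken by np.argmax here, not at random as in the original code.

```python
import numpy as np


def double_q_update(q1, q2, state, action, reward, next_state,
                    update_first, alpha=0.1, gamma=1.0):
    """Update one of the two tables in place and return the new entry."""
    # the active table is updated and selects the next action;
    # the other table supplies the value of that action
    active, other = (q1, q2) if update_first else (q2, q1)
    best_action = int(np.argmax(active[next_state]))
    target = other[next_state][best_action]
    active[state][action] += alpha * (reward + gamma * target - active[state][action])
    return active[state][action]
```

# If q1 rates an action in B at 5.0 while q2 rates the same action at -1.0,
# the update of q1 uses -1.0 as its target, whereas ordinary Q-Learning on q1
# alone would have used the overestimated 5.0.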
# Figure 6.7. 1,000 runs may be enough; the number of actions in state B also affects the curves
def figure_6_7():
# each independent run has 300 episodes
episodes = 300
runs = 1000
left_counts_q = np.zeros((runs, episodes))
left_counts_double_q = np.zeros((runs, episodes))
for run in tqdm(range(runs)):
q = copy.deepcopy(INITIAL_Q)
q1 = copy.deepcopy(INITIAL_Q)
q2 = copy.deepcopy(INITIAL_Q)
for ep in range(0, episodes):
left_counts_q[run, ep] = q_learning(q)
left_counts_double_q[run, ep] = q_learning(q1, q2)
left_counts_q = left_counts_q.mean(axis=0)
left_counts_double_q = left_counts_double_q.mean(axis=0)
plt.plot(left_counts_q, label='Q-Learning')
plt.plot(left_counts_double_q, label='Double Q-Learning')
plt.plot(np.ones(episodes) * 0.05, label='Optimal')
plt.xlabel('episodes')
plt.ylabel('Fraction of left actions from A')
plt.legend()
plt.savefig('../images/figure_6_7.png')
plt.close()
if __name__ == '__main__':
figure_6_7()
================================================
FILE: chapter06/random_walk.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
# 0 is the left terminal state
# 6 is the right terminal state
# 1 ... 5 represents A ... E
VALUES = np.zeros(7)
VALUES[1:6] = 0.5
# For convenience, we assume all rewards are 0
# and the left terminal state has value 0, the right terminal state has value 1
# This trick has been used in Gambler's Problem
VALUES[6] = 1
# set up true state values
TRUE_VALUE = np.zeros(7)
TRUE_VALUE[1:6] = np.arange(1, 6) / 6.0
TRUE_VALUE[6] = 1
ACTION_LEFT = 0
ACTION_RIGHT = 1
# @values: current state values, updated in place if @batch is False
# @alpha: step size
# @batch: if True, only generate the episode and leave @values unchanged
def temporal_difference(values, alpha=0.1, batch=False):
state = 3
trajectory = [state]
rewards = [0]
while True:
old_state = state
if np.random.binomial(1, 0.5) == ACTION_LEFT:
state -= 1
else:
state += 1
# Assume all rewards are 0
reward = 0
trajectory.append(state)
# TD update
if not batch:
values[old_state] += alpha * (reward + values[state] - values[old_state])
if state == 6 or state == 0:
break
rewards.append(reward)
return trajectory, rewards
# @values: current state values, updated in place if @batch is False
# @alpha: step size
# @batch: if True, only generate the episode and leave @values unchanged
def monte_carlo(values, alpha=0.1, batch=False):
state = 3
trajectory = [state]
# if the episode ends in the left terminal state, all returns are 0
# if the episode ends in the right terminal state, all returns are 1
while True:
if np.random.binomial(1, 0.5) == ACTION_LEFT:
state -= 1
else:
state += 1
trajectory.append(state)
if state == 6:
returns = 1.0
break
elif state == 0:
returns = 0.0
break
if not batch:
for state_ in trajectory[:-1]:
# MC update
values[state_] += alpha * (returns - values[state_])
return trajectory, [returns] * (len(trajectory) - 1)
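# A minimal sketch contrasting the TD(0) and constant-alpha MC updates above
# on one fixed trajectory, without any random sampling. The function names
# are hypothetical. As in this file, all rewards are 0 and the right terminal
# state carries value 1, so the MC return is 1 only if the walk ends on the right.

```python
import numpy as np


def td0_pass(values, trajectory, alpha=0.1):
    """Apply the TD(0) update along one trajectory; returns a new array."""
    values = np.array(values, dtype=float)
    for s, s_next in zip(trajectory[:-1], trajectory[1:]):
        # reward is always 0; bootstrap from the current estimate of s_next
        values[s] += alpha * (0.0 + values[s_next] - values[s])
    return values


def mc_pass(values, trajectory, alpha=0.1):
    """Apply the constant-alpha MC update along one trajectory."""
    values = np.array(values, dtype=float)
    # the return is 1 if the walk ended in the right terminal state, else 0
    final_return = 1.0 if trajectory[-1] == len(values) - 1 else 0.0
    for s in trajectory[:-1]:
        values[s] += alpha * (final_return - values[s])
    return values
```

# For the walk C -> D -> E -> right terminal (states 3, 4, 5, 6) starting
# from the initial estimates, TD(0) changes only state E, while MC moves
# C, D and E toward the observed return.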
# Example 6.2 left
def compute_state_value():
episodes = [0, 1, 10, 100]
current_values = np.copy(VALUES)
plt.figure(1)
for i in range(episodes[-1] + 1):
if i in episodes:
plt.plot(("A", "B", "C", "D", "E"), current_values[1:6], label=str(i) + ' episodes')
temporal_difference(current_values)
plt.plot(("A", "B", "C", "D", "E"), TRUE_VALUE[1:6], label='true values')
plt.xlabel('State')
plt.ylabel('Estimated Value')
plt.legend()
# Example 6.2 right
def rms_error():
# Same alpha value can appear in both arrays
td_alphas = [0.15, 0.1, 0.05]
mc_alphas = [0.01, 0.02, 0.03, 0.04]
episodes = 100 + 1
runs = 100
for i, alpha in enumerate(td_alphas + mc_alphas):
total_errors = np.zeros(episodes)
if i < len(td_alphas):
method = 'TD'
linestyle = 'solid'
else:
method = 'MC'
linestyle = 'dashdot'
for r in tqdm(range(runs)):
errors = []
current_values = np.copy(VALUES)
for _ in range(0, episodes):
errors.append(np.sqrt(np.sum(np.power(TRUE_VALUE - current_values, 2)) / 5.0))
if method == 'TD':
temporal_difference(current_values, alpha=alpha)
else:
monte_carlo(current_values, alpha=alpha)
total_errors += np.asarray(errors)
total_errors /= runs
plt.plot(total_errors, linestyle=linestyle, label=method + ', $\\alpha$ = %.02f' % (alpha))
plt.xlabel('Walks/Episodes')
plt.ylabel('Empirical RMS error, averaged over states')
plt.legend()
# Figure 6.2
# @method: 'TD' or 'MC'
def batch_updating(method, episodes, alpha=0.001):
# perform 100 independent runs
runs = 100
total_errors = np.zeros(episodes)
for r in tqdm(range(0, runs)):
current_values = np.copy(VALUES)
current_values[1:6] = -1
errors = []
# track shown trajectories and reward/return sequences
trajectories = []
rewards = []
for ep in range(episodes):
if method == 'TD':
trajectory_, rewards_ = temporal_difference(current_values, batch=True)
else:
trajectory_, rewards_ = monte_carlo(current_values, batch=True)
trajectories.append(trajectory_)
rewards.append(rewards_)
while True:
# keep feeding our algorithm with trajectories seen so far until state value function converges
updates = np.zeros(7)
for trajectory_, rewards_ in zip(trajectories, rewards):
for i in range(0, len(trajectory_) - 1):
if method == 'TD':
updates[trajectory_[i]] += rewards_[i] + current_values[trajectory_[i + 1]] - current_values[trajectory_[i]]
else:
updates[trajectory_[i]] += rewards_[i] - current_values[trajectory_[i]]
updates *= alpha
if np.sum(np.abs(updates)) < 1e-3:
break
# perform batch updating
current_values += updates
# calculate rms error
errors.append(np.sqrt(np.sum(np.power(current_values - TRUE_VALUE, 2)) / 5.0))
total_errors += np.asarray(errors)
total_errors /= runs
return total_errors
def example_6_2():
plt.figure(figsize=(10, 20))
plt.subplot(2, 1, 1)
compute_state_value()
plt.subplot(2, 1, 2)
rms_error()
plt.tight_layout()
plt.savefig('../images/example_6_2.png')
plt.close()
def figure_6_2():
episodes = 100 + 1
td_errors = batch_updating('TD', episodes)
mc_errors = batch_updating('MC', episodes)
plt.plot(td_errors, label='TD')
plt.plot(mc_errors, label='MC')
plt.title("Batch Training")
plt.xlabel('Walks/Episodes')
plt.ylabel('RMS error, averaged over states')
plt.xlim(0, 100)
plt.ylim(0, 0.25)
plt.legend()
plt.savefig('../images/figure_6_2.png')
plt.close()
if __name__ == '__main__':
example_6_2()
figure_6_2()
================================================
FILE: chapter06/windy_grid_world.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
# world height
WORLD_HEIGHT = 7
# world width
WORLD_WIDTH = 10
# wind strength for each column
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
# possible actions
ACTION_UP = 0
ACTION_DOWN = 1
ACTION_LEFT = 2
ACTION_RIGHT = 3
# probability for exploration
EPSILON = 0.1
# Sarsa step size
ALPHA = 0.5
# reward for each step
REWARD = -1.0
START = [3, 0]
GOAL = [3, 7]
ACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]
def step(state, action):
i, j = state
if action == ACTION_UP:
return [max(i - 1 - WIND[j], 0), j]
elif action == ACTION_DOWN:
return [max(min(i + 1 - WIND[j], WORLD_HEIGHT - 1), 0), j]
elif action == ACTION_LEFT:
return [max(i - WIND[j], 0), max(j - 1, 0)]
elif action == ACTION_RIGHT:
return [max(i - WIND[j], 0), min(j + 1, WORLD_WIDTH - 1)]
else:
assert False
# play for an episode
def episode(q_value):
# track the total time steps in this episode
time = 0
# initialize state
state = START
# choose an action based on epsilon-greedy algorithm
if np.random.binomial(1, EPSILON) == 1:
action = np.random.choice(ACTIONS)
else:
values_ = q_value[state[0], state[1], :]
action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
# keep going until get to the goal state
while state != GOAL:
next_state = step(state, action)
if np.random.binomial(1, EPSILON) == 1:
next_action = np.random.choice(ACTIONS)
else:
values_ = q_value[next_state[0], next_state[1], :]
next_action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
# Sarsa update
q_value[state[0], state[1], action] += \
ALPHA * (REWARD + q_value[next_state[0], next_state[1], next_action] -
q_value[state[0], state[1], action])
state = next_state
action = next_action
time += 1
return time
def figure_6_3():
q_value = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))
episode_limit = 500
steps = []
ep = 0
while ep < episode_limit:
steps.append(episode(q_value))
# time = episode(q_value)
# episodes.extend([ep] * time)
ep += 1
steps = np.add.accumulate(steps)
plt.plot(steps, np.arange(1, len(steps) + 1))
plt.xlabel('Time steps')
plt.ylabel('Episodes')
plt.savefig('../images/figure_6_3.png')
plt.close()
# display the optimal policy
optimal_policy = []
for i in range(0, WORLD_HEIGHT):
optimal_policy.append([])
for j in range(0, WORLD_WIDTH):
if [i, j] == GOAL:
optimal_policy[-1].append('G')
continue
bestAction = np.argmax(q_value[i, j, :])
if bestAction == ACTION_UP:
optimal_policy[-1].append('U')
elif bestAction == ACTION_DOWN:
optimal_policy[-1].append('D')
elif bestAction == ACTION_LEFT:
optimal_policy[-1].append('L')
elif bestAction == ACTION_RIGHT:
optimal_policy[-1].append('R')
print('Optimal policy is:')
for row in optimal_policy:
print(row)
print('Wind strength for each column:\n{}'.format([str(w) for w in WIND]))
if __name__ == '__main__':
figure_6_3()
================================================
FILE: chapter07/random_walk.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
# all states
N_STATES = 19
# discount
GAMMA = 1
# all states but terminal states
STATES = np.arange(1, N_STATES + 1)
# start from the middle state
START_STATE = 10
# two terminal states
# an action leading to the left terminal state has reward -1
# an action leading to the right terminal state has reward 1
END_STATES = [0, N_STATES + 1]
# true state value from bellman equation
TRUE_VALUE = np.arange(-20, 22, 2) / 20.0
TRUE_VALUE[0] = TRUE_VALUE[-1] = 0
# n-step TD method
# @value: values for each state, will be updated
# @n: number of steps
# @alpha: step size
def temporal_difference(value, n, alpha):
# initial starting state
state = START_STATE
# arrays to store states and rewards for an episode
# space isn't a major consideration, so I didn't use the mod trick
states = [state]
rewards = [0]
# track the time
time = 0
# the length of this episode
T = float('inf')
while True:
# go to next time step
time += 1
if time < T:
# choose an action randomly
if np.random.binomial(1, 0.5) == 1:
next_state = state + 1
else:
next_state = state - 1
if next_state == 0:
reward = -1
elif next_state == N_STATES + 1:
reward = 1
else:
reward = 0
# store new state and new reward
states.append(next_state)
rewards.append(reward)
if next_state in END_STATES:
T = time
# get the time of the state to update
update_time = time - n
if update_time >= 0:
returns = 0.0
# calculate corresponding rewards
for t in range(update_time + 1, min(T, update_time + n) + 1):
returns += pow(GAMMA, t - update_time - 1) * rewards[t]
# add state value to the return
if update_time + n <= T:
returns += pow(GAMMA, n) * value[states[(update_time + n)]]
state_to_update = states[update_time]
# update the state value
if state_to_update not in END_STATES:
value[state_to_update] += alpha * (returns - value[state_to_update])
if update_time == T - 1:
break
state = next_state
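# A minimal sketch of the n-step return computed inside temporal_difference
# above, pulled out as a standalone function. The name `n_step_return` and
# the argument names are assumptions. rewards[t] is the reward received on
# entering states[t], tau is the time of the state being updated and T is
# the episode length (float('inf') while the episode is still running).
# This sketch bootstraps only when tau + n < T; the code above uses <=,
# which gives the same result there because terminal values stay 0.

```python
def n_step_return(rewards, states, value, tau, n, T, gamma=1.0):
    """Discounted sum of up to n rewards plus a bootstrapped state value."""
    G = 0.0
    last = min(tau + n, T)
    for t in range(tau + 1, last + 1):
        G += gamma ** (t - tau - 1) * rewards[t]
    # bootstrap from the value estimate if the episode has not ended by tau + n
    if tau + n < T:
        G += gamma ** n * value[states[tau + n]]
    return G
```

# With rewards [0, 1, 2, 3], states [10, 11, 12, 13], value[s] = s,
# gamma = 0.5, tau = 0 and n = 2, the return is 1 + 0.5 * 2 + 0.25 * 12.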
# Figure 7.2, it will take quite a while
def figure7_2():
# all possible steps
steps = np.power(2, np.arange(0, 10))
# all possible alphas
alphas = np.arange(0, 1.1, 0.1)
# each run has 10 episodes
episodes = 10
# perform 100 independent runs
runs = 100
# track the errors for each (step, alpha) combination
errors = np.zeros((len(steps), len(alphas)))
for run in tqdm(range(0, runs)):
for step_ind, step in enumerate(steps):
for alpha_ind, alpha in enumerate(alphas):
# print('run:', run, 'step:', step, 'alpha:', alpha)
value = np.zeros(N_STATES + 2)
for ep in range(0, episodes):
temporal_difference(value, step, alpha)
# calculate the RMS error
errors[step_ind, alpha_ind] += np.sqrt(np.sum(np.power(value - TRUE_VALUE, 2)) / N_STATES)
# take average
errors /= episodes * runs
for i in range(0, len(steps)):
plt.plot(alphas, errors[i, :], label='n = %d' % (steps[i]))
plt.xlabel('alpha')
plt.ylabel('RMS error')
plt.ylim([0.25, 0.55])
plt.legend()
plt.savefig('../images/figure_7_2.png')
plt.close()
if __name__ == '__main__':
figure7_2()
================================================
FILE: chapter08/expectation_vs_sample.py
================================================
#######################################################################
# Copyright (C) #
# 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
# for figure 8.7, run a simulation of 2 * @b steps
def b_steps(b):
# set the value of the next b states
# it is not clear how to set this
distribution = np.random.randn(b)
# true value of the current state
true_v = np.mean(distribution)
samples = []
errors = []
# sample 2b steps
for t in range(2 * b):
v = np.random.choice(distribution)
samples.append(v)
errors.append(np.abs(np.mean(samples) - true_v))
return errors
def figure_8_7():
runs = 100
branch = [2, 10, 100, 1000]
for b in branch:
errors = np.zeros((runs, 2 * b))
for r in tqdm(np.arange(runs)):
errors[r] = b_steps(b)
errors = errors.mean(axis=0)
x_axis = (np.arange(len(errors)) + 1) / float(b)
plt.plot(x_axis, errors, label='b = %d' % (b))
plt.xlabel('number of computations')
plt.xticks([0, 1.0, 2.0], ['0', 'b', '2b'])
plt.ylabel('RMS error')
plt.legend()
plt.savefig('../images/figure_8_7.png')
plt.close()
if __name__ == '__main__':
figure_8_7()
================================================
FILE: chapter08/maze.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
import heapq
from copy import deepcopy
class PriorityQueue:
def __init__(self):
self.pq = []
self.entry_finder = {}
self.REMOVED = '<removed-task>'
self.counter = 0
def add_item(self, item, priority=0):
if item in self.entry_finder:
self.remove_item(item)
entry = [priority, self.counter, item]
self.counter += 1
self.entry_finder[item] = entry
heapq.heappush(self.pq, entry)
def remove_item(self, item):
entry = self.entry_finder.pop(item)
entry[-1] = self.REMOVED
def pop_item(self):
while self.pq:
priority, count, item = heapq.heappop(self.pq)
if item is not self.REMOVED:
del self.entry_finder[item]
return item, priority
raise KeyError('pop from an empty priority queue')
def empty(self):
return not self.entry_finder
# A wrapper class for a maze, containing all the information about the maze.
# It is initialized to the Dyna maze by default, but it can easily be adapted
# to other mazes
class Maze:
def __init__(self):
# maze width
self.WORLD_WIDTH = 9
# maze height
self.WORLD_HEIGHT = 6
# all possible actions
self.ACTION_UP = 0
self.ACTION_DOWN = 1
self.ACTION_LEFT = 2
self.ACTION_RIGHT = 3
self.actions = [self.ACTION_UP, self.ACTION_DOWN, self.ACTION_LEFT, self.ACTION_RIGHT]
# start state
self.START_STATE = [2, 0]
# goal state
self.GOAL_STATES = [[0, 8]]
# all obstacles
self.obstacles = [[1, 2], [2, 2], [3, 2], [0, 7], [1, 7], [2, 7], [4, 5]]
self.old_obstacles = None
self.new_obstacles = None
# time to change obstacles
self.obstacle_switch_time = None
# initial state action pair values
# self.stateActionValues = np.zeros((self.WORLD_HEIGHT, self.WORLD_WIDTH, len(self.actions)))
# the size of q value
self.q_size = (self.WORLD_HEIGHT, self.WORLD_WIDTH, len(self.actions))
# max steps
self.max_steps = float('inf')
# track the resolution for this maze
self.resolution = 1
# extend a state to a higher resolution maze
# @state: state in lower resolution maze
# @factor: extension factor, one state will become factor^2 states after extension
def extend_state(self, state, factor):
new_state = [state[0] * factor, state[1] * factor]
new_states = []
for i in range(0, factor):
for j in range(0, factor):
new_states.append([new_state[0] + i, new_state[1] + j])
return new_states
# extend the maze to a higher resolution
# one state in the original maze becomes @factor^2 states in the returned maze
def extend_maze(self, factor):
new_maze = Maze()
new_maze.WORLD_WIDTH = self.WORLD_WIDTH * factor
new_maze.WORLD_HEIGHT = self.WORLD_HEIGHT * factor
new_maze.START_STATE = [self.START_STATE[0] * factor, self.START_STATE[1] * factor]
new_maze.GOAL_STATES = self.extend_state(self.GOAL_STATES[0], factor)
new_maze.obstacles = []
for state in self.obstacles:
new_maze.obstacles.extend(self.extend_state(state, factor))
new_maze.q_size = (new_maze.WORLD_HEIGHT, new_maze.WORLD_WIDTH, len(new_maze.actions))
# new_maze.stateActionValues = np.zeros((new_maze.WORLD_HEIGHT, new_maze.WORLD_WIDTH, len(new_maze.actions)))
new_maze.resolution = factor
return new_maze
# take @action in @state
# @return: [new state, reward]
def step(self, state, action):
x, y = state
if action == self.ACTION_UP:
x = max(x - 1, 0)
elif action == self.ACTION_DOWN:
x = min(x + 1, self.WORLD_HEIGHT - 1)
elif action == self.ACTION_LEFT:
y = max(y - 1, 0)
elif action == self.ACTION_RIGHT:
y = min(y + 1, self.WORLD_WIDTH - 1)
if [x, y] in self.obstacles:
x, y = state
if [x, y] in self.GOAL_STATES:
reward = 1.0
else:
reward = 0.0
return [x, y], reward
# a wrapper class for parameters of dyna algorithms
class DynaParams:
def __init__(self):
# discount
self.gamma = 0.95
# probability for exploration
self.epsilon = 0.1
# step size
self.alpha = 0.1
# weight for elapsed time
self.time_weight = 0
# number of planning steps per real step
self.planning_steps = 5
# average over several independent runs
self.runs = 10
# algorithm names
self.methods = ['Dyna-Q', 'Dyna-Q+']
# threshold for priority queue
self.theta = 0
# choose an action based on epsilon-greedy algorithm
def choose_action(state, q_value, maze, dyna_params):
if np.random.binomial(1, dyna_params.epsilon) == 1:
return np.random.choice(maze.actions)
else:
values = q_value[state[0], state[1], :]
return np.random.choice([action for action, value in enumerate(values) if value == np.max(values)])
# Trivial model for planning in Dyna-Q
class TrivialModel:
# @rand: an instance of np.random.RandomState for sampling
def __init__(self, rand=np.random):
self.model = dict()
self.rand = rand
# feed the model with previous experience
def feed(self, state, action, next_state, reward):
state = deepcopy(state)
next_state = deepcopy(next_state)
if tuple(state) not in self.model.keys():
self.model[tuple(state)] = dict()
self.model[tuple(state)][action] = [list(next_state), reward]
# randomly sample from previous experience
def sample(self):
state_index = self.rand.choice(range(len(self.model.keys())))
state = list(self.model)[state_index]
action_index = self.rand.choice(range(len(self.model[state].keys())))
action = list(self.model[state])[action_index]
next_state, reward = self.model[state][action]
state = deepcopy(state)
next_state = deepcopy(next_state)
return list(state), action, list(next_state), reward
# Time-based model for planning in Dyna-Q+
class TimeModel:
# @maze: the maze instance (strictly speaking, a model should not have direct access to the environment)
# @time_weight: also called kappa, the weight for elapsed time in the sampled reward; it needs to be small
# @rand: an instance of np.random.RandomState for sampling
def __init__(self, maze, time_weight=1e-4, rand=np.random):
self.rand = rand
self.model = dict()
# track the total time
self.time = 0
self.time_weight = time_weight
self.maze = maze
# feed the model with previous experience
def feed(self, state, action, next_state, reward):
state = deepcopy(state)
next_state = deepcopy(next_state)
self.time += 1
if tuple(state) not in self.model.keys():
self.model[tuple(state)] = dict()
# Actions that had never been tried before from a state were allowed to be considered in the planning step
for action_ in self.maze.actions:
if action_ != action:
# Such actions would lead back to the same state with a reward of zero
# Notice that the minimum time stamp is 1 instead of 0
self.model[tuple(state)][action_] = [list(state), 0, 1]
self.model[tuple(state)][action] = [list(next_state), reward, self.time]
# randomly sample from previous experience
def sample(self):
state_index = self.rand.choice(range(len(self.model.keys())))
state = list(self.model)[state_index]
action_index = self.rand.choice(range(len(self.model[state].keys())))
action = list(self.model[state])[action_index]
next_state, reward, time = self.model[state][action]
# adjust reward with elapsed time since last visit
reward += self.time_weight * np.sqrt(self.time - time)
state = deepcopy(state)
next_state = deepcopy(next_state)
return list(state), action, list(next_state), reward
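# Illustrative sketch (not part of the original code): the exploration bonus that
# TimeModel.sample adds to a modelled reward, r + kappa * sqrt(tau), where tau is the
# number of time steps since the state-action pair was last tried.
# The name `bonus_reward_sketch` and its argument names are hypothetical.
import math


def bonus_reward_sketch(reward, kappa, current_time, last_visit_time):
    # elapsed time since the last real visit
    tau = current_time - last_visit_time
    return reward + kappa * math.sqrt(tau)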
# Model containing a priority queue for Prioritized Sweeping
class PriorityModel(TrivialModel):
def __init__(self, rand=np.random):
TrivialModel.__init__(self, rand)
# maintain a priority queue
self.priority_queue = PriorityQueue()
# track predecessors for every state
self.predecessors = dict()
# add a @state-@action pair into the priority queue with priority @priority
def insert(self, priority, state, action):
# note the priority queue is a minimum heap, so we use -priority
self.priority_queue.add_item((tuple(state), action), -priority)
# @return: whether the priority queue is empty
def empty(self):
return self.priority_queue.empty()
# get the first item in the priority queue
def sample(self):
(state, action), priority = self.priority_queue.pop_item()
next_state, reward = self.model[state][action]
state = deepcopy(state)
next_state = deepcopy(next_state)
return -priority, list(state), action, list(next_state), reward
# feed the model with previous experience
def feed(self, state, action, next_state, reward):
state = deepcopy(state)
next_state = deepcopy(next_state)
TrivialModel.feed(self, state, action, next_state, reward)
if tuple(next_state) not in self.predecessors.keys():
self.predecessors[tuple(next_state)] = set()
self.predecessors[tuple(next_state)].add((tuple(state), action))
# get all seen predecessors of a state @state
def predecessor(self, state):
if tuple(state) not in self.predecessors.keys():
return []
predecessors = []
for state_pre, action_pre in list(self.predecessors[tuple(state)]):
predecessors.append([list(state_pre), action_pre, self.model[state_pre][action_pre][1]])
return predecessors
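# Illustrative sketch (not part of the original code): why PriorityModel.insert stores
# -priority. Python's heapq is a min-heap, so negating priorities makes the largest
# priority come out first. This sketch uses heapq directly and is only an assumption
# about how a max-priority pop can be built; it is not the PriorityQueue class used above.
# The name `pop_highest_priority_sketch` is hypothetical.
import heapq


def pop_highest_priority_sketch(items):
    # items: list of (item, priority) pairs
    heap = [(-priority, item) for item, priority in items]
    heapq.heapify(heap)
    neg_priority, item = heapq.heappop(heap)
    # undo the negation before returning
    return item, -neg_priority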
# play for an episode for Dyna-Q algorithm
# @q_value: state action pair values, will be updated
# @model: model instance for planning
# @maze: a maze instance containing all information about the environment
# @dyna_params: several params for the algorithm
def dyna_q(q_value, model, maze, dyna_params):
state = maze.START_STATE
steps = 0
while state not in maze.GOAL_STATES:
# track the steps
steps += 1
# get action
action = choose_action(state, q_value, maze, dyna_params)
# take action
next_state, reward = maze.step(state, action)
# Q-Learning update
q_value[state[0], state[1], action] += \
dyna_params.alpha * (reward + dyna_params.gamma * np.max(q_value[next_state[0], next_state[1], :]) -
q_value[state[0], state[1], action])
# feed the model with experience
model.feed(state, action, next_state, reward)
# sample experience from the model
for t in range(0, dyna_params.planning_steps):
state_, action_, next_state_, reward_ = model.sample()
q_value[state_[0], state_[1], action_] += \
dyna_params.alpha * (reward_ + dyna_params.gamma * np.max(q_value[next_state_[0], next_state_[1], :]) -
q_value[state_[0], state_[1], action_])
state = next_state
# check whether it has exceeded the step limit
if steps > maze.max_steps:
break
return steps
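# Illustrative sketch (not part of the original code): the one-step Q-learning backup
# that dyna_q applies to both real and simulated transitions, isolated as a function.
# The name `q_learning_backup_sketch` is hypothetical.
import numpy as np


def q_learning_backup_sketch(q_value, state, action, next_state, reward, alpha, gamma):
    # bootstrapped target: reward plus discounted value of the best next action
    target = reward + gamma * np.max(q_value[next_state[0], next_state[1], :])
    # move the current estimate towards the target by step size alpha
    q_value[state[0], state[1], action] += alpha * (target - q_value[state[0], state[1], action])
    return q_value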
# play for an episode for prioritized sweeping algorithm
# @q_value: state action pair values, will be updated
# @model: model instance for planning
# @maze: a maze instance containing all information about the environment
# @dyna_params: several params for the algorithm
# @return: # of backups during this episode
def prioritized_sweeping(q_value, model, maze, dyna_params):
state = maze.START_STATE
# track the steps in this episode
steps = 0
# track the backups in planning phase
backups = 0
while state not in maze.GOAL_STATES:
steps += 1
# get action
action = choose_action(state, q_value, maze, dyna_params)
# take action
next_state, reward = maze.step(state, action)
# feed the model with experience
model.feed(state, action, next_state, reward)
# get the priority for current state action pair
priority = np.abs(reward + dyna_params.gamma * np.max(q_value[next_state[0], next_state[1], :]) -
q_value[state[0], state[1], action])
if priority > dyna_params.theta:
model.insert(priority, state, action)
# start planning
planning_step = 0
# plan for a fixed number of steps,
# although planning until the priority queue is empty would converge much faster
while planning_step < dyna_params.planning_steps and not model.empty():
# get a sample with highest priority from the model
priority, state_, action_, next_state_, reward_ = model.sample()
# update the state action value for the sample
delta = reward_ + dyna_params.gamma * np.max(q_value[next_state_[0], next_state_[1], :]) - \
q_value[state_[0], state_[1], action_]
q_value[state_[0], state_[1], action_] += dyna_params.alpha * delta
# deal with all the predecessors of the sample state
for state_pre, action_pre, reward_pre in model.predecessor(state_):
priority = np.abs(reward_pre + dyna_params.gamma * np.max(q_value[state_[0], state_[1], :]) -
q_value[state_pre[0], state_pre[1], action_pre])
if priority > dyna_params.theta:
model.insert(priority, state_pre, action_pre)
planning_step += 1
state = next_state
# update the # of backups
backups += planning_step + 1
return backups
# Figure 8.2, DynaMaze, use 10 runs instead of 30 runs
def figure_8_2():
# set up an instance for DynaMaze
dyna_maze = Maze()
dyna_params = DynaParams()
runs = 10
episodes = 50
planning_steps = [0, 5, 50]
steps = np.zeros((len(planning_steps), episodes))
for run in tqdm(range(runs)):
for i, planning_step in enumerate(planning_steps):
dyna_params.planning_steps = planning_step
q_value = np.zeros(dyna_maze.q_size)
# generate an instance of Dyna-Q model
model = TrivialModel()
for ep in range(episodes):
# print('run:', run, 'planning step:', planning_step, 'episode:', ep)
steps[i, ep] += dyna_q(q_value, model, dyna_maze, dyna_params)
# averaging over runs
steps /= runs
for i in range(len(planning_steps)):
plt.plot(steps[i, :], label='%d planning steps' % (planning_steps[i]))
plt.xlabel('episodes')
plt.ylabel('steps per episode')
plt.legend()
plt.savefig('../images/figure_8_2.png')
plt.close()
# wrapper function for changing maze
# @maze: a maze instance
# @dyna_params: several parameters for dyna algorithms
def changing_maze(maze, dyna_params):
# set up max steps
max_steps = maze.max_steps
# track the cumulative rewards
rewards = np.zeros((dyna_params.runs, 2, max_steps))
for run in tqdm(range(dyna_params.runs)):
# set up models
models = [TrivialModel(), TimeModel(maze, time_weight=dyna_params.time_weight)]
# initialize state action values
q_values = [np.zeros(maze.q_size), np.zeros(maze.q_size)]
for i in range(len(dyna_params.methods)):
# print('run:', run, dyna_params.methods[i])
# set old obstacles for the maze
maze.obstacles = maze.old_obstacles
steps = 0
last_steps = steps
while steps < max_steps:
# play for an episode
steps += dyna_q(q_values[i], models[i], maze, dyna_params)
# update cumulative rewards
rewards[run, i, last_steps: steps] = rewards[run, i, last_steps]
rewards[run, i, min(steps, max_steps - 1)] = rewards[run, i, last_steps] + 1
last_steps = steps
if steps > maze.obstacle_switch_time:
# change the obstacles
maze.obstacles = maze.new_obstacles
# averaging over runs
rewards = rewards.mean(axis=0)
return rewards
# Figure 8.4, BlockingMaze
def figure_8_4():
# set up a blocking maze instance
blocking_maze = Maze()
blocking_maze.START_STATE = [5, 3]
blocking_maze.GOAL_STATES = [[0, 8]]
blocking_maze.old_obstacles = [[3, i] for i in range(0, 8)]
# new obstacles will block the optimal path
blocking_maze.new_obstacles = [[3, i] for i in range(1, 9)]
# step limit
blocking_maze.max_steps = 3000
# obstacles change once the step count passes 1000
# the switch is only checked at the end of an episode, so the exact switching step varies between runs
# however, 1000 steps is long enough for both algorithms to converge,
# so the resulting difference should be very small
blocking_maze.obstacle_switch_time = 1000
# set up parameters
dyna_params = DynaParams()
dyna_params.alpha = 1.0
dyna_params.planning_steps = 10
dyna_params.runs = 20
# kappa must be small, as the reward for getting the goal is only 1
dyna_params.time_weight = 1e-4
# play
rewards = changing_maze(blocking_maze, dyna_params)
for i in range(len(dyna_params.methods)):
plt.plot(rewards[i, :], label=dyna_params.methods[i])
plt.xlabel('time steps')
plt.ylabel('cumulative reward')
plt.legend()
plt.savefig('../images/figure_8_4.png')
plt.close()
# Figure 8.5, ShortcutMaze
def figure_8_5():
# set up a shortcut maze instance
shortcut_maze = Maze()
shortcut_maze.START_STATE = [5, 3]
shortcut_maze.GOAL_STATES = [[0, 8]]
shortcut_maze.old_obstacles = [[3, i] for i in range(1, 9)]
# the new obstacles open up a shorter path
shortcut_maze.new_obstacles = [[3, i] for i in range(1, 8)]
# step limit
shortcut_maze.max_steps = 6000
# obstacles change once the step count passes 3000
# the switch is only checked at the end of an episode, so the exact switching step varies between runs
# however, 3000 steps is long enough for both algorithms to converge,
# so the resulting difference should be very small
shortcut_maze.obstacle_switch_time = 3000
# set up parameters
dyna_params = DynaParams()
# 50-step planning
dyna_params.planning_steps = 50
dyna_params.runs = 5
dyna_params.time_weight = 1e-3
dyna_params.alpha = 1.0
# play
rewards = changing_maze(shortcut_maze, dyna_params)
for i in range(len(dyna_params.methods)):
plt.plot(rewards[i, :], label=dyna_params.methods[i])
plt.xlabel('time steps')
plt.ylabel('cumulative reward')
plt.legend()
plt.savefig('../images/figure_8_5.png')
plt.close()
# Check whether state-action values are already optimal
def check_path(q_values, maze):
# get the length of optimal path
# 14 is the length of optimal path of the original maze
# the factor 1.2 relaxes the requirement: paths up to 20% longer than optimal are accepted
max_steps = 14 * maze.resolution * 1.2
state = maze.START_STATE
steps = 0
while state not in maze.GOAL_STATES:
action = np.argmax(q_values[state[0], state[1], :])
state, _ = maze.step(state, action)
steps += 1
if steps > max_steps:
return False
return True
# Example 8.4, mazes with different resolution
def example_8_4():
# get the original 6 * 9 maze
original_maze = Maze()
# set up the parameters for each algorithm
params_dyna = DynaParams()
params_dyna.planning_steps = 5
params_dyna.alpha = 0.5
params_dyna.gamma = 0.95
params_prioritized = DynaParams()
params_prioritized.theta = 0.0001
params_prioritized.planning_steps = 5
params_prioritized.alpha = 0.5
params_prioritized.gamma = 0.95
params = [params_prioritized, params_dyna]
# set up models for planning
models = [PriorityModel, TrivialModel]
method_names = ['Prioritized Sweeping', 'Dyna-Q']
# due to limitation of my machine, I can only perform experiments for 5 mazes
# assuming the 1st maze has w * h states, then k-th maze has w * h * k * k states
num_of_mazes = 5
# build all the mazes
mazes = [original_maze.extend_maze(i) for i in range(1, num_of_mazes + 1)]
methods = [prioritized_sweeping, dyna_q]
# My machine cannot afford too many runs...
runs = 5
# track the # of backups
backups = np.zeros((runs, 2, num_of_mazes))
for run in range(0, runs):
for i in range(0, len(method_names)):
for mazeIndex, maze in zip(range(0, len(mazes)), mazes):
print('run %d, %s, maze size %d' % (run, method_names[i], maze.WORLD_HEIGHT * maze.WORLD_WIDTH))
# initialize the state action values
q_value = np.zeros(maze.q_size)
# track steps / backups for each episode
steps = []
# generate the model
model = models[i]()
# play for an episode
while True:
steps.append(methods[i](q_value, model, maze, params[i]))
# print best actions w.r.t. current state-action values
# printActions(currentStateActionValues, maze)
# check whether the (relaxed) optimal path is found
if check_path(q_value, maze):
break
# update the total steps / backups for this maze
backups[run, i, mazeIndex] = np.sum(steps)
backups = backups.mean(axis=0)
# Dyna-Q performs several backups per step
backups[1, :] *= params_dyna.planning_steps + 1
for i in range(0, len(method_names)):
plt.plot(np.arange(1, num_of_mazes + 1), backups[i, :], label=method_names[i])
plt.xlabel('maze resolution factor')
plt.ylabel('backups until optimal solution')
plt.yscale('log')
plt.legend()
plt.savefig('../images/example_8_4.png')
plt.close()
if __name__ == '__main__':
figure_8_2()
figure_8_4()
figure_8_5()
example_8_4()
================================================
FILE: chapter08/trajectory_sampling.py
================================================
#######################################################################
# Copyright (C) #
# 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
# select the non-interactive backend before pyplot is imported
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
# 2 actions
ACTIONS = [0, 1]
# probability that any transition ends the episode (with reward 0)
TERMINATION_PROB = 0.1
# maximum expected updates
MAX_STEPS = 20000
# epsilon greedy for behavior policy
EPSILON = 0.1
# break ties randomly
def argmax(value):
max_q = np.max(value)
return np.random.choice([a for a, q in enumerate(value) if q == max_q])
class Task:
# @n_states: number of non-terminal states
# @b: branching factor, the number of possible next states for each state-action pair
# Each episode starts with state 0, and state n_states is a terminal state
def __init__(self, n_states, b):
self.n_states = n_states
self.b = b
# transition matrix, each state-action pair leads to b possible states
self.transition = np.random.randint(n_states, size=(n_states, len(ACTIONS), b))
# it is not clear how to set the reward, I use a unit normal distribution here
# reward is determined by (s, a, s')
self.reward = np.random.randn(n_states, len(ACTIONS), b)
def step(self, state, action):
if np.random.rand() < TERMINATION_PROB:
return self.n_states, 0
next_ = np.random.randint(self.b)
return self.transition[state, action, next_], self.reward[state, action, next_]
# Evaluate the value of the start state for the greedy policy
# derived from @q under the MDP @task
def evaluate_pi(q, task):
# use Monte Carlo method to estimate the state value
runs = 1000
returns = []
for _ in range(runs):
rewards = 0
state = 0
while state < task.n_states:
action = argmax(q[state])
state, r = task.step(state, action)
rewards += r
returns.append(rewards)
return np.mean(returns)
# perform expected update from a uniform state-action distribution of the MDP @task
# evaluate the learned q value every @eval_interval steps
def uniform(task, eval_interval):
performance = []
q = np.zeros((task.n_states, 2))
for step in tqdm(range(MAX_STEPS)):
state = step // len(ACTIONS) % task.n_states
action = step % len(ACTIONS)
next_states = task.transition[state, action]
q[state, action] = (1 - TERMINATION_PROB) * np.mean(
task.reward[state, action] + np.max(q[next_states, :], axis=1))
if step % eval_interval == 0:
v_pi = evaluate_pi(q, task)
performance.append([step, v_pi])
return zip(*performance)
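# Illustrative sketch (not part of the original code): the expected update used by
# uniform() above and by on_policy() below, isolated as a function. With probability
# `termination_prob` the episode ends with reward 0; otherwise each of the b successor
# states is equally likely. The name `expected_update_sketch` is hypothetical.
import numpy as np


def expected_update_sketch(q, rewards, next_states, termination_prob=0.1):
    # rewards: shape (b,), reward for each successor; next_states: shape (b,), successor indices
    best_next = np.max(q[next_states, :], axis=1)
    return (1 - termination_prob) * np.mean(rewards + best_next)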
# perform expected update from an on-policy distribution of the MDP @task
# evaluate the learned q value every @eval_interval steps
def on_policy(task, eval_interval):
performance = []
q = np.zeros((task.n_states, 2))
state = 0
for step in tqdm(range(MAX_STEPS)):
if np.random.rand() < EPSILON:
action = np.random.choice(ACTIONS)
else:
action = argmax(q[state])
next_state, _ = task.step(state, action)
next_states = task.transition[state, action]
q[state, action] = (1 - TERMINATION_PROB) * np.mean(
task.reward[state, action] + np.max(q[next_states, :], axis=1))
if next_state == task.n_states:
next_state = 0
state = next_state
if step % eval_interval == 0:
v_pi = evaluate_pi(q, task)
performance.append([step, v_pi])
return zip(*performance)
def figure_8_8():
num_states = [1000, 10000]
branch = [1, 3, 10]
methods = [on_policy, uniform]
# average across 30 tasks
n_tasks = 30
# number of evaluation points
x_ticks = 100
plt.figure(figsize=(10, 20))
for i, n in enumerate(num_states):
plt.subplot(2, 1, i+1)
for b in branch:
tasks = [Task(n, b) for _ in range(n_tasks)]
for method in methods:
steps = None
value = []
for task in tasks:
steps, v = method(task, MAX_STEPS // x_ticks)
value.append(v)
value = np.mean(np.asarray(value), axis=0)
plt.plot(steps, value, label=f'b = {b}, {method.__name__}')
plt.title(f'{n} states')
plt.ylabel('value of start state')
plt.legend()
plt.subplot(2, 1, 2)
plt.xlabel('computation time, in expected updates')
plt.savefig('../images/figure_8_8.png')
plt.close()
if __name__ == '__main__':
figure_8_8()
================================================
FILE: chapter09/random_walk.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
# # of states except for terminal states
N_STATES = 1000
# all states
STATES = np.arange(1, N_STATES + 1)
# start from a central state
START_STATE = 500
# terminal states
END_STATES = [0, N_STATES + 1]
# possible actions
ACTION_LEFT = -1
ACTION_RIGHT = 1
ACTIONS = [ACTION_LEFT, ACTION_RIGHT]
# maximum stride for an action
STEP_RANGE = 100
def compute_true_value():
# true state value, just a promising guess
true_value = np.arange(-1001, 1003, 2) / 1001.0
# Dynamic programming to find the true state values, based on the promising guess above
# Assume all rewards are 0, given that we have already given value -1 and 1 to terminal states
while True:
old_value = np.copy(true_value)
for state in STATES:
true_value[state] = 0
for action in ACTIONS:
for step in range(1, STEP_RANGE + 1):
step *= action
next_state = state + step
next_state = max(min(next_state, N_STATES + 1), 0)
# asynchronous update for faster convergence
true_value[state] += 1.0 / (2 * STEP_RANGE) * true_value[next_state]
error = np.sum(np.abs(old_value - true_value))
if error < 1e-2:
break
# correct the state value for terminal states to 0
true_value[0] = true_value[-1] = 0
return true_value
# take an @action at @state, return new state and reward for this transition
def step(state, action):
step = np.random.randint(1, STEP_RANGE + 1)
step *= action
state += step
state = max(min(state, N_STATES + 1), 0)
if state == 0:
reward = -1
elif state == N_STATES + 1:
reward = 1
else:
reward = 0
return state, reward
# get an action, following random policy
def get_action():
if np.random.binomial(1, 0.5) == 1:
return 1
return -1
# a wrapper class for aggregation value function
class ValueFunction:
# @num_of_groups: # of aggregations
def __init__(self, num_of_groups):
self.num_of_groups = num_of_groups
self.group_size = N_STATES // num_of_groups
# thetas
self.params = np.zeros(num_of_groups)
# get the value of @state
def value(self, state):
if state in END_STATES:
return 0
group_index = (state - 1) // self.group_size
return self.params[group_index]
# update parameters
# @delta: step size * (target - old estimation)
# @state: state of current sample
def update(self, delta, state):
group_index = (state - 1) // self.group_size
self.params[group_index] += delta
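# Illustrative sketch (not part of the original code): how state aggregation maps a
# state in 1..N_STATES to its group, as done in ValueFunction.value and
# ValueFunction.update. The name `group_index_sketch` and its default arguments
# (1000 states, 10 groups) are assumptions chosen to match Figure 9.1.
def group_index_sketch(state, n_states=1000, num_of_groups=10):
    group_size = n_states // num_of_groups
    # states are 1-based, so shift by one before integer division
    return (state - 1) // group_size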
# a wrapper class for tile coding value function
class TilingsValueFunction:
# @numOfTilings: # of tilings
# @tileWidth: each tiling has several tiles, this parameter specifies the width of each tile
# @tilingOffset: offset between the starting positions of consecutive tilings
def __init__(self, numOfTilings, tileWidth, tilingOffset):
self.numOfTilings = numOfTilings
self.tileWidth = tileWidth
self.tilingOffset = tilingOffset
# To make sure that each state is covered by the same number of tiles,
# we need one more tile for each tiling
self.tilingSize = N_STATES // tileWidth + 1
# weight for each tile
self.params = np.zeros((self.numOfTilings, self.tilingSize))
# For performance, only track the starting position for each tiling
# As we have one more tile for each tiling, the starting position will be negative
self.tilings = np.arange(-tileWidth + 1, 0, tilingOffset)
# get the value of @state
def value(self, state):
stateValue = 0.0
# go through all the tilings
for tilingIndex in range(0, len(self.tilings)):
# find the active tile in current tiling
tileIndex = (state - self.tilings[tilingIndex]) // self.tileWidth
stateValue += self.params[tilingIndex, tileIndex]
return stateValue
# update parameters
# @delta: step size * (target - old estimation)
# @state: state of current sample
def update(self, delta, state):
# each state is covered by same number of tilings
# so the delta should be divided equally into each tiling (tile)
delta /= self.numOfTilings
# go through all the tilings
for tilingIndex in range(0, len(self.tilings)):
# find the active tile in current tiling
tileIndex = (state - self.tilings[tilingIndex]) // self.tileWidth
self.params[tilingIndex, tileIndex] += delta
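# Illustrative sketch (not part of the original code): which tile is active in each
# tiling for a given state, following the index arithmetic in TilingsValueFunction.
# The name `active_tiles_sketch` is hypothetical; the defaults (tile width 200,
# offset 4) are taken from figure_9_10 and give 50 tilings.
import numpy as np


def active_tiles_sketch(state, tile_width=200, tiling_offset=4):
    # starting position of every tiling; all are negative because of the extra tile
    starts = np.arange(-tile_width + 1, 0, tiling_offset)
    # one active tile index per tiling
    return (state - starts) // tile_width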
# a wrapper class for polynomial / Fourier -based value function
POLYNOMIAL_BASES = 0
FOURIER_BASES = 1
class BasesValueFunction:
# @order: # of bases, each function also has one more constant parameter (called bias in machine learning)
# @type: polynomial bases or Fourier bases
def __init__(self, order, type):
self.order = order
self.weights = np.zeros(order + 1)
# set up bases function
self.bases = []
if type == POLYNOMIAL_BASES:
for i in range(0, order + 1):
self.bases.append(lambda s, i=i: pow(s, i))
elif type == FOURIER_BASES:
for i in range(0, order + 1):
self.bases.append(lambda s, i=i: np.cos(i * np.pi * s))
# get the value of @state
def value(self, state):
# map the state space into [0, 1]
state /= float(N_STATES)
# get the feature vector
feature = np.asarray([func(state) for func in self.bases])
return np.dot(self.weights, feature)
def update(self, delta, state):
# map the state space into [0, 1]
state /= float(N_STATES)
# get derivative value
derivative_value = np.asarray([func(state) for func in self.bases])
self.weights += delta * derivative_value
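# Illustrative sketch (not part of the original code): the Fourier feature vector that
# BasesValueFunction builds for a state, cos(i * pi * s) for i = 0..order with s scaled
# into [0, 1]. The name `fourier_features_sketch` and the default n_states=1000 are
# assumptions for illustration.
import numpy as np


def fourier_features_sketch(state, order, n_states=1000):
    # map the state into [0, 1]
    s = state / float(n_states)
    # one cosine feature per frequency, including the constant term at i = 0
    return np.cos(np.arange(order + 1) * np.pi * s)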
# gradient Monte Carlo algorithm
# @value_function: an instance of class ValueFunction
# @alpha: step size
# @distribution: array to store the distribution statistics
def gradient_monte_carlo(value_function, alpha, distribution=None):
state = START_STATE
trajectory = [state]
# We assume gamma = 1, so return is just the same as the latest reward
reward = 0.0
while state not in END_STATES:
action = get_action()
next_state, reward = step(state, action)
trajectory.append(next_state)
state = next_state
# Gradient update for each state in this trajectory
for state in trajectory[:-1]:
delta = alpha * (reward - value_function.value(state))
value_function.update(delta, state)
if distribution is not None:
distribution[state] += 1
# semi-gradient n-step TD algorithm
# @value_function: an instance of class ValueFunction
# @n: # of steps
# @alpha: step size
def semi_gradient_temporal_difference(value_function, n, alpha):
# initial starting state
state = START_STATE
# arrays to store states and rewards for an episode
# space isn't a major consideration, so I didn't use the mod trick
states = [state]
rewards = [0]
# track the time
time = 0
# the length of this episode
T = float('inf')
while True:
# go to next time step
time += 1
if time < T:
# choose an action randomly
action = get_action()
next_state, reward = step(state, action)
# store new state and new reward
states.append(next_state)
rewards.append(reward)
if next_state in END_STATES:
T = time
# get the time of the state to update
update_time = time - n
if update_time >= 0:
returns = 0.0
# calculate corresponding rewards
for t in range(update_time + 1, min(T, update_time + n) + 1):
returns += rewards[t]
# add state value to the return
if update_time + n <= T:
returns += value_function.value(states[update_time + n])
state_to_update = states[update_time]
# update the value function
if state_to_update not in END_STATES:
delta = alpha * (returns - value_function.value(state_to_update))
value_function.update(delta, state_to_update)
if update_time == T - 1:
break
state = next_state
# Figure 9.1, gradient Monte Carlo algorithm
def figure_9_1(true_value):
episodes = int(1e5)
alpha = 2e-5
# we have 10 aggregations in this example, each has 100 states
value_function = ValueFunction(10)
distribution = np.zeros(N_STATES + 2)
for ep in tqdm(range(episodes)):
gradient_monte_carlo(value_function, alpha, distribution)
distribution /= np.sum(distribution)
state_values = [value_function.value(i) for i in STATES]
plt.figure(figsize=(10, 20))
plt.subplot(2, 1, 1)
plt.plot(STATES, state_values, label='Approximate MC value')
plt.plot(STATES, true_value[1: -1], label='True value')
plt.xlabel('State')
plt.ylabel('Value')
plt.legend()
plt.subplot(2, 1, 2)
plt.plot(STATES, distribution[1: -1], label='State distribution')
plt.xlabel('State')
plt.ylabel('Distribution')
plt.legend()
plt.savefig('../images/figure_9_1.png')
plt.close()
# semi-gradient TD on 1000-state random walk
def figure_9_2_left(true_value):
episodes = int(1e5)
alpha = 2e-4
value_function = ValueFunction(10)
for ep in tqdm(range(episodes)):
semi_gradient_temporal_difference(value_function, 1, alpha)
stateValues = [value_function.value(i) for i in STATES]
plt.plot(STATES, stateValues, label='Approximate TD value')
plt.plot(STATES, true_value[1: -1], label='True value')
plt.xlabel('State')
plt.ylabel('Value')
plt.legend()
# different alphas and steps for semi-gradient TD
def figure_9_2_right(true_value):
# all possible steps
steps = np.power(2, np.arange(0, 10))
# all possible alphas
alphas = np.arange(0, 1.1, 0.1)
# each run has 10 episodes
episodes = 10
# perform 100 independent runs
runs = 100
# track the errors for each (step, alpha) combination
errors = np.zeros((len(steps), len(alphas)))
for run in tqdm(range(runs)):
for step_ind, step in zip(range(len(steps)), steps):
for alpha_ind, alpha in zip(range(len(alphas)), alphas):
# we have 20 aggregations in this example
value_function = ValueFunction(20)
for ep in range(0, episodes):
semi_gradient_temporal_difference(value_function, step, alpha)
# calculate the RMS error
state_value = np.asarray([value_function.value(i) for i in STATES])
errors[step_ind, alpha_ind] += np.sqrt(np.sum(np.power(state_value - true_value[1: -1], 2)) / N_STATES)
# take average
errors /= episodes * runs
# truncate the error
for i in range(len(steps)):
plt.plot(alphas, errors[i, :], label='n = ' + str(steps[i]))
plt.xlabel('alpha')
plt.ylabel('RMS error')
plt.ylim([0.25, 0.55])
plt.legend()
def figure_9_2(true_value):
plt.figure(figsize=(10, 20))
plt.subplot(2, 1, 1)
figure_9_2_left(true_value)
plt.subplot(2, 1, 2)
figure_9_2_right(true_value)
plt.savefig('../images/figure_9_2.png')
plt.close()
# Figure 9.5, Fourier basis and polynomials
def figure_9_5(true_value):
# my machine can only afford 1 run
runs = 1
episodes = 5000
# # of bases
orders = [5, 10, 20]
alphas = [1e-4, 5e-5]
labels = [['polynomial basis'] * 3, ['fourier basis'] * 3]
# track errors for each episode
errors = np.zeros((len(alphas), len(orders), episodes))
for run in range(runs):
for i in range(len(orders)):
value_functions = [BasesValueFunction(orders[i], POLYNOMIAL_BASES), BasesValueFunction(orders[i], FOURIER_BASES)]
for j in range(len(value_functions)):
for episode in tqdm(range(episodes)):
# gradient Monte Carlo algorithm
gradient_monte_carlo(value_functions[j], alphas[j])
# get state values under current value function
state_values = [value_functions[j].value(state) for state in STATES]
# get the root-mean-squared error
errors[j, i, episode] += np.sqrt(np.mean(np.power(true_value[1: -1] - state_values, 2)))
# average over independent runs
errors /= runs
for i in range(len(alphas)):
for j in range(len(orders)):
plt.plot(errors[i, j, :], label='%s order = %d' % (labels[i][j], orders[j]))
plt.xlabel('Episodes')
# The book plots RMSVE, which is RMSE weighted by a state distribution
plt.ylabel('RMSE')
plt.legend()
plt.savefig('../images/figure_9_5.png')
plt.close()
# Figure 9.10, it will take quite a while
def figure_9_10(true_value):
# My machine can only afford one run, thus the curve isn't so smooth
runs = 1
# number of episodes
episodes = 5000
num_of_tilings = 50
# each tile will cover 200 states
tile_width = 200
# offset between the starting positions of consecutive tilings
tiling_offset = 4
labels = ['tile coding (50 tilings)', 'state aggregation (one tiling)']
# track errors for each episode
errors = np.zeros((len(labels), episodes))
for run in range(runs):
# initialize value functions for multiple tilings and single tiling
value_functions = [TilingsValueFunction(num_of_tilings, tile_width, tiling_offset),
ValueFunction(N_STATES // tile_width)]
for i in range(len(value_functions)):
for episode in tqdm(range(episodes)):
# I use an alpha that decays with the episode number instead of a small fixed alpha.
# With a small fixed alpha, I don't think 5000 episodes are enough to learn the many
# parameters of the multiple tilings.
# The asymptotic performance of the single tiling is unchanged under a decaying alpha,
# whereas the asymptotic performance of the multiple tilings improves significantly.
alpha = 1.0 / (episode + 1)
# gradient Monte Carlo algorithm
gradient_monte_carlo(value_functions[i], alpha)
# get state values under current value function
state_values = [value_functions[i].value(state) for state in STATES]
# get the root-mean-squared error
errors[i][episode] += np.sqrt(np.mean(np.power(true_value[1: -1] - state_values, 2)))
# average over independent runs
errors /= runs
for i in range(0, len(labels)):
plt.plot(errors[i], label=labels[i])
plt.xlabel('Episodes')
# The book plots RMSVE, which is RMSE weighted by a state distribution
plt.ylabel('RMSE')
plt.legend()
plt.savefig('../images/figure_9_10.png')
plt.close()
if __name__ == '__main__':
true_value = compute_true_value()
figure_9_1(true_value)
figure_9_2(true_value)
figure_9_5(true_value)
figure_9_10(true_value)
================================================
FILE: chapter09/square_wave.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
# wrapper class for an interval
# readability is more important than efficiency, so I won't use many tricks
class Interval:
# [@left, @right)
def __init__(self, left, right):
self.left = left
self.right = right
# whether a point is in this interval
def contain(self, x):
return self.left <= x < self.right
# length of this interval
def size(self):
return self.right - self.left
# domain of the square wave, [0, 2)
DOMAIN = Interval(0.0, 2.0)
# square wave function
def square_wave(x):
if 0.5 < x < 1.5:
return 1
return 0
# get @n samples randomly from the square wave
def sample(n):
samples = []
for i in range(0, n):
x = np.random.uniform(DOMAIN.left, DOMAIN.right)
y = square_wave(x)
samples.append([x, y])
return samples
# wrapper class for value function
class ValueFunction:
# @domain: domain of this function, an instance of Interval
# @alpha: basic step size for one update
def __init__(self, feature_width, domain=DOMAIN, alpha=0.2, num_of_features=50):
self.feature_width = feature_width
self.num_of_features = num_of_features
self.features = []
self.alpha = alpha
self.domain = domain
# there are many ways to place those feature windows,
# following is just one possible way
step = (domain.size() - feature_width) / (num_of_features - 1)
left = domain.left
for i in range(0, num_of_features - 1):
self.features.append(Interval(left, left + feature_width))
left += step
self.features.append(Interval(left, domain.right))
# initialize weight for each feature
self.weights = np.zeros(num_of_features)
# for point @x, return the indices of corresponding feature windows
def get_active_features(self, x):
active_features = []
for i in range(0, len(self.features)):
if self.features[i].contain(x):
active_features.append(i)
return active_features
# estimate the value for point @x
def value(self, x):
active_features = self.get_active_features(x)
return np.sum(self.weights[active_features])
# update weights given sample of point @x
# @delta: prediction error, y - value(x)
def update(self, delta, x):
active_features = self.get_active_features(x)
delta *= self.alpha / len(active_features)
for index in active_features:
self.weights[index] += delta
# train @value_function with a set of samples @samples
def approximate(samples, value_function):
for x, y in samples:
delta = y - value_function.value(x)
value_function.update(delta, x)
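# Illustrative sketch, not part of the original file.
# ValueFunction.update splits the step size evenly across the active feature
# windows, so a single update moves the estimate at x by alpha * delta no matter
# how many windows overlap x. sketch_coarse_update is a hypothetical stand-alone
# helper, introduced here only to show that arithmetic on a plain weight array.
import numpy as np

def sketch_coarse_update(weights, active, target, alpha):
    """Return a copy of `weights` after one coarse-coding update toward `target`."""
    weights = np.array(weights, dtype=float)
    # the estimate is the sum of the weights of the active windows
    estimate = weights[active].sum()
    # each active weight receives an equal share of the step
    weights[active] += alpha * (target - estimate) / len(active)
    return weights

# Example use with assumed values: three overlapping windows, target 1, alpha 0.2.
# new_w = sketch_coarse_update(np.zeros(5), [1, 2, 3], 1.0, 0.2)
# new_w[[1, 2, 3]].sum() is then 0.2 up to floating-point rounding.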
# Figure 9.8
def figure_9_8():
num_of_samples = [10, 40, 160, 640, 2560, 10240]
feature_widths = [0.2, 0.4, 1.0]
plt.figure(figsize=(30, 20))
axis_x = np.arange(DOMAIN.left, DOMAIN.right, 0.02)
for index, num_of_sample in enumerate(num_of_samples):
print(num_of_sample, 'samples')
samples = sample(num_of_sample)
value_functions = [ValueFunction(feature_width) for feature_width in feature_widths]
plt.subplot(2, 3, index + 1)
plt.title('%d samples' % (num_of_sample))
for value_function in value_functions:
approximate(samples, value_function)
values = [value_function.value(x) for x in axis_x]
plt.plot(axis_x, values, label='feature width %.01f' % (value_function.feature_width))
plt.legend()
plt.savefig('../images/figure_9_8.png')
plt.close()
if __name__ == '__main__':
figure_9_8()
================================================
FILE: chapter10/access_control.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
from mpl_toolkits.mplot3d.axes3d import Axes3D
from math import floor
import seaborn as sns
#######################################################################
# The following tile-coding utilities are from Rich Sutton.
# To make each file self-contained, I copied them from
# http://incompleteideas.net/tiles/tiles3.py-remove
# with some naming convention changes
#
# Tile coding starts
class IHT:
"Structure to handle collisions"
def __init__(self, size_val):
self.size = size_val
self.overfull_count = 0
self.dictionary = {}
def count(self):
return len(self.dictionary)
def full(self):
return len(self.dictionary) >= self.size
def get_index(self, obj, read_only=False):
d = self.dictionary
if obj in d:
return d[obj]
elif read_only:
return None
size = self.size
count = self.count()
if count >= size:
if self.overfull_count == 0: print('IHT full, starting to allow collisions')
self.overfull_count += 1
return hash(obj) % self.size
else:
d[obj] = count
return count
def hash_coords(coordinates, m, read_only=False):
if isinstance(m, IHT): return m.get_index(tuple(coordinates), read_only)
if isinstance(m, int): return hash(tuple(coordinates)) % m
if m is None: return coordinates
def tiles(iht_or_size, num_tilings, floats, ints=None, read_only=False):
"""returns num-tilings tile indices corresponding to the floats and ints"""
if ints is None:
ints = []
qfloats = [floor(f * num_tilings) for f in floats]
tiles = []
for tiling in range(num_tilings):
tilingX2 = tiling * 2
coords = [tiling]
b = tiling
for q in qfloats:
coords.append((q + b) // num_tilings)
b += tilingX2
coords.extend(ints)
tiles.append(hash_coords(coords, iht_or_size, read_only))
return tiles
# Tile coding ends
#######################################################################
# possible priorities
PRIORITIES = np.arange(0, 4)
# reward for each priority
REWARDS = np.power(2, np.arange(0, 4))
# possible actions
REJECT = 0
ACCEPT = 1
ACTIONS = [REJECT, ACCEPT]
# total number of servers
NUM_OF_SERVERS = 10
# at each time step, a busy server will be free w.p. 0.06
PROBABILITY_FREE = 0.06
# step size for learning state-action value
ALPHA = 0.01
# step size for learning average reward
BETA = 0.01
# probability for exploration
EPSILON = 0.1
# a wrapper class for differential semi-gradient Sarsa state-action function
class ValueFunction:
# In this example I use the tiling software instead of implementing standard tiling myself.
# The important point is that tiling is only a map from (state, action) to a series of indices.
# It doesn't matter whether the indices have meaning, only that the map satisfies certain properties.
# See the following webpage for more information:
# http://incompleteideas.net/sutton/tiles/tiles3.html
# @alpha: step size for learning state-action value
# @beta: step size for learning average reward
def __init__(self, num_of_tilings, alpha=ALPHA, beta=BETA):
self.num_of_tilings = num_of_tilings
self.max_size = 2048
self.hash_table = IHT(self.max_size)
self.weights = np.zeros(self.max_size)
# state features need scaling to fit the tile software
self.server_scale = self.num_of_tilings / float(NUM_OF_SERVERS)
self.priority_scale = self.num_of_tilings / float(len(PRIORITIES) - 1)
self.average_reward = 0.0
# divide step size equally to each tiling
self.alpha = alpha / self.num_of_tilings
self.beta = beta
# get indices of active tiles for given state and action
def get_active_tiles(self, free_servers, priority, action):
active_tiles = tiles(self.hash_table, self.num_of_tilings,
[self.server_scale * free_servers, self.priority_scale * priority],
[action])
return active_tiles
# estimate the value of given state and action without subtracting average
def value(self, free_servers, priority, action):
active_tiles = self.get_active_tiles(free_servers, priority, action)
return np.sum(self.weights[active_tiles])
# estimate the value of given state without subtracting average
def state_value(self, free_servers, priority):
values = [self.value(free_servers, priority, action) for action in ACTIONS]
# if no free server, can't accept
if free_servers == 0:
return values[REJECT]
return np.max(values)
# learn with given sequence
def learn(self, free_servers, priority, action, new_free_servers, new_priority, new_action, reward):
active_tiles = self.get_active_tiles(free_servers, priority, action)
estimation = np.sum(self.weights[active_tiles])
delta = reward - self.average_reward + self.value(new_free_servers, new_priority, new_action) - estimation
# update average reward
self.average_reward += self.beta * delta
delta *= self.alpha
for active_tile in active_tiles:
self.weights[active_tile] += delta
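# Illustrative sketch, not part of the original file.
# ValueFunction.learn above is driven by the differential TD error
#     delta = R - average_reward + q(S', A') - q(S, A)
# which is used both to adjust the average-reward estimate (step size beta) and
# to adjust the weights (step size alpha). sketch_differential_td_step is a
# hypothetical scalar helper that reproduces only this arithmetic; it does not
# touch the tile-coded weights.
def sketch_differential_td_step(reward, average_reward, q_current, q_next, beta):
    """Return (delta, updated average reward) for one differential Sarsa step."""
    delta = reward - average_reward + q_next - q_current
    return delta, average_reward + beta * delta

# Example use with assumed values:
# delta, new_average = sketch_differential_td_step(4.0, 1.0, 2.0, 3.0, 0.5)
# gives delta = 4.0 and new_average = 3.0.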
# get action based on epsilon-greedy policy and @value_function
def get_action(free_servers, priority, value_function):
# if no free server, can't accept
if free_servers == 0:
return REJECT
if np.random.binomial(1, EPSILON) == 1:
return np.random.choice(ACTIONS)
values = [value_function.value(free_servers, priority, action) for action in ACTIONS]
return np.random.choice([action_ for action_, value_ in enumerate(values) if value_ == np.max(values)])
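# Illustrative sketch, not part of the original file.
# The greedy branch of get_action breaks ties at random: it collects every
# action whose value equals the maximum and samples one of them uniformly.
# sketch_greedy_with_random_ties is a hypothetical helper showing the same idea
# on a plain list of values.
import numpy as np

def sketch_greedy_with_random_ties(values):
    """Return the index of a maximal entry of `values`, ties broken uniformly at random."""
    values = np.asarray(values)
    # indices of all entries that attain the maximum
    best = np.flatnonzero(values == values.max())
    return int(np.random.choice(best))

# Example use with assumed values:
# sketch_greedy_with_random_ties([1, 3, 3]) returns either 1 or 2.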
# take an action
def take_action(free_servers, priority, action):
if free_servers > 0 and action == ACCEPT:
free_servers -= 1
reward = REWARDS[priority] * action
# some busy servers may become free
busy_servers = NUM_OF_SERVERS - free_servers
free_servers += np.random.binomial(busy_servers, PROBABILITY_FREE)
return free_servers, np.random.choice(PRIORITIES), reward
# differential semi-gradient Sarsa
# @value_function: state-action value function to learn
# @max_steps: step limit in the continuing task
def differential_semi_gradient_sarsa(value_function, max_steps):
current_free_servers = NUM_OF_SERVERS
current_priority = np.random.choice(PRIORITIES)
current_action = get_action(current_free_servers, current_priority, value_function)
# count how often each number of free servers is visited
freq = np.zeros(NUM_OF_SERVERS + 1)
for _ in tqdm(range(max_steps)):
freq[current_free_servers] += 1
new_free_servers, new_priority, reward = take_action(current_free_servers, current_priority, current_action)
new_action = get_action(new_free_servers, new_priority, value_function)
value_function.learn(current_free_servers, current_priority, current_action,
new_free_servers, new_priority, new_action, reward)
current_free_servers = new_free_servers
current_priority = new_priority
current_action = new_action
print('Frequency of number of free servers:')
print(freq / max_steps)
# Figure 10.5, Differential semi-gradient Sarsa on the access-control queuing task
def figure_10_5():
max_steps = int(1e6)
# use tile coding with 8 tilings
num_of_tilings = 8
value_function = ValueFunction(num_of_tilings)
differential_semi_gradient_sarsa(value_function, max_steps)
values = np.zeros((len(PRIORITIES), NUM_OF_SERVERS + 1))
for priority in PRIORITIES:
for free_servers in range(NUM_OF_SERVERS + 1):
values[priority, free_servers] = value_function.state_value(free_servers, priority)
fig = plt.figure(figsize=(10, 20))
plt.subplot(2, 1, 1)
for priority in PRIORITIES:
plt.plot(range(NUM_OF_SERVERS + 1), values[priority, :], label='priority %d' % (REWARDS[priority]))
plt.xlabel('Number of free servers')
plt.ylabel('Differential value of best action')
plt.legend()
ax = fig.add_subplot(2, 1, 2)
policy = np.zeros((len(PRIORITIES), NUM_OF_SERVERS + 1))
for priority in PRIORITIES:
for free_servers in range(NUM_OF_SERVERS + 1):
values = [value_function.value(free_servers, priority, action) for action in ACTIONS]
if free_servers == 0:
policy[priority, free_servers] = REJECT
else:
policy[priority, free_servers] = np.argmax(values)
fig = sns.heatmap(policy, cmap="YlGnBu", ax=ax, xticklabels=range(NUM_OF_SERVERS + 1), yticklabels=PRIORITIES)
fig.set_title('Policy (0 Reject, 1 Accept)')
fig.set_xlabel('Number of free servers')
fig.set_ylabel('Priority')
plt.savefig('../images/figure_10_5.png')
plt.close()
if __name__ == '__main__':
figure_10_5()
================================================
FILE: chapter10/mountain_car.py
================================================
#######################################################################
# Copyright (C) #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
from mpl_toolkits.mplot3d.axes3d import Axes3D
from math import floor
#######################################################################
# The following tile-coding utilities are from Rich Sutton.
# To make each file self-contained, I copied them from
# http://incompleteideas.net/tiles/tiles3.py-remove
# with some naming convention changes
#
# Tile coding starts
class IHT:
"Structure to handle collisions"
def __init__(self, size_val):
self.size = size_val
self.overfull_count = 0
self.dictionary = {}
def count(self):
return len(self.dictionary)
def full(self):
return len(self.dictionary) >= self.size
def get_index(self, obj, read_only=False):
d = self.dictionary
if obj in d:
return d[obj]
elif read_only:
return None
size = self.size
count = self.count()
if count >= size:
if self.overfull_count == 0: print('IHT full, starting to allow collisions')
self.overfull_count += 1
return hash(obj) % self.size
else:
d[obj] = count
return count
def hash_coords(coordinates, m, read_only=False):
if isinstance(m, IHT): return m.get_index(tuple(coordinates), read_only)
if isinstance(m, int): return hash(tuple(coordinates)) % m
if m is None: return coordinates
def tiles(iht_or_size, num_tilings, floats, ints=None, read_only=False):
"""returns num-tilings tile indices corresponding to the floats and ints"""
if ints is None:
ints = []
qfloats = [floor(f * num_tilings) for f in floats]
tiles = []
for tiling in range(num_tilings):
tilingX2 = tiling * 2
coords = [tiling]
b = tiling
for q in qfloats:
coords.append((q + b) // num_tilings)
b += tilingX2
coords.extend(ints)
tiles.append(hash_coords(coords, iht_or_size, read_only))
return tiles
# Tile coding ends
#######################################################################
# all possible actions
ACTION_REVERSE = -1
ACTION_ZERO = 0
ACTION_FORWARD = 1
# order is important
ACTIONS = [ACTION_REVERSE, ACTION_ZERO, ACTION_FORWARD]
# bound for position and velocity
POSITION_MIN = -1.2
POSITION_MAX = 0.5
VELOCITY_MIN = -0.07
VELOCITY_MAX = 0.07
# use optimistic initial value, so it's ok to set epsilon to 0
EPSILON = 0
# take an @action at @position and @velocity
# @return: new position, new velocity, reward (always -1)
def step(position, velocity, action):
new_velocity = velocity + 0.001 * action - 0.0025 * np.cos(3 * position)
new_velocity = min(max(VELOCITY_MIN, new_velocity), VELOCITY_MAX)
new_position = position + new_velocity
new_position = min(max(POSITION_MIN, new_position), POSITION_MAX)
reward = -1.0
if new_position == POSITION_MIN:
new_velocity = 0.0
return new_position, new_velocity, reward
# wrapper class for state action value function
class ValueFunction:
# In this example I use the tiling software instead of implementing standard tiling myself.
# The important point is that tiling is only a map from (state, action) to a series of indices.
# It doesn't matter whether the indices have meaning, only that the map satisfies certain properties.
# See the following webpage for more information:
# http://incompleteideas.net/sutton/tiles/tiles3.html
# @max_size: the maximum # of indices
def __init__(self, step_size, num_of_tilings=8, max_size=2048):
self.max_size = max_size
self.num_of_tilings = num_of_tilings
# divide step size equally to each tiling
self.step_size = step_size / num_of_tilings
self.hash_table = IHT(max_size)
# weight for each tile
self.weights = np.zeros(max_size)
# position and velocity need scaling to fit the tile software
self.position_scale = self.num_of_tilings / (POSITION_MAX - POSITION_MIN)
self.velocity_scale = self.num_of_tilings / (VELOCITY_MAX - VELOCITY_MIN)
# get indices of active tiles for given state and action
def get_active_tiles(self, position, velocity, action):
# I think position_scale * (position - POSITION_MIN) would be a good normalization.
# However, position_scale * POSITION_MIN is a constant, so it's ok to ignore it.
active_tiles = tiles(self.hash_table, self.num_of_tilings,
[self.position_scale * position, self.velocity_scale * velocity],
[action])
return active_tiles
# estimate the value of given state and action
def value(self, position, velocity, action):
if position == POSITION_MAX:
return 0.0
active_tiles = self.get_active_tiles(position, velocity, action)
return np.sum(self.weights[active_tiles])
# learn with given state, action and target
def learn(self, position, velocity, action, target):
active_tiles = self.get_active_tiles(position, velocity, action)
estimation = np.sum(self.weights[active_tiles])
delta = self.step_size * (target - estimation)
for active_tile in active_tiles:
self.weights[active_tile] += delta
# estimated number of steps to reach the goal under the current state-action value function
def cost_to_go(self, position, velocity):
costs = []
for action in ACTIONS:
costs.append(self.value(position, velocity, action))
return -np.max(costs)
# get action at @position and @velocity based on epsilon-greedy policy and @value_function
def get_action(position, velocity, value_function):
if np.random.binomial(1, EPSILON) == 1:
return np.random.choice(ACTIONS)
values = []
for action in ACTIONS:
values.append(value_function.value(position, velocity, action))
return np.random.choice([action_ for action_, value_ in enumerate(values) if value_ == np.max(values)]) - 1
# semi-gradient n-step Sarsa
# @value_function: state-action value function to learn
# @n: # of steps
def semi_gradient_n_step_sarsa(value_function, n=1):
# start at a random position around the bottom of the valley
current_position = np.random.uniform(-0.6, -0.4)
# initial velocity is 0
current_velocity = 0.0
# get initial action
current_action = get_action(current_position, current_velocity, value_function)
# track previous position, velocity, action and reward
positions = [current_position]
velocities = [current_velocity]
actions = [current_action]
rewards = [0.0]
# track the time
time = 0
# the length of this episode
T = float('inf')
while True:
# go to next time step
time += 1
if time < T:
# take current action and go to the new state
new_position, new_velocity, reward = step(current_position, current_velocity, current_action)
# choose new action
new_action = get_action(new_position, new_velocity, value_function)
# track new state and action
positions.append(new_position)
velocities.append(new_velocity)
actions.append(new_action)
rewards.append(reward)
if new_position == POSITION_MAX:
T = time
# get the time of the state to update
update_time = time - n
if update_time >= 0:
returns = 0.0
# calculate corresponding rewards
for t in range(update_time + 1, min(T, update_time + n) + 1):
returns += rewards[t]
# add estimated state action value to the return
if update_time + n <= T:
returns += value_function.value(positions[update_time + n],
velocities[update_time + n],
actions[update_time + n])
# update the state value function
if positions[update_time] != POSITION_MAX:
value_function.learn(positions[update_time], velocities[update_time], actions[update_time], returns)
if update_time == T - 1:
break
current_position = new_position
current_velocity = new_velocity
current_action = new_action
return time
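# Illustrative sketch, not part of the original file.
# The update inside semi_gradient_n_step_sarsa builds an undiscounted n-step
# target: the rewards from update_time + 1 up to min(T, update_time + n), plus
# a bootstrapped action value when update_time + n does not run past the end of
# the episode. sketch_n_step_return is a hypothetical helper that isolates that
# computation; `bootstrap_value` stands in for the call to value_function.value.
def sketch_n_step_return(rewards, update_time, n, T, bootstrap_value=0.0):
    """rewards[t] is the reward received on entering time step t (rewards[0] is a placeholder)."""
    target = 0.0
    for t in range(update_time + 1, min(T, update_time + n) + 1):
        target += rewards[t]
    # bootstrap only if the n-step window ends no later than the terminal time
    if update_time + n <= T:
        target += bootstrap_value
    return target

# Example use with assumed values (every step costs -1):
# sketch_n_step_return([0.0, -1.0, -1.0, -1.0, -1.0], 0, 2, float('inf'), -10.0)
# gives -12.0, and with T = 4, update_time = 3 the window is cut off at T,
# so the result is -1.0 without bootstrapping.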
# plot the learned cost-to-go
def print_cost(value_function, episode, ax):
grid_size = 40
positions = np.linspace(POSITION_MIN, POSITION_MAX, grid_size)
# positionStep = (POSITION_MAX - POSITION_MIN) / grid_size
# positions = np.arange(POSITION_MIN, POSITION_MAX + positionStep, positionStep)
# velocityStep = (VELOCITY_MAX - VELOCITY_MIN) / grid_size
# velocities = np.arange(VELOCITY_MIN, VELOCITY_MAX + velocityStep, velocityStep)
velocities = np.linspace(VELOCITY_MIN, VELOCITY_MAX, grid_size)
axis_x = []
axis_y = []
axis_z = []
for position in positions:
for velocity in velocities:
axis_x.append(position)
axis_y.append(velocity)
axis_z.append(value_function.cost_to_go(position, velocity))
ax.scatter(axis_x, axis_y, axis_z)
ax.set_xlabel('Position')
ax.set_ylabel('Velocity')
ax.set_zlabel('Cost to go')
ax.set_title('Episode %d' % (episode + 1))
# Figure 10.1, cost to go in a single run
def figure_10_1():
episodes = 9000
plot_episodes = [0, 99, episodes - 1]
fig = plt.figure(figsize=(40, 10))
axes = [fig.add_subplot(1, len(plot_episodes), i+1, projection='3d') for i in range(len(plot_episodes))]
num_of_tilings = 8
alpha = 0.3
value_function = ValueFunction(alpha, num_of_tilings)
for ep in tqdm(range(episodes)):
semi_gradient_n_step_sarsa(value_function)
if ep in plot_episodes:
print_cost(value_function, ep, axes[plot_episodes.index(ep)])
plt.savefig('../images/figure_10_1.png')
plt.close()
# Figure 10.2, semi-gradient Sarsa with different alphas
def figure_10_2():
runs = 10
episodes = 500
num_of_tilings = 8
alphas = [0.1, 0.2, 0.5]
steps = np.zeros((len(alphas), episodes))
for run in range(runs):
value_functions = [ValueFunction(alpha, num_of_tilings) for alpha in alphas]
for index in range(len(value_functions)):
for episode in tqdm(range(episodes)):
step = semi_gradient_n_step_sarsa(value_functions[index])
steps[index, episode] += step
steps /= runs
for i in range(0, len(alphas)):
plt.plot(steps[i], label='alpha = '+str(alphas[i])+'/'+str(num_of_tilings))
plt.xlabel('Episode')
plt.ylabel('Steps per episode')
plt.yscale('log')
plt.legend()
plt.savefig('../images/figure_10_2.png')
plt.close()
# Figure 10.3, one-step semi-gradient Sarsa vs multi-step semi-gradient Sarsa
def figure_10_3():
runs = 10
episodes = 500
num_of_tilings = 8
alphas = [0.5, 0.3]
n_steps = [1, 8]
steps = np.zeros((len(alphas), episodes))
for run in range(runs):
value_functions = [ValueFunction(alpha, num_of_tilings) for alpha in alphas]
for index in range(len(value_functions)):
for episode in tqdm(range(episodes)):
step = semi_gradient_n_step_sarsa(value_functions[index], n_steps[index])
steps[index, episode] += step
steps /= runs
for i in range(0, len(alphas)):
plt.plot(steps[i], label='n = %d' % (n_steps[i]))
plt.xlabel('Episode')
plt.ylabel('Steps per episode')
plt.yscale('log')
plt.legend()
plt.savefig('../images/figure_10_3.png')
plt.close()
# Figure 10.4, effect of alpha and n on multi-step semi-gradient Sarsa
def figure_10_4():
alphas = np.arange(0.25, 1.75, 0.25)
n_steps = np.power(2, np.arange(0, 5))
episodes = 50
runs = 5
max_steps = 300
steps = np.zeros((len(n_steps), len(alphas)))
for run in range(runs):
for n_step_index, n_step in enumerate(n_steps):
for alpha_index, alpha in enumerate(alphas):
if (n_step == 8 and alpha > 1) or \
(n_step == 16 and alpha > 0.75):
# In these cases it won't converge, so ignore them
steps[n_step_index, alpha_index] += max_steps * episodes
continue
value_function = ValueFunction(alpha)
for episode in tqdm(range(episodes)):
step = semi_gradient_n_step_sarsa(value_function, n_step)
steps[n_step_index, alpha_index] += step
# average over independent runs and episodes
steps /= runs * episodes
for i in range(0, len(n_steps)):
plt.plot(alphas, steps[i, :], label='n = '+str(n_steps[i]))
plt.xlabel('alpha * number of tilings(8)')
plt.ylabel('Steps per episode')
plt.ylim([220, max_steps])
plt.legend()
plt.savefig('../images/figure_10_4.png')
plt.close()
if __name__ == '__main__':
figure_10_1()
figure_10_2()
figure_10_3()
figure_10_4()
================================================
FILE: chapter11/counterexample.py
================================================
#######################################################################
# Copyright (C) #
# 2016 - 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
from mpl_toolkits.mplot3d.axes3d import Axes3D
# all states: state 0-5 are upper states
STATES = np.arange(0, 7)
# state 6 is lower state
LOWER_STATE = 6
# discount factor
DISCOUNT = 0.99
# each state is represented by a vector of length 8
FEATURE_SIZE = 8
FEATURES = np.zeros((len(STATES), FEATURE_SIZE))
for i in range(LOWER_STATE):
FEATURES[i, i] = 2
FEATURES[i, 7] = 1
FEATURES[LOWER_STATE, 6] = 1
FEATURES[LOWER_STATE, 7] = 2
# all possible actions
DASHED = 0
SOLID = 1
ACTIONS = [DASHED, SOLID]
# reward is always zero
REWARD = 0
# take @action at @state, return the new state
def step(state, action):
if action == SOLID:
return LOWER_STATE
return np.random.choice(STATES[: LOWER_STATE])
# target policy
def target_policy(state):
return SOLID
# state distribution for the behavior policy
STATE_DISTRIBUTION = np.ones(len(STATES)) / 7
STATE_DISTRIBUTION_MAT = np.matrix(np.diag(STATE_DISTRIBUTION))
# projection matrix for minimizing MSVE
PROJECTION_MAT = np.matrix(FEATURES) * \
np.linalg.pinv(np.matrix(FEATURES.T) * STATE_DISTRIBUTION_MAT * np.matrix(FEATURES)) * \
np.matrix(FEATURES.T) * \
STATE_DISTRIBUTION_MAT
# behavior policy
BEHAVIOR_SOLID_PROBABILITY = 1.0 / 7
def behavior_policy(state):
if np.random.binomial(1, BEHAVIOR_SOLID_PROBABILITY) == 1:
return SOLID
return DASHED
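# Illustrative sketch, not part of the original file.
# The learning functions below hard-code the importance sampling ratio
# rho = pi(a|s) / b(a|s): the target policy always takes the solid action, so
# rho is 0 for a dashed action and 1 / BEHAVIOR_SOLID_PROBABILITY for a solid
# one. sketch_importance_ratio is a hypothetical helper that spells this out;
# its default arguments copy the constants defined above (SOLID = 1,
# BEHAVIOR_SOLID_PROBABILITY = 1/7).
def sketch_importance_ratio(action, solid=1, behavior_solid_probability=1.0 / 7):
    """Return pi(action) / b(action) for a target policy that always picks `solid`."""
    target_probability = 1.0 if action == solid else 0.0
    if action == solid:
        behavior_probability = behavior_solid_probability
    else:
        behavior_probability = 1.0 - behavior_solid_probability
    return target_probability / behavior_probability

# Example use: sketch_importance_ratio(1) is approximately 7,
# and sketch_importance_ratio(0) is 0.0.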
# Semi-gradient off-policy temporal difference
# @state: current state
# @theta: weight for each component of the feature vector
# @alpha: step size
# @return: next state
def semi_gradient_off_policy_TD(state, theta, alpha):
action = behavior_policy(state)
next_state = step(state, action)
# get the importance ratio
if action == DASHED:
rho = 0.0
else:
rho = 1.0 / BEHAVIOR_SOLID_PROBABILITY
delta = REWARD + DISCOUNT * np.dot(FEATURES[next_state, :], theta) - \
np.dot(FEATURES[state, :], theta)
delta *= rho * alpha
# derivatives happen to be the same matrix due to the linearity
theta += FEATURES[state, :] * delta
return next_state
# Semi-gradient DP
# @theta: weight for each component of the feature vector
# @alpha: step size
def semi_gradient_DP(theta, alpha):
delta = 0.0
# go through all the states
for state in STATES:
expected_return = 0.0
# compute bellman error for each state
for next_state in STATES:
if next_state == LOWER_STATE:
expected_return += REWARD + DISCOUNT * np.dot(theta, FEATURES[next_state, :])
bellmanError = expected_return - np.dot(theta, FEATURES[state, :])
# accumulate gradients
delta += bellmanError * FEATURES[state, :]
# derivatives happen to be the same matrix due to the linearity
theta += alpha / len(STATES) * delta
# temporal difference with gradient correction
# @state: current state
# @theta: weight of each component of the feature vector
# @weight: auxiliary weight vector for gradient correction
# @alpha: step size of @theta
# @beta: step size of @weight
def TDC(state, theta, weight, alpha, beta):
action = behavior_policy(state)
next_state = step(state, action)
# get the importance ratio
if action == DASHED:
rho = 0.0
else:
rho = 1.0 / BEHAVIOR_SOLID_PROBABILITY
delta = REWARD + DISCOUNT * np.dot(FEATURES[next_state, :], theta) - \
np.dot(FEATURES[state, :], theta)
theta += alpha * rho * (delta * FEATURES[state, :] - DISCOUNT * FEATURES[next_state, :] * np.dot(FEATURES[state, :], weight))
weight += beta * rho * (delta - np.dot(FEATURES[state, :], weight)) * FEATURES[state, :]
return next_state
# expected temporal difference with gradient correction
# @theta: weight of each component of the feature vector
# @weight: auxiliary weight vector for gradient correction
# @alpha: step size of @theta
# @beta: step size of @weight
def expected_TDC(theta, weight, alpha, beta):
for state in STATES:
# When computing expected update target, if next state is not lower state, importance ratio will be 0,
# so we can safely ignore this case and assume next state is always lower state
delta = REWARD + DISCOUNT * np.dot(FEATURES[LOWER_STATE, :], theta) - np.dot(FEATURES[state, :], theta)
rho = 1 / BEHAVIOR_SOLID_PROBABILITY
# Under behavior policy, state distribution is uniform, so the probability for each state is 1.0 / len(STATES)
expected_update_theta = 1.0 / len(STATES) * BEHAVIOR_SOLID_PROBABILITY * rho * (
delta * FEATURES[state, :] - DISCOUNT * FEATURES[LOWER_STATE, :] * np.dot(weight, FEATURES[state, :]))
theta += alpha * expected_update_theta
expected_update_weight = 1.0 / len(STATES) * BEHAVIOR_SOLID_PROBABILITY * rho * (
delta - np.dot(weight, FEATURES[state, :])) * FEATURES[state, :]
weight += beta * expected_update_weight
# if we *accumulated* the expected updates and applied them here, the update would be synchronous
# theta += alpha * expected_update_theta
# weight += beta * expected_update_weight
# interest is 1 for every state
INTEREST = 1
# expected update of ETD
# @theta: weight of each component of the feature vector
# @emphasis: current emphasis
# @alpha: step size of @theta
# @return: expected next emphasis
def expected_emphatic_TD(theta, emphasis, alpha):
# we perform synchronous update for both theta and emphasis
expected_update = 0
expected_next_emphasis = 0.0
# go through all the states
for state in STATES:
# compute rho(t-1)
if state == LOWER_STATE:
rho = 1.0 / BEHAVIOR_SOLID_PROBABILITY
else:
rho = 0
# update emphasis
next_emphasis = DISCOUNT * rho * emphasis + INTEREST
expected_next_emphasis += next_emphasis
# When computing expected update target, if next state is not lower state, importance ratio will be 0,
# so we can safely ignore this case and assume next state is always lower state
next_state = LOWER_STATE
delta = REWARD + DISCOUNT * np.dot(FEATURES[next_state, :], theta) - np.dot(FEATURES[state, :], theta)
expected_update += 1.0 / len(STATES) * BEHAVIOR_SOLID_PROBABILITY * next_emphasis * 1 / BEHAVIOR_SOLID_PROBABILITY * delta * FEATURES[state, :]
theta += alpha * expected_update
return expected_next_emphasis / len(STATES)
# compute RMSVE for a value function parameterized by @theta
# true value function is always 0 in this example
def compute_RMSVE(theta):
return np.sqrt(np.dot(np.power(np.dot(FEATURES, theta), 2), STATE_DISTRIBUTION))
# compute RMSPBE for a value function parameterized by @theta
# true value function is always 0 in this example
def compute_RMSPBE(theta):
bellman_error = np.zeros(len(STATES))
for state in STATES:
for next_state in STATES:
if next_state == LOWER_STATE:
bellman_error[state] += REWARD + DISCOUNT * np.dot(theta, FEATURES[next_state, :]) - np.dot(theta, FEATURES[state, :])
bellman_error = np.dot(np.asarray(PROJECTION_MAT), bellman_error)
return np.sqrt(np.dot(np.power(bellman_error, 2), STATE_DISTRIBUTION))
figureIndex = 0
# Figure 11.2(left), semi-gradient off-policy TD
def figure_11_2_left():
    # Initialize the theta
    theta = np.ones(FEATURE_SIZE)
    theta[6] = 10
    alpha = 0.01
    steps = 1000
    thetas = np.zeros((FEATURE_SIZE, steps))
    state = np.random.choice(STATES)
    for step in tqdm(range(steps)):
        state = semi_gradient_off_policy_TD(state, theta, alpha)
        thetas[:, step] = theta
    for i in range(FEATURE_SIZE):
        plt.plot(thetas[i, :], label='theta' + str(i + 1))
    plt.xlabel('Steps')
    plt.ylabel('Theta value')
    plt.title('semi-gradient off-policy TD')
    plt.legend()
# Figure 11.2(right), semi-gradient DP
def figure_11_2_right():
    # Initialize the theta
    theta = np.ones(FEATURE_SIZE)
    theta[6] = 10
    alpha = 0.01
    sweeps = 1000
    thetas = np.zeros((FEATURE_SIZE, sweeps))
    for sweep in tqdm(range(sweeps)):
        semi_gradient_DP(theta, alpha)
        thetas[:, sweep] = theta
    for i in range(FEATURE_SIZE):
        plt.plot(thetas[i, :], label='theta' + str(i + 1))
    plt.xlabel('Sweeps')
    plt.ylabel('Theta value')
    plt.title('semi-gradient DP')
    plt.legend()
def figure_11_2():
    plt.figure(figsize=(10, 20))
    plt.subplot(2, 1, 1)
    figure_11_2_left()
    plt.subplot(2, 1, 2)
    figure_11_2_right()
    plt.savefig('../images/figure_11_2.png')
    plt.close()
# Figure 11.6(left), temporal difference with gradient correction
def figure_11_6_left():
    # Initialize the theta
    theta = np.ones(FEATURE_SIZE)
    theta[6] = 10
    weight = np.zeros(FEATURE_SIZE)
    alpha = 0.005
    beta = 0.05
    steps = 1000
    thetas = np.zeros((FEATURE_SIZE, steps))
    RMSVE = np.zeros(steps)
    RMSPBE = np.zeros(steps)
    state = np.random.choice(STATES)
    for step in tqdm(range(steps)):
        state = TDC(state, theta, weight, alpha, beta)
        thetas[:, step] = theta
        RMSVE[step] = compute_RMSVE(theta)
        RMSPBE[step] = compute_RMSPBE(theta)
    for i in range(FEATURE_SIZE):
        plt.plot(thetas[i, :], label='theta' + str(i + 1))
    plt.plot(RMSVE, label='RMSVE')
    plt.plot(RMSPBE, label='RMSPBE')
    plt.xlabel('Steps')
    plt.title('TDC')
    plt.legend()
# Figure 11.6(right), expected temporal difference with gradient correction
def figure_11_6_right():
    # Initialize the theta
    theta = np.ones(FEATURE_SIZE)
    theta[6] = 10
    weight = np.zeros(FEATURE_SIZE)
    alpha = 0.005
    beta = 0.05
    sweeps = 1000
    thetas = np.zeros((FEATURE_SIZE, sweeps))
    RMSVE = np.zeros(sweeps)
    RMSPBE = np.zeros(sweeps)
    for sweep in tqdm(range(sweeps)):
        expected_TDC(theta, weight, alpha, beta)
        thetas[:, sweep] = theta
        RMSVE[sweep] = compute_RMSVE(theta)
        RMSPBE[sweep] = compute_RMSPBE(theta)
    for i in range(FEATURE_SIZE):
        plt.plot(thetas[i, :], label='theta' + str(i + 1))
    plt.plot(RMSVE, label='RMSVE')
    plt.plot(RMSPBE, label='RMSPBE')
    plt.xlabel('Sweeps')
    plt.title('Expected TDC')
    plt.legend()
def figure_11_6():
    plt.figure(figsize=(10, 20))
    plt.subplot(2, 1, 1)
    figure_11_6_left()
    plt.subplot(2, 1, 2)
    figure_11_6_right()
    plt.savefig('../images/figure_11_6.png')
    plt.close()
# Figure 11.7, expected ETD
def figure_11_7():
    # Initialize the theta
    theta = np.ones(FEATURE_SIZE)
    theta[6] = 10
    alpha = 0.03
    sweeps = 1000
    thetas = np.zeros((FEATURE_SIZE, sweeps))
    RMSVE = np.zeros(sweeps)
    emphasis = 0.0
    for sweep in tqdm(range(sweeps)):
        emphasis = expected_emphatic_TD(theta, emphasis, alpha)
        thetas[:, sweep] = theta
        RMSVE[sweep] = compute_RMSVE(theta)
    for i in range(FEATURE_SIZE):
        plt.plot(thetas[i, :], label='theta' + str(i + 1))
    plt.plot(RMSVE, label='RMSVE')
    plt.xlabel('Sweeps')
    plt.title('emphatic TD')
    plt.legend()
    plt.savefig('../images/figure_11_7.png')
    plt.close()
if __name__ == '__main__':
    figure_11_2()
    figure_11_6()
    figure_11_7()
================================================
FILE: chapter12/lambda_effect.py
================================================
#######################################################################
# Copyright (C) #
# 2021 Johann Huber (huber.joh@hotmail.fr) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
"""
Description:
    This script reproduces Figure 12.14 of Sutton and Barto's book. The example shows
    the effect of λ on 4 reinforcement learning tasks.
Credits:
    The "Cart and Pole" environment code is taken from the OpenAI Gym source code.
    Link : https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py#L7
    The tile coding software is taken from Sutton's website.
    Link : http://www.incompleteideas.net/tiles/tiles3.html
Remark:
    - The search for the optimal step-size parameters is omitted to keep the code from
      growing even longer. That kind of search has already been covered several times
      in the chapter.
Structure:
    1. Utils
        1.1. Tiling utils
        1.2. Eligibility traces utils
    2. Random walk
    3. Mountain Car
    4. Cart and Pole
    5. Results
        5.1. Getting plot data
        5.2. Reproducing figure 12.14
        5.3. Main
"""
import math
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
#############################################################################################
# 1. Utils #
#############################################################################################
#-------------------#
# 1.1. Tiling utils #
#-------------------#
# Credit : http://www.incompleteideas.net/tiles/tiles3.html
basehash = hash
class IHT:
    """Structure to handle collisions."""
    def __init__(self, sizeval):
        self.size = sizeval
        self.overfullCount = 0
        self.dictionary = {}

    def __str__(self):
        """Prepares a string for printing whenever this object is printed."""
        return "Collision table:" + \
               " size:" + str(self.size) + \
               " overfullCount:" + str(self.overfullCount) + \
               " dictionary:" + str(len(self.dictionary)) + " items"

    def count(self):
        return len(self.dictionary)

    def fullp(self):
        return len(self.dictionary) >= self.size

    def getindex(self, obj, readonly=False):
        d = self.dictionary
        if obj in d:
            return d[obj]
        elif readonly:
            return None
        size = self.size
        count = self.count()
        if count >= size:
            if self.overfullCount == 0: print('IHT full, starting to allow collisions')
            # Note: overfullCount is still 0 on the first overflow, so this assert
            # stops the run there instead of falling back to hashed indices.
            assert self.overfullCount != 0
            self.overfullCount += 1
            return basehash(obj) % self.size
        else:
            d[obj] = count
            return count
def hashcoords(coordinates, m, readonly=False):
    if type(m) == IHT: return m.getindex(tuple(coordinates), readonly)
    if type(m) == int: return basehash(tuple(coordinates)) % m
    if m is None: return coordinates
from math import floor, log
from itertools import zip_longest
def tiles(ihtORsize, numtilings, floats, ints=[], readonly=False):
    """Returns num-tilings tile indices corresponding to the floats and ints"""
    qfloats = [floor(f * numtilings) for f in floats]
    Tiles = []
    for tiling in range(numtilings):
        tilingX2 = tiling * 2
        coords = [tiling]
        b = tiling
        for q in qfloats:
            coords.append((q + b) // numtilings)
            b += tilingX2
        coords.extend(ints)
        Tiles.append(hashcoords(coords, ihtORsize, readonly))
    return Tiles
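# A small standalone sketch of the coordinate computation inside tiles(), leaving out
# the hashing step. The helper name tile_coords and the sample inputs are illustrative
# assumptions, not part of Sutton's tile coding software. It is meant to show that
# nearby inputs share most of their tiles while distant inputs share none.

```python
from math import floor


def tile_coords(numtilings, floats):
    """Return one coordinate tuple per tiling, mirroring the loop in tiles()."""
    qfloats = [floor(f * numtilings) for f in floats]
    result = []
    for tiling in range(numtilings):
        b = tiling                      # offset of this tiling along the first dimension
        coords = [tiling]
        for q in qfloats:
            coords.append((q + b) // numtilings)
            b += 2 * tiling             # displacement grows with each dimension
        result.append(tuple(coords))
    return result


a = tile_coords(8, [0.5, 0.5])
near = tile_coords(8, [0.7, 0.5])       # a fraction of a tile width away
far = tile_coords(8, [3.5, 0.5])        # three tile widths away
shared_near = len(set(a) & set(near))
shared_far = len(set(a) & set(far))
print(shared_near, shared_far)
```

# The number of shared tiles is what drives generalization: an update for one input
# also moves the estimate for every input that activates some of the same tiles.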
def tileswrap(ihtORsize, numtilings, floats, wrapwidths, ints=[], readonly=False):
    """Returns num-tilings tile indices corresponding to the floats and ints, wrapping some floats"""
    qfloats = [floor(f * numtilings) for f in floats]
    Tiles = []
    for tiling in range(numtilings):
        tilingX2 = tiling * 2
        coords = [tiling]
        b = tiling
        for q, width in zip_longest(qfloats, wrapwidths):
            c = (q + b % numtilings) // numtilings
            coords.append(c % width if width else c)
            b += tilingX2
        coords.extend(ints)
        Tiles.append(hashcoords(coords, ihtORsize, readonly))
    return Tiles
class IndexHashTable:
    def __init__(self, iht_size, num_tilings, tiling_size, obs_bounds):
        # Index Hash Table size
        self._iht = IHT(iht_size)
        # Number of tilings
        self._num_tilings = num_tilings
        # Tiling size
        self._tiling_size = tiling_size
        # Observation boundaries
        # (format : [[min_1, max_1], ..., [min_i, max_i], ... ] for i in state's components)
        self._obs_bounds = obs_bounds

    def get_tiles(self, state, action):
        """Get the encoded state_action using Sutton's grid tiling software."""
        # List of floats numbers to be tiled
        floats = [s * self._tiling_size/(obs_max - obs_min)
                  for (s, (obs_min, obs_max)) in zip(state, self._obs_bounds)]
        return tiles(self._iht, self._num_tilings, floats, [action])
#-------------------------------#
# 1.2. Eligibility traces utils #
#-------------------------------#
def update_trace_vector(agent, method, state, action=None):
    """Updates the agent's trace vector (z) with the current state (or state-action pair) according to the given method.
    Returns the updated vector."""
    assert method in ['replace', 'replace_reset', 'accumulating'], 'Invalid trace update method.'
    # Trace step
    z = agent._γ * agent._λ * agent._z
    # Update last observations components
    if action is not None:
        x_ids = agent.get_active_features(state, action)  # x(s,a)
    else:
        x_ids = agent.get_active_features(state)  # x(s)
    if method == 'replace_reset':
        for a in agent._all_actions:
            if a != action:
                x_ids2clear = agent.get_active_features(state, a)  # always x(s,a)
                for id_w in x_ids2clear:
                    z[id_w] = 0
    for id_w in x_ids:
        if (method == 'replace') or (method == 'replace_reset'):
            z[id_w] = 1
        elif method == 'accumulating':
            z[id_w] += 1
    return z
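# A minimal sketch contrasting the 'accumulating' and 'replace' branches of
# update_trace_vector on a toy trace. The helper decay_and_mark, the 4-component
# trace, and the chosen gamma and lambda values are assumptions made for
# illustration; the agent object and the 'replace_reset' branch are left out.

```python
import numpy as np


def decay_and_mark(z, active_ids, gamma, lmbda, method):
    """Decay the whole trace, then bump or reset the active components."""
    z = gamma * lmbda * z               # trace step (returns a new array)
    for i in active_ids:
        if method == 'accumulating':
            z[i] += 1                   # visits add up
        else:
            z[i] = 1                    # replacing traces are capped at 1
    return z


z_acc = np.zeros(4)
z_rep = np.zeros(4)
for _ in range(3):                      # visit feature 1 three times in a row
    z_acc = decay_and_mark(z_acc, [1], gamma=1.0, lmbda=0.5, method='accumulating')
    z_rep = decay_and_mark(z_rep, [1], gamma=1.0, lmbda=0.5, method='replace')
print(z_acc[1], z_rep[1])
```

# With repeated visits the accumulating trace climbs above 1 (here 1 + 0.5 + 0.25),
# while the replacing trace stays at 1, which bounds the size of a single update.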
#############################################################################################
# 2. Random walk #
################################################
SYMBOL INDEX (404 symbols across 26 files)
FILE: chapter01/tic_tac_toe.py
class State (line 19) | class State:
method __init__ (line 20) | def __init__(self):
method hash (line 31) | def hash(self):
method is_end (line 39) | def is_end(self):
method next_state (line 82) | def next_state(self, i, j, symbol):
method print_state (line 89) | def print_state(self):
function get_all_states_impl (line 105) | def get_all_states_impl(current_state, current_symbol, all_states):
function get_all_states (line 118) | def get_all_states():
class Judger (line 131) | class Judger:
method __init__ (line 134) | def __init__(self, player1, player2):
method reset (line 144) | def reset(self):
method alternate (line 148) | def alternate(self):
method play (line 154) | def play(self, print_state=False):
class Player (line 176) | class Player:
method __init__ (line 179) | def __init__(self, step_size=0.1, epsilon=0.1):
method reset (line 187) | def reset(self):
method set_state (line 191) | def set_state(self, state):
method set_symbol (line 195) | def set_symbol(self, symbol):
method backup (line 211) | def backup(self):
method act (line 222) | def act(self):
method save_policy (line 249) | def save_policy(self):
method load_policy (line 253) | def load_policy(self):
class HumanPlayer (line 263) | class HumanPlayer:
method __init__ (line 264) | def __init__(self, **kwargs):
method reset (line 269) | def reset(self):
method set_state (line 272) | def set_state(self, state):
method set_symbol (line 275) | def set_symbol(self, symbol):
method act (line 278) | def act(self):
function train (line 287) | def train(epochs, print_every_n=500):
function compete (line 308) | def compete(turns):
function play (line 328) | def play():
FILE: chapter02/ten_armed_testbed.py
class Bandit (line 19) | class Bandit:
method __init__ (line 28) | def __init__(self, k_arm=10, epsilon=0., initial=0., step_size=0.1, sa...
method reset (line 43) | def reset(self):
method act (line 58) | def act(self):
method step (line 77) | def step(self, action):
function simulate (line 101) | def simulate(runs, time, bandits):
function figure_2_1 (line 118) | def figure_2_1():
function figure_2_2 (line 126) | def figure_2_2(runs=2000, time=1000):
function figure_2_3 (line 151) | def figure_2_3(runs=2000, time=1000):
function figure_2_4 (line 167) | def figure_2_4(runs=2000, time=1000):
function figure_2_5 (line 183) | def figure_2_5(runs=2000, time=1000):
function figure_2_6 (line 205) | def figure_2_6(runs=2000, time=1000):
FILE: chapter03/grid_world.py
function step (line 34) | def step(state, action):
function draw_image (line 50) | def draw_image(image):
function draw_policy (line 84) | def draw_policy(optimal_values):
function figure_3_2 (line 127) | def figure_3_2():
function figure_3_2_linear_system (line 145) | def figure_3_2_linear_system():
function figure_3_5 (line 168) | def figure_3_5():
FILE: chapter04/car_rental.py
function poisson_probability (line 56) | def poisson_probability(n, lam):
function expected_return (line 64) | def expected_return(state, action, state_value, constant_returned_cars):
function figure_4_2 (line 124) | def figure_4_2(constant_returned_cars=True):
FILE: chapter04/car_rental_synchronous.py
function poisson (line 43) | def poisson(n, lam):
class PolicyIteration (line 51) | class PolicyIteration:
method __init__ (line 52) | def __init__(self, truncate, parallel_processes, delta=1e-2, gamma=0.9...
method solve (line 63) | def solve(self):
method policy_evaluation (line 83) | def policy_evaluation(self, values, policy):
method policy_improvement (line 107) | def policy_improvement(self, actions, values, policy):
method bellman (line 129) | def bellman(self, values, action, state):
method expected_return_pe (line 179) | def expected_return_pe(self, policy, values, state):
method expected_return_pi (line 186) | def expected_return_pi(self, values, action, state):
method plot (line 193) | def plot(self):
FILE: chapter04/gamblers_problem.py
function figure_4_3 (line 25) | def figure_4_3():
FILE: chapter04/grid_world.py
function is_terminal (line 25) | def is_terminal(state):
function step (line 30) | def step(state, action):
function draw_image (line 44) | def draw_image(image):
function compute_state_value (line 66) | def compute_state_value(in_place=True, discount=1.0):
function figure_4_1 (line 93) | def figure_4_1():
FILE: chapter05/blackjack.py
function target_policy_player (line 30) | def target_policy_player(usable_ace_player, player_sum, dealer_card):
function behavior_policy_player (line 34) | def behavior_policy_player(usable_ace_player, player_sum, dealer_card):
function get_card (line 47) | def get_card():
function card_value (line 53) | def card_value(card_id):
function play (line 60) | def play(policy_player, initial_state=None, initial_action=None):
function monte_carlo_on_policy (line 181) | def monte_carlo_on_policy(episodes):
function monte_carlo_es (line 202) | def monte_carlo_es(episodes):
function monte_carlo_off_policy (line 243) | def monte_carlo_off_policy(episodes):
function figure_5_1 (line 279) | def figure_5_1():
function figure_5_2 (line 307) | def figure_5_2():
function figure_5_3 (line 341) | def figure_5_3():
FILE: chapter05/infinite_variance.py
function behavior_policy (line 18) | def behavior_policy():
function target_policy (line 22) | def target_policy():
function play (line 26) | def play():
function figure_5_4 (line 37) | def figure_5_4():
FILE: chapter06/cliff_walking.py
function step (line 41) | def step(state, action):
function choose_action (line 85) | def choose_action(state, q_value):
function sarsa (line 97) | def sarsa(q_value, expected=False, step_size=ALPHA):
function q_learning (line 128) | def q_learning(q_value, step_size=ALPHA):
function print_optimal_policy (line 143) | def print_optimal_policy(q_value):
function figure_6_4 (line 167) | def figure_6_4():
function figure_6_6 (line 210) | def figure_6_6():
FILE: chapter06/maximization_bias.py
function choose_action (line 54) | def choose_action(state, q_value):
function take_action (line 62) | def take_action(state, action):
function q_learning (line 69) | def q_learning(q1, q2=None):
function figure_6_7 (line 103) | def figure_6_7():
FILE: chapter06/random_walk.py
function temporal_difference (line 36) | def temporal_difference(values, alpha=0.1, batch=False):
function monte_carlo (line 60) | def monte_carlo(values, alpha=0.1, batch=False):
function compute_state_value (line 86) | def compute_state_value():
function rms_error (line 100) | def rms_error():
function batch_updating (line 132) | def batch_updating(method, episodes, alpha=0.001):
function example_6_2 (line 170) | def example_6_2():
function figure_6_2 (line 182) | def figure_6_2():
FILE: chapter06/windy_grid_world.py
function step (line 42) | def step(state, action):
function episode (line 56) | def episode(q_value):
function figure_6_3 (line 88) | def figure_6_3():
FILE: chapter07/random_walk.py
function temporal_difference (line 40) | def temporal_difference(value, n, alpha):
function figure7_2 (line 98) | def figure7_2():
FILE: chapter08/expectation_vs_sample.py
function b_steps (line 15) | def b_steps(b):
function figure_8_7 (line 34) | def figure_8_7():
FILE: chapter08/maze.py
class PriorityQueue (line 17) | class PriorityQueue:
method __init__ (line 18) | def __init__(self):
method add_item (line 24) | def add_item(self, item, priority=0):
method remove_item (line 32) | def remove_item(self, item):
method pop_item (line 36) | def pop_item(self):
method empty (line 44) | def empty(self):
class Maze (line 50) | class Maze:
method __init__ (line 51) | def __init__(self):
method extend_state (line 94) | def extend_state(self, state, factor):
method extend_maze (line 104) | def extend_maze(self, factor):
method step (line 120) | def step(self, state, action):
class DynaParams (line 139) | class DynaParams:
method __init__ (line 140) | def __init__(self):
function choose_action (line 167) | def choose_action(state, q_value, maze, dyna_params):
class TrivialModel (line 175) | class TrivialModel:
method __init__ (line 177) | def __init__(self, rand=np.random):
method feed (line 182) | def feed(self, state, action, next_state, reward):
method sample (line 190) | def sample(self):
class TimeModel (line 201) | class TimeModel:
method __init__ (line 205) | def __init__(self, maze, time_weight=1e-4, rand=np.random):
method feed (line 216) | def feed(self, state, action, next_state, reward):
method sample (line 233) | def sample(self):
class PriorityModel (line 249) | class PriorityModel(TrivialModel):
method __init__ (line 250) | def __init__(self, rand=np.random):
method insert (line 258) | def insert(self, priority, state, action):
method empty (line 263) | def empty(self):
method sample (line 267) | def sample(self):
method feed (line 275) | def feed(self, state, action, next_state, reward):
method predecessor (line 284) | def predecessor(self, state):
function dyna_q (line 298) | def dyna_q(q_value, model, maze, dyna_params):
function prioritized_sweeping (line 340) | def prioritized_sweeping(q_value, model, maze, dyna_params):
function figure_8_2 (line 398) | def figure_8_2():
function changing_maze (line 434) | def changing_maze(maze, dyna_params):
function figure_8_4 (line 476) | def figure_8_4():
function figure_8_5 (line 517) | def figure_8_5():
function check_path (line 558) | def check_path(q_values, maze):
function example_8_4 (line 574) | def example_8_4():
FILE: chapter08/trajectory_sampling.py
function argmax (line 29) | def argmax(value):
class Task (line 34) | class Task:
method __init__ (line 38) | def __init__(self, n_states, b):
method step (line 49) | def step(self, state, action):
function evaluate_pi (line 58) | def evaluate_pi(q, task):
function uniform (line 75) | def uniform(task, eval_interval):
function on_policy (line 95) | def on_policy(task, eval_interval):
function figure_8_8 (line 122) | def figure_8_8():
FILE: chapter09/random_walk.py
function compute_true_value (line 35) | def compute_true_value():
function step (line 61) | def step(state, action):
function get_action (line 75) | def get_action():
class ValueFunction (line 81) | class ValueFunction:
method __init__ (line 83) | def __init__(self, num_of_groups):
method value (line 91) | def value(self, state):
method update (line 100) | def update(self, delta, state):
class TilingsValueFunction (line 105) | class TilingsValueFunction:
method __init__ (line 109) | def __init__(self, numOfTilings, tileWidth, tilingOffset):
method value (line 126) | def value(self, state):
method update (line 138) | def update(self, delta, state):
class BasesValueFunction (line 153) | class BasesValueFunction:
method __init__ (line 156) | def __init__(self, order, type):
method value (line 170) | def value(self, state):
method update (line 177) | def update(self, delta, state):
function gradient_monte_carlo (line 188) | def gradient_monte_carlo(value_function, alpha, distribution=None):
function semi_gradient_temporal_difference (line 211) | def semi_gradient_temporal_difference(value_function, n, alpha):
function figure_9_1 (line 261) | def figure_9_1(true_value):
function figure_9_2_left (line 293) | def figure_9_2_left(true_value):
function figure_9_2_right (line 308) | def figure_9_2_right(true_value):
function figure_9_2 (line 343) | def figure_9_2(true_value):
function figure_9_5 (line 354) | def figure_9_5(true_value):
function figure_9_10 (line 398) | def figure_9_10(true_value):
FILE: chapter09/square_wave.py
class Interval (line 17) | class Interval:
method __init__ (line 19) | def __init__(self, left, right):
method contain (line 24) | def contain(self, x):
method size (line 28) | def size(self):
function square_wave (line 35) | def square_wave(x):
function sample (line 41) | def sample(n):
class ValueFunction (line 50) | class ValueFunction:
method __init__ (line 53) | def __init__(self, feature_width, domain=DOMAIN, alpha=0.2, num_of_fea...
method get_active_features (line 73) | def get_active_features(self, x):
method value (line 81) | def value(self, x):
method update (line 87) | def update(self, delta, x):
function approximate (line 94) | def approximate(samples, value_function):
function figure_9_8 (line 100) | def figure_9_8():
FILE: chapter10/access_control.py
class IHT (line 25) | class IHT:
method __init__ (line 27) | def __init__(self, size_val):
method count (line 32) | def count(self):
method full (line 35) | def full(self):
method get_index (line 38) | def get_index(self, obj, read_only=False):
function hash_coords (line 54) | def hash_coords(coordinates, m, read_only=False):
function tiles (line 59) | def tiles(iht_or_size, num_tilings, floats, ints=None, read_only=False):
class ValueFunction (line 104) | class ValueFunction:
method __init__ (line 112) | def __init__(self, num_of_tilings, alpha=ALPHA, beta=BETA):
method get_active_tiles (line 130) | def get_active_tiles(self, free_servers, priority, action):
method value (line 137) | def value(self, free_servers, priority, action):
method state_value (line 142) | def state_value(self, free_servers, priority):
method learn (line 150) | def learn(self, free_servers, priority, action, new_free_servers, new_...
function get_action (line 161) | def get_action(free_servers, priority, value_function):
function take_action (line 171) | def take_action(free_servers, priority, action):
function differential_semi_gradient_sarsa (line 183) | def differential_semi_gradient_sarsa(value_function, max_steps):
function figure_10_5 (line 203) | def figure_10_5():
FILE: chapter10/mountain_car.py
class IHT (line 24) | class IHT:
method __init__ (line 26) | def __init__(self, size_val):
method count (line 31) | def count(self):
method full (line 34) | def full(self):
method get_index (line 37) | def get_index(self, obj, read_only=False):
function hash_coords (line 53) | def hash_coords(coordinates, m, read_only=False):
function tiles (line 58) | def tiles(iht_or_size, num_tilings, floats, ints=None, read_only=False):
function step (line 95) | def step(position, velocity, action):
class ValueFunction (line 106) | class ValueFunction:
method __init__ (line 113) | def __init__(self, step_size, num_of_tilings=8, max_size=2048):
method get_active_tiles (line 130) | def get_active_tiles(self, position, velocity, action):
method value (line 139) | def value(self, position, velocity, action):
method learn (line 146) | def learn(self, position, velocity, action, target):
method cost_to_go (line 154) | def cost_to_go(self, position, velocity):
function get_action (line 161) | def get_action(position, velocity, value_function):
function semi_gradient_n_step_sarsa (line 172) | def semi_gradient_n_step_sarsa(value_function, n=1):
function print_cost (line 234) | def print_cost(value_function, episode, ax):
function figure_10_1 (line 258) | def figure_10_1():
function figure_10_2 (line 275) | def figure_10_2():
function figure_10_3 (line 302) | def figure_10_3():
function figure_10_4 (line 330) | def figure_10_4():
FILE: chapter11/counterexample.py
function step (line 41) | def step(state, action):
function target_policy (line 47) | def target_policy(state):
function behavior_policy (line 61) | def behavior_policy(state):
function semi_gradient_off_policy_TD (line 71) | def semi_gradient_off_policy_TD(state, theta, alpha):
function semi_gradient_DP (line 89) | def semi_gradient_DP(theta, alpha):
function TDC (line 110) | def TDC(state, theta, weight, alpha, beta):
function expected_TDC (line 129) | def expected_TDC(theta, weight, alpha, beta):
function expected_emphatic_TD (line 155) | def expected_emphatic_TD(theta, emphasis, alpha):
function compute_RMSVE (line 179) | def compute_RMSVE(theta):
function compute_RMSPBE (line 184) | def compute_RMSPBE(theta):
function figure_11_2_left (line 196) | def figure_11_2_left():
function figure_11_2_right (line 218) | def figure_11_2_right():
function figure_11_2 (line 238) | def figure_11_2():
function figure_11_6_left (line 249) | def figure_11_6_left():
function figure_11_6_right (line 278) | def figure_11_6_right():
function figure_11_6 (line 305) | def figure_11_6():
function figure_11_7 (line 316) | def figure_11_7():
FILE: chapter12/lambda_effect.py
class IHT (line 60) | class IHT:
method __init__ (line 63) | def __init__(self, sizeval):
method __str__ (line 68) | def __str__(self):
method count (line 75) | def count(self):
method fullp (line 78) | def fullp(self):
method getindex (line 81) | def getindex(self, obj, readonly=False):
function hashcoords (line 98) | def hashcoords(coordinates, m, readonly=False):
function tiles (line 106) | def tiles(ihtORsize, numtilings, floats, ints=[], readonly=False):
function tileswrap (line 122) | def tileswrap(ihtORsize, numtilings, floats, wrapwidths, ints=[], readon...
class IndexHashTable (line 139) | class IndexHashTable:
method __init__ (line 141) | def __init__(self, iht_size, num_tilings, tiling_size, obs_bounds):
method get_tiles (line 153) | def get_tiles(self, state, action):
function update_trace_vector (line 167) | def update_trace_vector(agent, method, state, action=None):
class RandomWalkEnvironment (line 202) | class RandomWalkEnvironment:
method __init__ (line 204) | def __init__(self):
method step (line 214) | def step(self, state, action):
class RandomWalkAgent (line 220) | class RandomWalkAgent:
method __init__ (line 221) | def __init__(self, lmbda, alpha):
method error_hist (line 246) | def error_hist(self):
method get_all_v_hat (line 249) | def get_all_v_hat(self):
method policy (line 253) | def policy(self, state):
method v_hat (line 257) | def v_hat(self, state):
method grad_v_hat (line 264) | def grad_v_hat(self, state):
method get_active_features (line 270) | def get_active_features(self, state):
method run_td_lambda (line 274) | def run_td_lambda(self, env, n_episodes, method):
class RandomWalk (line 316) | class RandomWalk:
method __init__ (line 317) | def __init__(self, lmbda, alpha):
method error_hist (line 322) | def error_hist(self):
method train (line 325) | def train(self, n_episodes, method):
class MountainCarEnvironment (line 334) | class MountainCarEnvironment:
method __init__ (line 336) | def __init__(self):
method step (line 350) | def step(self, state, action):
class MountainCarAgent (line 368) | class MountainCarAgent:
method __init__ (line 369) | def __init__(self, alpha, lmbda, iht_args):
method n_step_hist (line 401) | def n_step_hist(self):
method policy (line 404) | def policy(self, state):
method get_init_state (line 415) | def get_init_state(self):
method is_terminal_state (line 421) | def is_terminal_state(self, state):
method q_hat (line 424) | def q_hat(self, state, action):
method get_active_features (line 433) | def get_active_features(self, state, action):
method run_sarsa_lambda (line 437) | def run_sarsa_lambda(self, env, n_episodes, method):
class MountainCar (line 507) | class MountainCar:
method __init__ (line 508) | def __init__(self, lmbda, alpha):
method n_step_hist (line 529) | def n_step_hist(self):
method train (line 532) | def train(self, n_episodes, method):
class CartPoleEnvironment (line 541) | class CartPoleEnvironment:
method __init__ (line 544) | def __init__(self):
method is_state_valid (line 563) | def is_state_valid(self, state):
method step (line 574) | def step(self, state, action):
class CartPoleAgent (line 603) | class CartPoleAgent:
method __init__ (line 604) | def __init__(self, iht_args, alpha, lmbda):
method n_failures (line 633) | def n_failures(self):
method policy (line 636) | def policy(self, state):
method is_state_valid (line 649) | def is_state_valid(self, state):
method get_init_state (line 657) | def get_init_state(self):
method is_state_over_bounds (line 662) | def is_state_over_bounds(self, state):
method q_hat (line 674) | def q_hat(self, state, action):
method get_active_features (line 682) | def get_active_features(self, state, action):
method run_sarsa_lambda (line 686) | def run_sarsa_lambda(self, env, n_step_max, method):
class CartPole (line 754) | class CartPole:
method __init__ (line 755) | def __init__(self, lmbda, alpha):
method n_failures (line 779) | def n_failures(self):
method train (line 782) | def train(self, n_step_max, method):
class PuddleWorldGrid (line 792) | class PuddleWorldGrid:
method __init__ (line 793) | def __init__(self):
method height (line 808) | def height(self):
method width (line 812) | def width(self):
method is_state_goal (line 815) | def is_state_goal(self, state):
method get_dist2puddle (line 821) | def get_dist2puddle(self, state):
method cvt_ij2xy (line 868) | def cvt_ij2xy(self, pos_ij):
method draw (line 871) | def draw(self):
class PuddleWorldEnvironment (line 907) | class PuddleWorldEnvironment:
method __init__ (line 908) | def __init__(self, grid):
method step (line 920) | def step(self, state, action):
class PuddleWorldAgent (line 937) | class PuddleWorldAgent:
method __init__ (line 938) | def __init__(self, grid, alpha, lmbda, iht_args):
method cost_per_ep_hist (line 961) | def cost_per_ep_hist(self):
method policy (line 964) | def policy(self, state):
method get_start_pos (line 978) | def get_start_pos(self):
method is_terminal_state (line 993) | def is_terminal_state(self, state):
method q_hat (line 996) | def q_hat(self, state, action):
method get_active_features (line 1005) | def get_active_features(self, state, action):
method run_sarsa_lambda (line 1009) | def run_sarsa_lambda(self, env, n_episodes, method):
class PuddleWorld (line 1058) | class PuddleWorld:
method __init__ (line 1059) | def __init__(self, lmbda, alpha):
method cost_per_ep_hist (line 1083) | def cost_per_ep_hist(self):
method draw (line 1086) | def draw(self):
method train (line 1089) | def train(self, n_episodes, method):
function get_puddle_world_map (line 1094) | def get_puddle_world_map():
function get_random_walk_plot_data (line 1109) | def get_random_walk_plot_data():
function get_mountain_car_plot_data (line 1145) | def get_mountain_car_plot_data():
function get_cart_pole_plot_data (line 1176) | def get_cart_pole_plot_data():
function get_puddle_world_plot_data (line 1206) | def get_puddle_world_plot_data():
function figure_12_14 (line 1243) | def figure_12_14():
FILE: chapter12/mountain_car.py
class IHT (line 22) | class IHT:
method __init__ (line 24) | def __init__(self, size_val):
method count (line 29) | def count(self):
method full (line 32) | def full(self):
method get_index (line 35) | def get_index(self, obj, read_only=False):
function hash_coords (line 51) | def hash_coords(coordinates, m, read_only=False):
function tiles (line 56) | def tiles(iht_or_size, num_tilings, floats, ints=None, read_only=False):
function step (line 99) | def step(position, velocity, action):
function accumulating_trace (line 114) | def accumulating_trace(trace, active_tiles, lam):
function replacing_trace (line 124) | def replacing_trace(trace, activeTiles, lam):
function replacing_trace_with_clearing (line 136) | def replacing_trace_with_clearing(trace, active_tiles, lam, clearing_til...
function dutch_trace (line 149) | def dutch_trace(trace, active_tiles, lam, alpha):
class Sarsa (line 156) | class Sarsa:
method __init__ (line 163) | def __init__(self, step_size, lam, trace_update=accumulating_trace, nu...
method get_active_tiles (line 185) | def get_active_tiles(self, position, velocity, action):
method value (line 194) | def value(self, position, velocity, action):
method learn (line 201) | def learn(self, position, velocity, action, target):
method cost_to_go (line 220) | def cost_to_go(self, position, velocity):
function get_action (line 227) | def get_action(position, velocity, valueFunction):
function play (line 237) | def play(evaluator):
function figure_12_10 (line 259) | def figure_12_10():
function figure_12_11 (line 292) | def figure_12_11():
FILE: chapter12/random_walk.py
class ValueFunction (line 36) | class ValueFunction:
method __init__ (line 39) | def __init__(self, rate, step_size):
method value (line 45) | def value(self, state):
method learn (line 50) | def learn(self, state, reward):
method new_episode (line 56) | def new_episode(self):
class OffLineLambdaReturn (line 60) | class OffLineLambdaReturn(ValueFunction):
method __init__ (line 61) | def __init__(self, rate, step_size):
method new_episode (line 66) | def new_episode(self):
method learn (line 72) | def learn(self, state, reward):
method n_step_return_from_time (line 82) | def n_step_return_from_time(self, n, time):
method lambda_return_from_time (line 92) | def lambda_return_from_time(self, time):
method off_line_learn (line 107) | def off_line_learn(self):
class TemporalDifferenceLambda (line 116) | class TemporalDifferenceLambda(ValueFunction):
method __init__ (line 117) | def __init__(self, rate, step_size):
method new_episode (line 121) | def new_episode(self):
method learn (line 127) | def learn(self, state, reward):
class TrueOnlineTemporalDifferenceLambda (line 137) | class TrueOnlineTemporalDifferenceLambda(ValueFunction):
method __init__ (line 138) | def __init__(self, rate, step_size):
method new_episode (line 141) | def new_episode(self):
method learn (line 149) | def learn(self, state, reward):
function random_walk (line 163) | def random_walk(value_function):
function parameter_sweep (line 182) | def parameter_sweep(value_function_generator, runs, lambdas, alphas):
function figure_12_3 (line 207) | def figure_12_3():
function figure_12_6 (line 223) | def figure_12_6():
function figure_12_8 (line 239) | def figure_12_8():
FILE: chapter13/short_corridor.py
function true_value (line 15) | def true_value(p):
class ShortCorridor (line 26) | class ShortCorridor:
method __init__ (line 30) | def __init__(self):
method reset (line 33) | def reset(self):
method step (line 36) | def step(self, go_right):
function softmax (line 60) | def softmax(x):
class ReinforceAgent (line 64) | class ReinforceAgent:
method __init__ (line 69) | def __init__(self, alpha, gamma):
method get_pi (line 80) | def get_pi(self):
method get_p_right (line 95) | def get_p_right(self):
method choose_action (line 98) | def choose_action(self, reward):
method episode_end (line 108) | def episode_end(self, last_reward):
class ReinforceBaselineAgent (line 132) | class ReinforceBaselineAgent(ReinforceAgent):
method __init__ (line 133) | def __init__(self, alpha, gamma, alpha_w):
method episode_end (line 138) | def episode_end(self, last_reward):
function trial (line 164) | def trial(num_episodes, agent_generator):
function example_13_1 (line 187) | def example_13_1():
function figure_13_1 (line 215) | def figure_13_1():
function figure_13_2 (line 243) | def figure_13_2():
Condensed preview: 31 files, each showing path, character count, and a content snippet (276K chars of full structured content).
[
{
"path": ".gitignore",
"chars": 49,
"preview": ".idea\n*.pyc\nlatex\n*.bin\nextra\n.DS_Store\n.vscode/\n"
},
{
"path": ".travis.yml",
"chars": 148,
"preview": "language: python\npython:\n - \"3.6\"\ninstall:\n - pip install -r requirements.txt\nscript:\n - ls chapter*/*.py | xargs -n "
},
{
"path": "LICENSE",
"chars": 1072,
"preview": "MIT License\n\nCopyright (c) 2019 Shangtong Zhang\n\nPermission is hereby granted, free of charge, to any person obtaining a"
},
{
"path": "README.md",
"chars": 10485,
"preview": "# Reinforcement Learning: An Introduction\n\n[ "
},
{
"path": "chapter02/ten_armed_testbed.py",
"chars": 9105,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter03/grid_world.py",
"chars": 6304,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter04/car_rental.py",
"chars": 7647,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter04/car_rental_synchronous.py",
"chars": 8746,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter04/gamblers_problem.py",
"chars": 2677,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter04/grid_world.py",
"chars": 3331,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter05/blackjack.py",
"chars": 13474,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter05/infinite_variance.py",
"chars": 1814,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter06/cliff_walking.py",
"chars": 9355,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter06/maximization_bias.py",
"chars": 4269,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter06/random_walk.py",
"chars": 6830,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter06/windy_grid_world.py",
"chars": 4018,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter07/random_walk.py",
"chars": 4222,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter08/expectation_vs_sample.py",
"chars": 1627,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter08/maze.py",
"chars": 23222,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter08/trajectory_sampling.py",
"chars": 4918,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter09/random_walk.py",
"chars": 15941,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter09/square_wave.py",
"chars": 4262,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter10/access_control.py",
"chars": 9605,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter10/mountain_car.py",
"chars": 13682,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter11/counterexample.py",
"chars": 11839,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter12/lambda_effect.py",
"chars": 46460,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter12/mountain_car.py",
"chars": 12140,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter12/random_walk.py",
"chars": 9637,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "chapter13/short_corridor.py",
"chars": 8401,
"preview": "#######################################################################\n# Copyright (C) "
},
{
"path": "requirements.txt",
"chars": 36,
"preview": "numpy\nmatplotlib\nseaborn\ntqdm\nscipy\n"
}
]
About this extraction
This page contains the full source code of the ShangtongZhang/reinforcement-learning-an-introduction GitHub repository, extracted and formatted as plain text. The extraction includes 31 files (260.1 KB), approximately 66.9k tokens, and a symbol index with 404 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract, built by Nikandr Surkov.