Repository: ShangtongZhang/reinforcement-learning-an-introduction Branch: master Commit: 96bc203617a7 Files: 31 Total size: 260.1 KB Directory structure: gitextract_m4ci91yn/ ├── .gitignore ├── .travis.yml ├── LICENSE ├── README.md ├── chapter01/ │ └── tic_tac_toe.py ├── chapter02/ │ └── ten_armed_testbed.py ├── chapter03/ │ └── grid_world.py ├── chapter04/ │ ├── car_rental.py │ ├── car_rental_synchronous.py │ ├── gamblers_problem.py │ └── grid_world.py ├── chapter05/ │ ├── blackjack.py │ └── infinite_variance.py ├── chapter06/ │ ├── cliff_walking.py │ ├── maximization_bias.py │ ├── random_walk.py │ └── windy_grid_world.py ├── chapter07/ │ └── random_walk.py ├── chapter08/ │ ├── expectation_vs_sample.py │ ├── maze.py │ └── trajectory_sampling.py ├── chapter09/ │ ├── random_walk.py │ └── square_wave.py ├── chapter10/ │ ├── access_control.py │ └── mountain_car.py ├── chapter11/ │ └── counterexample.py ├── chapter12/ │ ├── lambda_effect.py │ ├── mountain_car.py │ └── random_walk.py ├── chapter13/ │ └── short_corridor.py └── requirements.txt ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ .idea *.pyc latex *.bin extra .DS_Store .vscode/ ================================================ FILE: .travis.yml ================================================ language: python python: - "3.6" install: - pip install -r requirements.txt script: - ls chapter*/*.py | xargs -n 1 -P 1 python -m py_compile ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2019 Shangtong Zhang Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, 
modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # Reinforcement Learning: An Introduction [![Build Status](https://travis-ci.org/ShangtongZhang/reinforcement-learning-an-introduction.svg?branch=master)](https://travis-ci.org/ShangtongZhang/reinforcement-learning-an-introduction) Python replication for Sutton & Barto's book [*Reinforcement Learning: An Introduction (2nd Edition)*](http://incompleteideas.net/book/the-book-2nd.html) > If you have any confusion about the code or want to report a bug, please open an issue instead of emailing me directly, and unfortunately I do not have exercise answers for the book. # Contents ### Chapter 1 1. Tic-Tac-Toe ### Chapter 2 1. [Figure 2.1: An exemplary bandit problem from the 10-armed testbed](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_1.png) 2. [Figure 2.2: Average performance of epsilon-greedy action-value methods on the 10-armed testbed](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_2.png) 3. 
[Figure 2.3: Optimistic initial action-value estimates](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_3.png) 4. [Figure 2.4: Average performance of UCB action selection on the 10-armed testbed](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_4.png) 5. [Figure 2.5: Average performance of the gradient bandit algorithm](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_5.png) 6. [Figure 2.6: A parameter study of the various bandit algorithms](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_6.png) ### Chapter 3 1. [Figure 3.2: Grid example with random policy](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_3_2.png) 2. [Figure 3.5: Optimal solutions to the gridworld example](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_3_5.png) ### Chapter 4 1. [Figure 4.1: Convergence of iterative policy evaluation on a small gridworld](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_4_1.png) 2. [Figure 4.2: Jack’s car rental problem](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_4_2.png) 3. [Figure 4.3: The solution to the gambler’s problem](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_4_3.png) ### Chapter 5 1. [Figure 5.1: Approximate state-value functions for the blackjack policy](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_1.png) 2. 
[Figure 5.2: The optimal policy and state-value function for blackjack found by Monte Carlo ES](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_2.png) 3. [Figure 5.3: Weighted importance sampling](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_3.png) 4. [Figure 5.4: Ordinary importance sampling with surprisingly unstable estimates](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_4.png) ### Chapter 6 1. [Example 6.2: Random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/example_6_2.png) 2. [Figure 6.2: Batch updating](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_2.png) 3. [Figure 6.3: Sarsa applied to windy grid world](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_3.png) 4. [Figure 6.4: The cliff-walking task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_4.png) 5. [Figure 6.6: Interim and asymptotic performance of TD control methods](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_6.png) 6. [Figure 6.7: Comparison of Q-learning and Double Q-learning](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_7.png) ### Chapter 7 1. [Figure 7.2: Performance of n-step TD methods on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_7_2.png) ### Chapter 8 1. 
[Figure 8.2: Average learning curves for Dyna-Q agents varying in their number of planning steps](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_2.png) 2. [Figure 8.4: Average performance of Dyna agents on a blocking task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_4.png) 3. [Figure 8.5: Average performance of Dyna agents on a shortcut task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_5.png) 4. [Example 8.4: Prioritized sweeping significantly shortens learning time on the Dyna maze task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/example_8_4.png) 5. [Figure 8.7: Comparison of efficiency of expected and sample updates](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_7.png) 6. [Figure 8.8: Relative efficiency of different update distributions](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_8.png) ### Chapter 9 1. [Figure 9.1: Gradient Monte Carlo algorithm on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_1.png) 2. [Figure 9.2: Semi-gradient n-steps TD algorithm on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_2.png) 3. [Figure 9.5: Fourier basis vs polynomials on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_5.png) 4. 
[Figure 9.8: Example of feature width’s effect on initial generalization and asymptotic accuracy](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_8.png) 5. [Figure 9.10: Single tiling and multiple tilings on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_10.png) ### Chapter 10 1. [Figure 10.1: The cost-to-go function for Mountain Car task in one run](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_1.png) 2. [Figure 10.2: Learning curves for semi-gradient Sarsa on Mountain Car task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_2.png) 3. [Figure 10.3: One-step vs multi-step performance of semi-gradient Sarsa on the Mountain Car task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_3.png) 4. [Figure 10.4: Effect of the alpha and n on early performance of n-step semi-gradient Sarsa](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_4.png) 5. [Figure 10.5: Differential semi-gradient Sarsa on the access-control queuing task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_5.png) ### Chapter 11 1. [Figure 11.2: Baird's Counterexample](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_11_2.png) 2. [Figure 11.6: The behavior of the TDC algorithm on Baird’s counterexample](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_11_6.png) 3. 
[Figure 11.7: The behavior of the ETD algorithm in expectation on Baird’s counterexample](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_11_7.png) ### Chapter 12 1. [Figure 12.3: Off-line λ-return algorithm on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_3.png) 2. [Figure 12.6: TD(λ) algorithm on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_6.png) 3. [Figure 12.8: True online TD(λ) algorithm on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_8.png) 4. [Figure 12.10: Sarsa(λ) with replacing traces on Mountain Car](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_10.png) 5. [Figure 12.11: Summary comparison of Sarsa(λ) algorithms on Mountain Car](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_11.png) ### Chapter 13 1. [Example 13.1: Short corridor with switched actions](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/example_13_1.png) 2. [Figure 13.1: REINFORCE on the short-corridor grid world](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_13_1.png) 3. 
[Figure 13.2: REINFORCE with baseline on the short-corridor grid-world](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_13_2.png) # Environment * python 3.6 * numpy * matplotlib * [seaborn](https://seaborn.pydata.org/index.html) * [tqdm](https://pypi.org/project/tqdm/) # Usage > All files are self-contained ```commandline python any_file_you_want.py ``` # Contribution If you want to contribute some missing examples or fix some bugs, feel free to open an issue or make a pull request. ================================================ FILE: chapter01/tic_tac_toe.py ================================================ ####################################################################### # Copyright (C) # # 2016 - 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) # # 2016 Jan Hakenberg(jan.hakenberg@gmail.com) # # 2016 Tian Jun(tianjun.cpp@gmail.com) # # 2016 Kenta Shimada(hyperkentakun@gmail.com) # # Permission given to modify the code as long as you keep this # # declaration at the top # ####################################################################### import numpy as np import pickle BOARD_ROWS = 3 BOARD_COLS = 3 BOARD_SIZE = BOARD_ROWS * BOARD_COLS class State: def __init__(self): # the board is represented by an n * n array, # 1 represents a chessman of the player who moves first, # -1 represents a chessman of another player # 0 represents an empty position self.data = np.zeros((BOARD_ROWS, BOARD_COLS)) self.winner = None self.hash_val = None self.end = None # compute the hash value for one state, it's unique def hash(self): if self.hash_val is None: self.hash_val = 0 for i in np.nditer(self.data): self.hash_val = self.hash_val * 3 + i + 1 return self.hash_val # check whether a player has won the game, or it's a tie def is_end(self): if self.end is not None: return self.end results = [] # check row for i in range(BOARD_ROWS): results.append(np.sum(self.data[i, :])) # check columns for i in 
range(BOARD_COLS): results.append(np.sum(self.data[:, i])) # check diagonals trace = 0 reverse_trace = 0 for i in range(BOARD_ROWS): trace += self.data[i, i] reverse_trace += self.data[i, BOARD_ROWS - 1 - i] results.append(trace) results.append(reverse_trace) for result in results: if result == 3: self.winner = 1 self.end = True return self.end if result == -3: self.winner = -1 self.end = True return self.end # whether it's a tie sum_values = np.sum(np.abs(self.data)) if sum_values == BOARD_SIZE: self.winner = 0 self.end = True return self.end # game is still going on self.end = False return self.end # @symbol: 1 or -1 # put chessman symbol in position (i, j) def next_state(self, i, j, symbol): new_state = State() new_state.data = np.copy(self.data) new_state.data[i, j] = symbol return new_state # print the board def print_state(self): for i in range(BOARD_ROWS): print('-------------') out = '| ' for j in range(BOARD_COLS): if self.data[i, j] == 1: token = '*' elif self.data[i, j] == -1: token = 'x' else: token = '0' out += token + ' | ' print(out) print('-------------') def get_all_states_impl(current_state, current_symbol, all_states): for i in range(BOARD_ROWS): for j in range(BOARD_COLS): if current_state.data[i][j] == 0: new_state = current_state.next_state(i, j, current_symbol) new_hash = new_state.hash() if new_hash not in all_states: is_end = new_state.is_end() all_states[new_hash] = (new_state, is_end) if not is_end: get_all_states_impl(new_state, -current_symbol, all_states) def get_all_states(): current_symbol = 1 current_state = State() all_states = dict() all_states[current_state.hash()] = (current_state, current_state.is_end()) get_all_states_impl(current_state, current_symbol, all_states) return all_states # all possible board configurations all_states = get_all_states() class Judger: # @player1: the player who will move first, its chessman will be 1 # @player2: another player with a chessman -1 def __init__(self, player1, player2): self.p1 = player1 
self.p2 = player2 self.current_player = None self.p1_symbol = 1 self.p2_symbol = -1 self.p1.set_symbol(self.p1_symbol) self.p2.set_symbol(self.p2_symbol) self.current_state = State() def reset(self): self.p1.reset() self.p2.reset() def alternate(self): while True: yield self.p1 yield self.p2 # @print_state: if True, print each board during the game def play(self, print_state=False): alternator = self.alternate() self.reset() current_state = State() self.p1.set_state(current_state) self.p2.set_state(current_state) if print_state: current_state.print_state() while True: player = next(alternator) i, j, symbol = player.act() next_state_hash = current_state.next_state(i, j, symbol).hash() current_state, is_end = all_states[next_state_hash] self.p1.set_state(current_state) self.p2.set_state(current_state) if print_state: current_state.print_state() if is_end: return current_state.winner # AI player class Player: # @step_size: the step size to update estimations # @epsilon: the probability to explore def __init__(self, step_size=0.1, epsilon=0.1): self.estimations = dict() self.step_size = step_size self.epsilon = epsilon self.states = [] self.greedy = [] self.symbol = 0 def reset(self): self.states = [] self.greedy = [] def set_state(self, state): self.states.append(state) self.greedy.append(True) def set_symbol(self, symbol): self.symbol = symbol for hash_val in all_states: state, is_end = all_states[hash_val] if is_end: if state.winner == self.symbol: self.estimations[hash_val] = 1.0 elif state.winner == 0: # we need to distinguish between a tie and a lose self.estimations[hash_val] = 0.5 else: self.estimations[hash_val] = 0 else: self.estimations[hash_val] = 0.5 # update value estimation def backup(self): states = [state.hash() for state in self.states] for i in reversed(range(len(states) - 1)): state = states[i] td_error = self.greedy[i] * ( self.estimations[states[i + 1]] - self.estimations[state] ) self.estimations[state] += self.step_size * td_error # choose an 
action based on the state
    def act(self):
        state = self.states[-1]
        next_states = []
        next_positions = []
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    next_positions.append([i, j])
                    next_states.append(state.next_state(
                        i, j, self.symbol).hash())

        if np.random.rand() < self.epsilon:
            action = next_positions[np.random.randint(len(next_positions))]
            action.append(self.symbol)
            self.greedy[-1] = False
            return action

        values = []
        for hash_val, pos in zip(next_states, next_positions):
            values.append((self.estimations[hash_val], pos))
        # shuffle before sorting so that ties are broken at random:
        # Python's sort is stable, so equal-valued actions keep their shuffled order
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        action = values[0][1]
        action.append(self.symbol)
        return action

    def save_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:
            pickle.dump(self.estimations, f)

    def load_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:
            self.estimations = pickle.load(f)


# human interface
# input a letter to place a chessman
# | q | w | e |
# | a | s | d |
# | z | x | c |
class HumanPlayer:
    def __init__(self, **kwargs):
        self.symbol = None
        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']
        self.state = None

    def reset(self):
        pass

    def set_state(self, state):
        self.state = state

    def set_symbol(self, symbol):
        self.symbol = symbol

    def act(self):
        self.state.print_state()
        key = input("Input your position:")
        data = self.keys.index(key)
        i = data // BOARD_COLS
        j = data % BOARD_COLS
        return i, j, self.symbol


def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: 
%.02f' % (i, player1_win / i, player2_win / i)) player1.backup() player2.backup() judger.reset() player1.save_policy() player2.save_policy() def compete(turns): player1 = Player(epsilon=0) player2 = Player(epsilon=0) judger = Judger(player1, player2) player1.load_policy() player2.load_policy() player1_win = 0.0 player2_win = 0.0 for _ in range(turns): winner = judger.play() if winner == 1: player1_win += 1 if winner == -1: player2_win += 1 judger.reset() print('%d turns, player 1 win %.02f, player 2 win %.02f' % (turns, player1_win / turns, player2_win / turns)) # The game is a zero sum game. If both players are playing with an optimal strategy, every game will end in a tie. # So we test whether the AI can guarantee at least a tie if it goes second. def play(): while True: player1 = HumanPlayer() player2 = Player(epsilon=0) judger = Judger(player1, player2) player2.load_policy() winner = judger.play() if winner == player2.symbol: print("You lose!") elif winner == player1.symbol: print("You win!") else: print("It is a tie!") if __name__ == '__main__': train(int(1e5)) compete(int(1e3)) play() ================================================ FILE: chapter02/ten_armed_testbed.py ================================================ ####################################################################### # Copyright (C) # # 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) # # 2016 Tian Jun(tianjun.cpp@gmail.com) # # 2016 Artem Oboturov(oboturov@gmail.com) # # 2016 Kenta Shimada(hyperkentakun@gmail.com) # # Permission given to modify the code as long as you keep this # # declaration at the top # ####################################################################### import matplotlib import matplotlib.pyplot as plt import numpy as np from tqdm import trange matplotlib.use('Agg') class Bandit: # @k_arm: # of arms # @epsilon: probability for exploration in epsilon-greedy algorithm # @initial: initial estimation for each action # @step_size: constant step size for updating 
estimations # @sample_averages: if True, use sample averages to update estimations instead of constant step size # @UCB_param: if not None, use UCB algorithm to select action # @gradient: if True, use gradient based bandit algorithm # @gradient_baseline: if True, use average reward as baseline for gradient based bandit algorithm def __init__(self, k_arm=10, epsilon=0., initial=0., step_size=0.1, sample_averages=False, UCB_param=None, gradient=False, gradient_baseline=False, true_reward=0.): self.k = k_arm self.step_size = step_size self.sample_averages = sample_averages self.indices = np.arange(self.k) self.time = 0 self.UCB_param = UCB_param self.gradient = gradient self.gradient_baseline = gradient_baseline self.average_reward = 0 self.true_reward = true_reward self.epsilon = epsilon self.initial = initial def reset(self): # real reward for each action self.q_true = np.random.randn(self.k) + self.true_reward # estimation for each action self.q_estimation = np.zeros(self.k) + self.initial # # of chosen times for each action self.action_count = np.zeros(self.k) self.best_action = np.argmax(self.q_true) self.time = 0 # get an action for this bandit def act(self): if np.random.rand() < self.epsilon: return np.random.choice(self.indices) if self.UCB_param is not None: UCB_estimation = self.q_estimation + \ self.UCB_param * np.sqrt(np.log(self.time + 1) / (self.action_count + 1e-5)) q_best = np.max(UCB_estimation) return np.random.choice(np.where(UCB_estimation == q_best)[0]) if self.gradient: exp_est = np.exp(self.q_estimation) self.action_prob = exp_est / np.sum(exp_est) return np.random.choice(self.indices, p=self.action_prob) q_best = np.max(self.q_estimation) return np.random.choice(np.where(self.q_estimation == q_best)[0]) # take an action, update estimation for this action def step(self, action): # generate the reward under N(real reward, 1) reward = np.random.randn() + self.q_true[action] self.time += 1 self.action_count[action] += 1 self.average_reward += 
(reward - self.average_reward) / self.time

        if self.sample_averages:
            # update estimation using sample averages
            self.q_estimation[action] += (reward - self.q_estimation[action]) / self.action_count[action]
        elif self.gradient:
            one_hot = np.zeros(self.k)
            one_hot[action] = 1
            if self.gradient_baseline:
                baseline = self.average_reward
            else:
                baseline = 0
            self.q_estimation += self.step_size * (reward - baseline) * (one_hot - self.action_prob)
        else:
            # update estimation with constant step size
            self.q_estimation[action] += self.step_size * (reward - self.q_estimation[action])
        return reward


def simulate(runs, time, bandits):
    rewards = np.zeros((len(bandits), runs, time))
    best_action_counts = np.zeros(rewards.shape)
    for i, bandit in enumerate(bandits):
        for r in trange(runs):
            bandit.reset()
            for t in range(time):
                action = bandit.act()
                reward = bandit.step(action)
                rewards[i, r, t] = reward
                if action == bandit.best_action:
                    best_action_counts[i, r, t] = 1
    mean_best_action_counts = best_action_counts.mean(axis=1)
    mean_rewards = rewards.mean(axis=1)
    return mean_best_action_counts, mean_rewards


def figure_2_1():
    plt.violinplot(dataset=np.random.randn(200, 10) + np.random.randn(10))
    plt.xlabel("Action")
    plt.ylabel("Reward distribution")
    plt.savefig('../images/figure_2_1.png')
    plt.close()


def figure_2_2(runs=2000, time=1000):
    epsilons = [0, 0.1, 0.01]
    bandits = [Bandit(epsilon=eps, sample_averages=True) for eps in epsilons]
    best_action_counts, rewards = simulate(runs, time, bandits)

    plt.figure(figsize=(10, 20))

    plt.subplot(2, 1, 1)
    # use a separate loop variable so the `rewards` array is not shadowed
    for eps, eps_rewards in zip(epsilons, rewards):
        # raw string: '\e' is not a valid escape sequence in a normal string
        plt.plot(eps_rewards, label=r'$\epsilon = %.02f$' % (eps))
    plt.xlabel('steps')
    plt.ylabel('average reward')
    plt.legend()

    plt.subplot(2, 1, 2)
    for eps, counts in zip(epsilons, best_action_counts):
        plt.plot(counts, label=r'$\epsilon = %.02f$' % (eps))
    plt.xlabel('steps')
    plt.ylabel('% optimal action')
    plt.legend()

    plt.savefig('../images/figure_2_2.png')
    plt.close()


def figure_2_3(runs=2000, time=1000):
    bandits = []
    bandits.append(Bandit(epsilon=0, initial=5, step_size=0.1))
    bandits.append(Bandit(epsilon=0.1, initial=0, step_size=0.1))
    best_action_counts, _ = simulate(runs, time, bandits)

    # raw strings: '\e' is not a valid escape sequence in a normal string
    plt.plot(best_action_counts[0], label=r'$\epsilon = 0, q = 5$')
    plt.plot(best_action_counts[1], label=r'$\epsilon = 0.1, q = 0$')
    plt.xlabel('Steps')
    plt.ylabel('% optimal action')
    plt.legend()

    plt.savefig('../images/figure_2_3.png')
    plt.close()


def figure_2_4(runs=2000, time=1000):
    bandits = []
    bandits.append(Bandit(epsilon=0, UCB_param=2, sample_averages=True))
    bandits.append(Bandit(epsilon=0.1, sample_averages=True))
    _, average_rewards = simulate(runs, time, bandits)

    plt.plot(average_rewards[0], label='UCB $c = 2$')
    plt.plot(average_rewards[1], label=r'epsilon greedy $\epsilon = 0.1$')
    plt.xlabel('Steps')
    plt.ylabel('Average reward')
    plt.legend()

    plt.savefig('../images/figure_2_4.png')
    plt.close()


def figure_2_5(runs=2000, time=1000):
    bandits = []
    bandits.append(Bandit(gradient=True, step_size=0.1, gradient_baseline=True, true_reward=4))
    bandits.append(Bandit(gradient=True, step_size=0.1, gradient_baseline=False, true_reward=4))
    bandits.append(Bandit(gradient=True, step_size=0.4, gradient_baseline=True, true_reward=4))
    bandits.append(Bandit(gradient=True, step_size=0.4, gradient_baseline=False, true_reward=4))
    best_action_counts, _ = simulate(runs, time, bandits)
    labels = [r'$\alpha = 0.1$, with baseline',
              r'$\alpha = 0.1$, without baseline',
              r'$\alpha = 0.4$, with baseline',
              r'$\alpha = 0.4$, without baseline']

    for i in range(len(bandits)):
        plt.plot(best_action_counts[i], label=labels[i])
    plt.xlabel('Steps')
    plt.ylabel('% Optimal action')
    plt.legend()

    plt.savefig('../images/figure_2_5.png')
    plt.close()


def figure_2_6(runs=2000, time=1000):
    labels = ['epsilon-greedy', 'gradient bandit',
              'UCB', 'optimistic initialization']
    generators = [lambda epsilon: Bandit(epsilon=epsilon, sample_averages=True),
                  lambda alpha: Bandit(gradient=True, step_size=alpha, gradient_baseline=True),
                  lambda coef: 
                      Bandit(epsilon=0, UCB_param=coef, sample_averages=True),
                  lambda initial: Bandit(epsilon=0, initial=initial, step_size=0.1)]
    # use the builtin float: the np.float alias was deprecated in NumPy 1.20 and later removed
    parameters = [np.arange(-7, -1, dtype=float),
                  np.arange(-5, 2, dtype=float),
                  np.arange(-4, 3, dtype=float),
                  np.arange(-2, 3, dtype=float)]

    bandits = []
    for generator, parameter in zip(generators, parameters):
        for param in parameter:
            bandits.append(generator(pow(2, param)))

    _, average_rewards = simulate(runs, time, bandits)
    rewards = np.mean(average_rewards, axis=1)

    i = 0
    for label, parameter in zip(labels, parameters):
        l = len(parameter)
        plt.plot(parameter, rewards[i:i+l], label=label)
        i += l
    plt.xlabel('Parameter($2^x$)')
    plt.ylabel('Average reward')
    plt.legend()

    plt.savefig('../images/figure_2_6.png')
    plt.close()


if __name__ == '__main__':
    figure_2_1()
    figure_2_2()
    figure_2_3()
    figure_2_4()
    figure_2_5()
    figure_2_6()

================================================
FILE: chapter03/grid_world.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.table import Table

matplotlib.use('Agg')

WORLD_SIZE = 5
A_POS = [0, 1]
A_PRIME_POS = [4, 1]
B_POS = [0, 3]
B_PRIME_POS = [2, 3]
DISCOUNT = 0.9

# left, up, right, down
ACTIONS = [np.array([0, -1]),
           np.array([-1, 0]),
           np.array([0, 1]),
           np.array([1, 0])]
ACTIONS_FIGS = ['←', '↑', '→', '↓']


ACTION_PROB = 0.25


def step(state, action):
    if state == A_POS:
        return A_PRIME_POS, 10
    if state == B_POS:
        return B_PRIME_POS, 5

    next_state = (np.array(state) + action).tolist()
    x, y = next_state
    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
        reward = -1.0
next_state = state else: reward = 0 return next_state, reward def draw_image(image): fig, ax = plt.subplots() ax.set_axis_off() tb = Table(ax, bbox=[0, 0, 1, 1]) nrows, ncols = image.shape width, height = 1.0 / ncols, 1.0 / nrows # Add cells for (i, j), val in np.ndenumerate(image): # add state labels if [i, j] == A_POS: val = str(val) + " (A)" if [i, j] == A_PRIME_POS: val = str(val) + " (A')" if [i, j] == B_POS: val = str(val) + " (B)" if [i, j] == B_PRIME_POS: val = str(val) + " (B')" tb.add_cell(i, j, width, height, text=val, loc='center', facecolor='white') # Row and column labels... for i in range(len(image)): tb.add_cell(i, -1, width, height, text=i+1, loc='right', edgecolor='none', facecolor='none') tb.add_cell(-1, i, width, height/2, text=i+1, loc='center', edgecolor='none', facecolor='none') ax.add_table(tb) def draw_policy(optimal_values): fig, ax = plt.subplots() ax.set_axis_off() tb = Table(ax, bbox=[0, 0, 1, 1]) nrows, ncols = optimal_values.shape width, height = 1.0 / ncols, 1.0 / nrows # Add cells for (i, j), val in np.ndenumerate(optimal_values): next_vals=[] for action in ACTIONS: next_state, _ = step([i, j], action) next_vals.append(optimal_values[next_state[0],next_state[1]]) best_actions=np.where(next_vals == np.max(next_vals))[0] val='' for ba in best_actions: val+=ACTIONS_FIGS[ba] # add state labels if [i, j] == A_POS: val = str(val) + " (A)" if [i, j] == A_PRIME_POS: val = str(val) + " (A')" if [i, j] == B_POS: val = str(val) + " (B)" if [i, j] == B_PRIME_POS: val = str(val) + " (B')" tb.add_cell(i, j, width, height, text=val, loc='center', facecolor='white') # Row and column labels... 
for i in range(len(optimal_values)): tb.add_cell(i, -1, width, height, text=i+1, loc='right', edgecolor='none', facecolor='none') tb.add_cell(-1, i, width, height/2, text=i+1, loc='center', edgecolor='none', facecolor='none') ax.add_table(tb) def figure_3_2(): value = np.zeros((WORLD_SIZE, WORLD_SIZE)) while True: # keep iteration until convergence new_value = np.zeros_like(value) for i in range(WORLD_SIZE): for j in range(WORLD_SIZE): for action in ACTIONS: (next_i, next_j), reward = step([i, j], action) # bellman equation new_value[i, j] += ACTION_PROB * (reward + DISCOUNT * value[next_i, next_j]) if np.sum(np.abs(value - new_value)) < 1e-4: draw_image(np.round(new_value, decimals=2)) plt.savefig('../images/figure_3_2.png') plt.close() break value = new_value def figure_3_2_linear_system(): ''' Here we solve the linear system of equations to find the exact solution. We do this by filling the coefficients for each of the states with their respective right side constant. ''' A = -1 * np.eye(WORLD_SIZE * WORLD_SIZE) b = np.zeros(WORLD_SIZE * WORLD_SIZE) for i in range(WORLD_SIZE): for j in range(WORLD_SIZE): s = [i, j] # current state index_s = np.ravel_multi_index(s, (WORLD_SIZE, WORLD_SIZE)) for a in ACTIONS: s_, r = step(s, a) index_s_ = np.ravel_multi_index(s_, (WORLD_SIZE, WORLD_SIZE)) A[index_s, index_s_] += ACTION_PROB * DISCOUNT b[index_s] -= ACTION_PROB * r x = np.linalg.solve(A, b) draw_image(np.round(x.reshape(WORLD_SIZE, WORLD_SIZE), decimals=2)) plt.savefig('../images/figure_3_2_linear_system.png') plt.close() def figure_3_5(): value = np.zeros((WORLD_SIZE, WORLD_SIZE)) while True: # keep iteration until convergence new_value = np.zeros_like(value) for i in range(WORLD_SIZE): for j in range(WORLD_SIZE): values = [] for action in ACTIONS: (next_i, next_j), reward = step([i, j], action) # value iteration values.append(reward + DISCOUNT * value[next_i, next_j]) new_value[i, j] = np.max(values) if np.sum(np.abs(new_value - value)) < 1e-4: 
            draw_image(np.round(new_value, decimals=2))
            plt.savefig('../images/figure_3_5.png')
            plt.close()
            draw_policy(new_value)
            plt.savefig('../images/figure_3_5_policy.png')
            plt.close()
            break
        value = new_value


if __name__ == '__main__':
    figure_3_2_linear_system()
    figure_3_2()
    figure_3_5()


================================================
FILE: chapter04/car_rental.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016 Shangtong Zhang(zhangshangtong.cpp@gmail.com)                  #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# 2017 Aja Rangaswamy (aja004@gmail.com)                              #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.stats import poisson

matplotlib.use('Agg')

# maximum # of cars in each location
MAX_CARS = 20

# maximum # of cars to move during night
MAX_MOVE_OF_CARS = 5

# expectation for rental requests in first location
RENTAL_REQUEST_FIRST_LOC = 3

# expectation for rental requests in second location
RENTAL_REQUEST_SECOND_LOC = 4

# expectation for # of cars returned in first location
RETURNS_FIRST_LOC = 3

# expectation for # of cars returned in second location
RETURNS_SECOND_LOC = 2

DISCOUNT = 0.9

# credit earned by a car
RENTAL_CREDIT = 10

# cost of moving a car
MOVE_CAR_COST = 2

# all possible actions
actions = np.arange(-MAX_MOVE_OF_CARS, MAX_MOVE_OF_CARS + 1)

# An upper bound for poisson distribution
# If n is greater than this value, then the probability of getting n is truncated to 0
POISSON_UPPER_BOUND = 11

# Probability for poisson distribution
# @lam: lambda should be less than 10 for this function
poisson_cache = dict()


def poisson_probability(n, lam):
    global poisson_cache
    key = n * 10 + lam
    if key not in poisson_cache:
        poisson_cache[key] = poisson.pmf(n, lam)
    return poisson_cache[key]


def expected_return(state, action, state_value, constant_returned_cars):
    """
    @state: [# of cars in first location, # of cars in second location]
    @action: positive if moving cars from first location to second location,
            negative if moving cars from second location to first location
    @state_value: state value matrix
    @constant_returned_cars: if set True, model is simplified such that
    the # of cars returned in daytime becomes constant
    rather than a random value from poisson distribution, which will reduce calculation time
    and leave the optimal policy/value state matrix almost the same
    """
    # initialize total return
    returns = 0.0

    # cost for moving cars
    returns -= MOVE_CAR_COST * abs(action)

    # moving cars
    NUM_OF_CARS_FIRST_LOC = min(state[0] - action, MAX_CARS)
    NUM_OF_CARS_SECOND_LOC = min(state[1] + action, MAX_CARS)

    # go through all possible rental requests
    for rental_request_first_loc in range(POISSON_UPPER_BOUND):
        for rental_request_second_loc in range(POISSON_UPPER_BOUND):
            # probability for current combination of rental requests
            prob = poisson_probability(rental_request_first_loc, RENTAL_REQUEST_FIRST_LOC) * \
                poisson_probability(rental_request_second_loc, RENTAL_REQUEST_SECOND_LOC)

            num_of_cars_first_loc = NUM_OF_CARS_FIRST_LOC
            num_of_cars_second_loc = NUM_OF_CARS_SECOND_LOC

            # valid rental requests cannot exceed the actual # of cars
            valid_rental_first_loc = min(num_of_cars_first_loc, rental_request_first_loc)
            valid_rental_second_loc = min(num_of_cars_second_loc, rental_request_second_loc)

            # get credits for renting
            reward = (valid_rental_first_loc + valid_rental_second_loc) * RENTAL_CREDIT
            num_of_cars_first_loc -= valid_rental_first_loc
            num_of_cars_second_loc -= valid_rental_second_loc

            if constant_returned_cars:
                # get returned cars, those cars can be used for renting tomorrow
                returned_cars_first_loc = RETURNS_FIRST_LOC
                returned_cars_second_loc = RETURNS_SECOND_LOC
                num_of_cars_first_loc = min(num_of_cars_first_loc + returned_cars_first_loc, MAX_CARS)
                num_of_cars_second_loc = min(num_of_cars_second_loc + returned_cars_second_loc, MAX_CARS)
                returns += prob * (reward + DISCOUNT * state_value[num_of_cars_first_loc, num_of_cars_second_loc])
            else:
                for returned_cars_first_loc in range(POISSON_UPPER_BOUND):
                    for returned_cars_second_loc in range(POISSON_UPPER_BOUND):
                        prob_return = poisson_probability(
                            returned_cars_first_loc, RETURNS_FIRST_LOC) * poisson_probability(returned_cars_second_loc, RETURNS_SECOND_LOC)
                        num_of_cars_first_loc_ = min(num_of_cars_first_loc + returned_cars_first_loc, MAX_CARS)
                        num_of_cars_second_loc_ = min(num_of_cars_second_loc + returned_cars_second_loc, MAX_CARS)
                        prob_ = prob_return * prob
                        returns += prob_ * (reward + DISCOUNT *
                                            state_value[num_of_cars_first_loc_, num_of_cars_second_loc_])
    return returns


def figure_4_2(constant_returned_cars=True):
    value = np.zeros((MAX_CARS + 1, MAX_CARS + 1))
    # np.int was removed in NumPy 1.24; the builtin int is equivalent here
    policy = np.zeros(value.shape, dtype=int)

    iterations = 0
    _, axes = plt.subplots(2, 3, figsize=(40, 20))
    plt.subplots_adjust(wspace=0.1, hspace=0.2)
    axes = axes.flatten()
    while True:
        fig = sns.heatmap(np.flipud(policy), cmap="YlGnBu", ax=axes[iterations])
        fig.set_ylabel('# cars at first location', fontsize=30)
        fig.set_yticks(list(reversed(range(MAX_CARS + 1))))
        fig.set_xlabel('# cars at second location', fontsize=30)
        fig.set_title('policy {}'.format(iterations), fontsize=30)

        # policy evaluation (in-place)
        while True:
            old_value = value.copy()
            for i in range(MAX_CARS + 1):
                for j in range(MAX_CARS + 1):
                    new_state_value = expected_return([i, j], policy[i, j], value, constant_returned_cars)
                    value[i, j] = new_state_value
            max_value_change = abs(old_value - value).max()
            print('max value change {}'.format(max_value_change))
            if max_value_change < 1e-4:
                break

        # policy improvement
        policy_stable = True
        for i in range(MAX_CARS + 1):
            for j in range(MAX_CARS + 1):
                old_action = policy[i, j]
                action_returns = []
                for action in actions:
                    if (0 <= action <= i) or (-j <= action <= 0):
                        action_returns.append(expected_return([i, j], action, value, constant_returned_cars))
                    else:
                        action_returns.append(-np.inf)
                new_action = actions[np.argmax(action_returns)]
                policy[i, j] = new_action
                if policy_stable and old_action != new_action:
                    policy_stable = False
        print('policy stable {}'.format(policy_stable))

        if policy_stable:
            fig = sns.heatmap(np.flipud(value), cmap="YlGnBu", ax=axes[-1])
            fig.set_ylabel('# cars at first location', fontsize=30)
            fig.set_yticks(list(reversed(range(MAX_CARS + 1))))
            fig.set_xlabel('# cars at second location', fontsize=30)
            fig.set_title('optimal value', fontsize=30)
            break

        iterations += 1

    plt.savefig('../images/figure_4_2.png')
    plt.close()


if __name__ == '__main__':
    figure_4_2()


================================================
FILE: chapter04/car_rental_synchronous.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016 Shangtong Zhang(zhangshangtong.cpp@gmail.com)                  #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# 2017 Aja Rangaswamy (aja004@gmail.com)                              #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

# This file is contributed by Tahsincan Köse which implements a synchronous policy evaluation, while the car_rental.py
# implements an asynchronous policy evaluation. This file also utilizes multi-processing for acceleration and contains
# an answer to Exercise 4.5

import numpy as np
import matplotlib.pyplot as plt
import math
import tqdm
import multiprocessing as mp
from functools import partial
import time
import itertools

############# PROBLEM SPECIFIC CONSTANTS #######################
MAX_CARS = 20
MAX_MOVE = 5
MOVE_COST = -2
ADDITIONAL_PARK_COST = -4

RENT_REWARD = 10
# expectation for rental requests in first location
RENTAL_REQUEST_FIRST_LOC = 3
# expectation for rental requests in second location
RENTAL_REQUEST_SECOND_LOC = 4
# expectation for # of cars returned in first location
RETURNS_FIRST_LOC = 3
# expectation for # of cars returned in second location
RETURNS_SECOND_LOC = 2
################################################################

poisson_cache = dict()


def poisson(n, lam):
    global poisson_cache
    key = n * 10 + lam
    if key not in poisson_cache.keys():
        poisson_cache[key] = math.exp(-lam) * math.pow(lam, n) / math.factorial(n)
    return poisson_cache[key]


class PolicyIteration:
    def __init__(self, truncate, parallel_processes, delta=1e-2, gamma=0.9, solve_4_5=False):
        self.TRUNCATE = truncate
        self.NR_PARALLEL_PROCESSES = parallel_processes
        self.actions = np.arange(-MAX_MOVE, MAX_MOVE + 1)
        self.inverse_actions = {el: ind[0] for ind, el in np.ndenumerate(self.actions)}
        self.values = np.zeros((MAX_CARS + 1, MAX_CARS + 1))
        # np.int was removed in NumPy 1.24; the builtin int is equivalent here
        self.policy = np.zeros(self.values.shape, dtype=int)
        self.delta = delta
        self.gamma = gamma
        self.solve_extension = solve_4_5

    def solve(self):
        iterations = 0
        total_start_time = time.time()
        while True:
            start_time = time.time()
            self.values = self.policy_evaluation(self.values, self.policy)
            elapsed_time = time.time() - start_time
            print(f'PE => Elapsed time {elapsed_time} seconds')
            start_time = time.time()

            policy_change, self.policy = self.policy_improvement(self.actions, self.values, self.policy)
            elapsed_time = time.time() - start_time
            print(f'PI => Elapsed time {elapsed_time} seconds')
            if policy_change == 0:
                break
            iterations += 1
        total_elapsed_time = time.time() - total_start_time
        print(f'Optimal policy is reached after {iterations} iterations in {total_elapsed_time} seconds')

    # out-of-place
    def policy_evaluation(self, values, policy):

        global MAX_CARS
        while True:
            new_values = np.copy(values)
            k = np.arange(MAX_CARS + 1)
            # cartesian product
            all_states = ((i, j) for i, j in itertools.product(k, k))

            results = []
            with mp.Pool(processes=self.NR_PARALLEL_PROCESSES) as p:
                cook = partial(self.expected_return_pe, policy, values)
                results = p.map(cook, all_states)

            for v, i, j in results:
                new_values[i, j] = v

            difference = np.abs(new_values - values).sum()
            print(f'Difference: {difference}')
            values = new_values
            if difference < self.delta:
                print('Values have converged!')
                return values

    def policy_improvement(self, actions, values, policy):
        new_policy = np.copy(policy)

        expected_action_returns = np.zeros((MAX_CARS + 1, MAX_CARS + 1, np.size(actions)))
        cooks = dict()
        # use the configured pool size rather than a hard-coded 8 processes
        with mp.Pool(processes=self.NR_PARALLEL_PROCESSES) as p:
            for action in actions:
                k = np.arange(MAX_CARS + 1)
                all_states = ((i, j) for i, j in itertools.product(k, k))
                cooks[action] = partial(self.expected_return_pi, values, action)
                results = p.map(cooks[action], all_states)
                for v, i, j, a in results:
                    expected_action_returns[i, j, self.inverse_actions[a]] = v
        for i in range(expected_action_returns.shape[0]):
            for j in range(expected_action_returns.shape[1]):
                new_policy[i, j] = actions[np.argmax(expected_action_returns[i, j])]

        policy_change = (new_policy != policy).sum()
        print(f'Policy changed in {policy_change} states')
        return policy_change, new_policy

    # O(n^4) computation for all possible requests and returns
    def bellman(self, values, action, state):
        expected_return = 0
        if self.solve_extension:
            if action > 0:
                # Free shuttle to the second location
                expected_return += MOVE_COST * (action - 1)
            else:
                expected_return += MOVE_COST * abs(action)
        else:
            expected_return += MOVE_COST * abs(action)

        for req1 in range(0, self.TRUNCATE):
            for req2 in range(0, self.TRUNCATE):
                # moving cars
                num_of_cars_first_loc = int(min(state[0] - action, MAX_CARS))
                num_of_cars_second_loc = int(min(state[1] + action, MAX_CARS))

                # valid rental requests cannot exceed the actual # of cars
                real_rental_first_loc = min(num_of_cars_first_loc, req1)
                real_rental_second_loc = min(num_of_cars_second_loc, req2)

                # get credits for renting
                reward = (real_rental_first_loc + real_rental_second_loc) * RENT_REWARD

                if self.solve_extension:
                    # Exercise 4.5 charges for the second parking lot only when
                    # more than 10 cars are kept overnight at a location
                    if num_of_cars_first_loc > 10:
                        reward += ADDITIONAL_PARK_COST
                    if num_of_cars_second_loc > 10:
                        reward += ADDITIONAL_PARK_COST

                num_of_cars_first_loc -= real_rental_first_loc
                num_of_cars_second_loc -= real_rental_second_loc

                # probability for current combination of rental requests
                prob = poisson(req1, RENTAL_REQUEST_FIRST_LOC) * \
                       poisson(req2, RENTAL_REQUEST_SECOND_LOC)
                for ret1 in range(0, self.TRUNCATE):
                    for ret2 in range(0, self.TRUNCATE):
                        num_of_cars_first_loc_ = min(num_of_cars_first_loc + ret1, MAX_CARS)
                        num_of_cars_second_loc_ = min(num_of_cars_second_loc + ret2, MAX_CARS)
                        prob_ = poisson(ret1, RETURNS_FIRST_LOC) * \
                                poisson(ret2, RETURNS_SECOND_LOC) * prob
                        # Classic Bellman equation for state-value
                        # prob_ corresponds to p(s'|s,a) for each possible s' -> (num_of_cars_first_loc_,num_of_cars_second_loc_)
                        expected_return += prob_ * (
                                reward + self.gamma * values[num_of_cars_first_loc_, num_of_cars_second_loc_])
        return expected_return

    # Parallelization enforced different helper functions
    # Expected return calculator for Policy Evaluation
    def expected_return_pe(self, policy, values, state):

        action = policy[state[0], state[1]]
        expected_return = self.bellman(values, action, state)
        return expected_return, state[0], state[1]

    # Expected return calculator for Policy Improvement
    def expected_return_pi(self, values, action, state):

        # an action is infeasible if it moves more cars than the source location holds
        if not ((action >= 0 and state[0] >= action) or (action < 0 and state[1] >= abs(action))):
            return -float('inf'), state[0], state[1], action
        expected_return = self.bellman(values, action, state)
        return expected_return, state[0], state[1], action

    def plot(self):
        print(self.policy)
        plt.figure()
        plt.xlim(0, MAX_CARS + 1)
        plt.ylim(0, MAX_CARS + 1)
        plt.table(cellText=np.flipud(self.policy), loc=(0, 0), cellLoc='center')
        plt.show()


if __name__ == '__main__':
    TRUNCATE = 9
    solver = PolicyIteration(TRUNCATE, parallel_processes=4, delta=1e-1, gamma=0.9, solve_4_5=True)
    solver.solve()
    solver.plot()


================================================
FILE: chapter04/gamblers_problem.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

matplotlib.use('Agg')

# goal
GOAL = 100

# all states, including state 0 and state 100
STATES = np.arange(GOAL + 1)

# probability of head
HEAD_PROB = 0.4


def figure_4_3():
    # state value
    state_value = np.zeros(GOAL + 1)
    state_value[GOAL] = 1.0

    sweeps_history = []

    # value iteration
    while True:
        old_state_value = state_value.copy()
        sweeps_history.append(old_state_value)

        for state in STATES[1:GOAL]:
            # get possible actions for current state
            actions = np.arange(min(state, GOAL - state) + 1)
            action_returns = []
            for action in actions:
                action_returns.append(
                    HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action])
            new_value = np.max(action_returns)
            state_value[state] = new_value
        delta = abs(state_value - old_state_value).max()
        if delta < 1e-9:
            sweeps_history.append(state_value)
            break

    # compute the optimal policy
    policy = np.zeros(GOAL + 1)
    for state in STATES[1:GOAL]:
        actions = np.arange(min(state, GOAL - state) + 1)
        action_returns = []
        for action in actions:
            action_returns.append(
                HEAD_PROB
* state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action]) # round to resemble the figure in the book, see # https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/issues/83 policy[state] = actions[np.argmax(np.round(action_returns[1:], 5)) + 1] plt.figure(figsize=(10, 20)) plt.subplot(2, 1, 1) for sweep, state_value in enumerate(sweeps_history): plt.plot(state_value, label='sweep {}'.format(sweep)) plt.xlabel('Capital') plt.ylabel('Value estimates') plt.legend(loc='best') plt.subplot(2, 1, 2) plt.scatter(STATES, policy) plt.xlabel('Capital') plt.ylabel('Final policy (stake)') plt.savefig('../images/figure_4_3.png') plt.close() if __name__ == '__main__': figure_4_3() ================================================ FILE: chapter04/grid_world.py ================================================ ####################################################################### # Copyright (C) # # 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) # # 2016 Kenta Shimada(hyperkentakun@gmail.com) # # Permission given to modify the code as long as you keep this # # declaration at the top # ####################################################################### import matplotlib import matplotlib.pyplot as plt import numpy as np from matplotlib.table import Table matplotlib.use('Agg') WORLD_SIZE = 4 # left, up, right, down ACTIONS = [np.array([0, -1]), np.array([-1, 0]), np.array([0, 1]), np.array([1, 0])] ACTION_PROB = 0.25 def is_terminal(state): x, y = state return (x == 0 and y == 0) or (x == WORLD_SIZE - 1 and y == WORLD_SIZE - 1) def step(state, action): if is_terminal(state): return state, 0 next_state = (np.array(state) + action).tolist() x, y = next_state if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE: next_state = state reward = -1 return next_state, reward def draw_image(image): fig, ax = plt.subplots() ax.set_axis_off() tb = Table(ax, bbox=[0, 0, 1, 1]) nrows, ncols = image.shape width, height = 1.0 / ncols, 1.0 / nrows 
# Add cells for (i, j), val in np.ndenumerate(image): tb.add_cell(i, j, width, height, text=val, loc='center', facecolor='white') # Row and column labels... for i in range(len(image)): tb.add_cell(i, -1, width, height, text=i+1, loc='right', edgecolor='none', facecolor='none') tb.add_cell(-1, i, width, height/2, text=i+1, loc='center', edgecolor='none', facecolor='none') ax.add_table(tb) def compute_state_value(in_place=True, discount=1.0): new_state_values = np.zeros((WORLD_SIZE, WORLD_SIZE)) iteration = 0 while True: if in_place: state_values = new_state_values else: state_values = new_state_values.copy() old_state_values = state_values.copy() for i in range(WORLD_SIZE): for j in range(WORLD_SIZE): value = 0 for action in ACTIONS: (next_i, next_j), reward = step([i, j], action) value += ACTION_PROB * (reward + discount * state_values[next_i, next_j]) new_state_values[i, j] = value max_delta_value = abs(old_state_values - new_state_values).max() if max_delta_value < 1e-4: break iteration += 1 return new_state_values, iteration def figure_4_1(): # While the author suggests using in-place iterative policy evaluation, # Figure 4.1 actually uses out-of-place version. 
    _, async_iteration = compute_state_value(in_place=True)
    values, sync_iteration = compute_state_value(in_place=False)
    draw_image(np.round(values, decimals=2))
    print('In-place: {} iterations'.format(async_iteration))
    print('Synchronous: {} iterations'.format(sync_iteration))

    plt.savefig('../images/figure_4_1.png')
    plt.close()


if __name__ == '__main__':
    figure_4_1()


================================================
FILE: chapter05/blackjack.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# 2017 Nicky van Foreest(vanforeest@gmail.com)                        #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# actions: hit or stand
ACTION_HIT = 0
ACTION_STAND = 1  # "stick" in the book
ACTIONS = [ACTION_HIT, ACTION_STAND]

# policy for player
# np.int was removed in NumPy 1.24; the builtin int is equivalent here
POLICY_PLAYER = np.zeros(22, dtype=int)
for i in range(12, 20):
    POLICY_PLAYER[i] = ACTION_HIT
POLICY_PLAYER[20] = ACTION_STAND
POLICY_PLAYER[21] = ACTION_STAND

# function form of target policy of player
def target_policy_player(usable_ace_player, player_sum, dealer_card):
    return POLICY_PLAYER[player_sum]

# function form of behavior policy of player
def behavior_policy_player(usable_ace_player, player_sum, dealer_card):
    if np.random.binomial(1, 0.5) == 1:
        return ACTION_STAND
    return ACTION_HIT

# policy for dealer
POLICY_DEALER = np.zeros(22)
for i in range(12, 17):
    POLICY_DEALER[i] = ACTION_HIT
for i in range(17, 22):
    POLICY_DEALER[i] = ACTION_STAND

# get a new card
def get_card():
    card = np.random.randint(1, 14)
    card = min(card, 10)
    return card

# get the value of a card (11 for ace).
def card_value(card_id): return 11 if card_id == 1 else card_id # play a game # @policy_player: specify policy for player # @initial_state: [whether player has a usable Ace, sum of player's cards, one card of dealer] # @initial_action: the initial action def play(policy_player, initial_state=None, initial_action=None): # player status # sum of player player_sum = 0 # trajectory of player player_trajectory = [] # whether player uses Ace as 11 usable_ace_player = False # dealer status dealer_card1 = 0 dealer_card2 = 0 usable_ace_dealer = False if initial_state is None: # generate a random initial state while player_sum < 12: # if sum of player is less than 12, always hit card = get_card() player_sum += card_value(card) # If the player's sum is larger than 21, he may hold one or two aces. if player_sum > 21: assert player_sum == 22 # last card must be ace player_sum -= 10 else: usable_ace_player |= (1 == card) # initialize cards of dealer, suppose dealer will show the first card he gets dealer_card1 = get_card() dealer_card2 = get_card() else: # use specified initial state usable_ace_player, player_sum, dealer_card1 = initial_state dealer_card2 = get_card() # initial state of the game state = [usable_ace_player, player_sum, dealer_card1] # initialize dealer's sum dealer_sum = card_value(dealer_card1) + card_value(dealer_card2) usable_ace_dealer = 1 in (dealer_card1, dealer_card2) # if the dealer's sum is larger than 21, he must hold two aces. if dealer_sum > 21: assert dealer_sum == 22 # use one Ace as 1 rather than 11 dealer_sum -= 10 assert dealer_sum <= 21 assert player_sum <= 21 # game starts! 
# player's turn while True: if initial_action is not None: action = initial_action initial_action = None else: # get action based on current sum action = policy_player(usable_ace_player, player_sum, dealer_card1) # track player's trajectory for importance sampling player_trajectory.append([(usable_ace_player, player_sum, dealer_card1), action]) if action == ACTION_STAND: break # if hit, get new card card = get_card() # Keep track of the ace count. the usable_ace_player flag is insufficient alone as it cannot # distinguish between having one ace or two. ace_count = int(usable_ace_player) if card == 1: ace_count += 1 player_sum += card_value(card) # If the player has a usable ace, use it as 1 to avoid busting and continue. while player_sum > 21 and ace_count: player_sum -= 10 ace_count -= 1 # player busts if player_sum > 21: return state, -1, player_trajectory assert player_sum <= 21 usable_ace_player = (ace_count == 1) # dealer's turn while True: # get action based on current sum action = POLICY_DEALER[dealer_sum] if action == ACTION_STAND: break # if hit, get a new card new_card = get_card() ace_count = int(usable_ace_dealer) if new_card == 1: ace_count += 1 dealer_sum += card_value(new_card) # If the dealer has a usable ace, use it as 1 to avoid busting and continue. 
        while dealer_sum > 21 and ace_count:
            dealer_sum -= 10
            ace_count -= 1
        # dealer busts
        if dealer_sum > 21:
            return state, 1, player_trajectory
        usable_ace_dealer = (ace_count == 1)

    # compare the sum between player and dealer
    assert player_sum <= 21 and dealer_sum <= 21
    if player_sum > dealer_sum:
        return state, 1, player_trajectory
    elif player_sum == dealer_sum:
        return state, 0, player_trajectory
    else:
        return state, -1, player_trajectory

# Monte Carlo Sample with On-Policy
def monte_carlo_on_policy(episodes):
    states_usable_ace = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by 0
    states_usable_ace_count = np.ones((10, 10))
    states_no_usable_ace = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by 0
    states_no_usable_ace_count = np.ones((10, 10))
    for i in tqdm(range(0, episodes)):
        _, reward, player_trajectory = play(target_policy_player)
        for (usable_ace, player_sum, dealer_card), _ in player_trajectory:
            player_sum -= 12
            dealer_card -= 1
            if usable_ace:
                states_usable_ace_count[player_sum, dealer_card] += 1
                states_usable_ace[player_sum, dealer_card] += reward
            else:
                states_no_usable_ace_count[player_sum, dealer_card] += 1
                states_no_usable_ace[player_sum, dealer_card] += reward
    return states_usable_ace / states_usable_ace_count, states_no_usable_ace / states_no_usable_ace_count

# Monte Carlo with Exploring Starts
def monte_carlo_es(episodes):
    # (playerSum, dealerCard, usableAce, action)
    state_action_values = np.zeros((10, 10, 2, 2))
    # initialize counts to 1 to avoid division by 0
    state_action_pair_count = np.ones((10, 10, 2, 2))

    # behavior policy is greedy
    def behavior_policy(usable_ace, player_sum, dealer_card):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        # get argmax of the average returns(s, a)
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :] / \
                  state_action_pair_count[player_sum, dealer_card, usable_ace, :]
        return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ ==
np.max(values_)]) # play for several episodes for episode in tqdm(range(episodes)): # for each episode, use a randomly initialized state and action initial_state = [bool(np.random.choice([0, 1])), np.random.choice(range(12, 22)), np.random.choice(range(1, 11))] initial_action = np.random.choice(ACTIONS) current_policy = behavior_policy if episode else target_policy_player _, reward, trajectory = play(current_policy, initial_state, initial_action) first_visit_check = set() for (usable_ace, player_sum, dealer_card), action in trajectory: usable_ace = int(usable_ace) player_sum -= 12 dealer_card -= 1 state_action = (usable_ace, player_sum, dealer_card, action) if state_action in first_visit_check: continue first_visit_check.add(state_action) # update values of state-action pairs state_action_values[player_sum, dealer_card, usable_ace, action] += reward state_action_pair_count[player_sum, dealer_card, usable_ace, action] += 1 return state_action_values / state_action_pair_count # Monte Carlo Sample with Off-Policy def monte_carlo_off_policy(episodes): initial_state = [True, 13, 2] rhos = [] returns = [] for i in range(0, episodes): _, reward, player_trajectory = play(behavior_policy_player, initial_state=initial_state) # get the importance ratio numerator = 1.0 denominator = 1.0 for (usable_ace, player_sum, dealer_card), action in player_trajectory: if action == target_policy_player(usable_ace, player_sum, dealer_card): denominator *= 0.5 else: numerator = 0.0 break rho = numerator / denominator rhos.append(rho) returns.append(reward) rhos = np.asarray(rhos) returns = np.asarray(returns) weighted_returns = rhos * returns weighted_returns = np.add.accumulate(weighted_returns) rhos = np.add.accumulate(rhos) ordinary_sampling = weighted_returns / np.arange(1, episodes + 1) with np.errstate(divide='ignore',invalid='ignore'): weighted_sampling = np.where(rhos != 0, weighted_returns / rhos, 0) return ordinary_sampling, weighted_sampling def figure_5_1(): states_usable_ace_1, 
states_no_usable_ace_1 = monte_carlo_on_policy(10000) states_usable_ace_2, states_no_usable_ace_2 = monte_carlo_on_policy(500000) states = [states_usable_ace_1, states_usable_ace_2, states_no_usable_ace_1, states_no_usable_ace_2] titles = ['Usable Ace, 10000 Episodes', 'Usable Ace, 500000 Episodes', 'No Usable Ace, 10000 Episodes', 'No Usable Ace, 500000 Episodes'] _, axes = plt.subplots(2, 2, figsize=(40, 30)) plt.subplots_adjust(wspace=0.1, hspace=0.2) axes = axes.flatten() for state, title, axis in zip(states, titles, axes): fig = sns.heatmap(np.flipud(state), cmap="YlGnBu", ax=axis, xticklabels=range(1, 11), yticklabels=list(reversed(range(12, 22)))) fig.set_ylabel('player sum', fontsize=30) fig.set_xlabel('dealer showing', fontsize=30) fig.set_title(title, fontsize=30) plt.savefig('../images/figure_5_1.png') plt.close() def figure_5_2(): state_action_values = monte_carlo_es(500000) state_value_no_usable_ace = np.max(state_action_values[:, :, 0, :], axis=-1) state_value_usable_ace = np.max(state_action_values[:, :, 1, :], axis=-1) # get the optimal policy action_no_usable_ace = np.argmax(state_action_values[:, :, 0, :], axis=-1) action_usable_ace = np.argmax(state_action_values[:, :, 1, :], axis=-1) images = [action_usable_ace, state_value_usable_ace, action_no_usable_ace, state_value_no_usable_ace] titles = ['Optimal policy with usable Ace', 'Optimal value with usable Ace', 'Optimal policy without usable Ace', 'Optimal value without usable Ace'] _, axes = plt.subplots(2, 2, figsize=(40, 30)) plt.subplots_adjust(wspace=0.1, hspace=0.2) axes = axes.flatten() for image, title, axis in zip(images, titles, axes): fig = sns.heatmap(np.flipud(image), cmap="YlGnBu", ax=axis, xticklabels=range(1, 11), yticklabels=list(reversed(range(12, 22)))) fig.set_ylabel('player sum', fontsize=30) fig.set_xlabel('dealer showing', fontsize=30) fig.set_title(title, fontsize=30) plt.savefig('../images/figure_5_2.png') plt.close() def figure_5_3(): true_value = -0.27726 episodes = 
10000
    runs = 100
    error_ordinary = np.zeros(episodes)
    error_weighted = np.zeros(episodes)
    for i in tqdm(range(0, runs)):
        ordinary_sampling_, weighted_sampling_ = monte_carlo_off_policy(episodes)
        # get the squared error
        error_ordinary += np.power(ordinary_sampling_ - true_value, 2)
        error_weighted += np.power(weighted_sampling_ - true_value, 2)
    error_ordinary /= runs
    error_weighted /= runs

    plt.plot(np.arange(1, episodes + 1), error_ordinary, color='green', label='Ordinary Importance Sampling')
    plt.plot(np.arange(1, episodes + 1), error_weighted, color='red', label='Weighted Importance Sampling')
    plt.ylim(-0.1, 5)
    plt.xlabel('Episodes (log scale)')
    plt.ylabel(f'Mean square error\n(average over {runs} runs)')
    plt.xscale('log')
    plt.legend()

    plt.savefig('../images/figure_5_3.png')
    plt.close()


if __name__ == '__main__':
    figure_5_1()
    figure_5_2()
    figure_5_3()


================================================
FILE: chapter05/infinite_variance.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

ACTION_BACK = 0
ACTION_END = 1


# behavior policy
def behavior_policy():
    return np.random.binomial(1, 0.5)


# target policy
def target_policy():
    return ACTION_BACK


# one turn
def play():
    # track the action for importance ratio
    trajectory = []
    while True:
        action = behavior_policy()
        trajectory.append(action)
        if action == ACTION_END:
            return 0, trajectory
        if np.random.binomial(1, 0.9) == 0:
            return 1, trajectory


def figure_5_4():
    runs = 10
    episodes = 100000
    for run in range(runs):
        rewards = []
        for episode in range(0, episodes):
            reward, trajectory = play()
            if trajectory[-1] == ACTION_END:
                rho = 0
            else:
                rho = 1.0 / pow(0.5, len(trajectory))
            rewards.append(rho * reward)
        rewards = np.add.accumulate(rewards)
        estimations = np.asarray(rewards) / np.arange(1, episodes + 1)
        plt.plot(estimations)
    plt.xlabel('Episodes (log scale)')
    plt.ylabel('Ordinary Importance Sampling')
    plt.xscale('log')

    plt.savefig('../images/figure_5_4.png')
    plt.close()


if __name__ == '__main__':
    figure_5_4()


================================================
FILE: chapter06/cliff_walking.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm

# world height
WORLD_HEIGHT = 4

# world width
WORLD_WIDTH = 12

# probability for exploration
EPSILON = 0.1

# step size
ALPHA = 0.5

# gamma for Q-Learning and Expected Sarsa
GAMMA = 1

# all possible actions
ACTION_UP = 0
ACTION_DOWN = 1
ACTION_LEFT = 2
ACTION_RIGHT = 3
ACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]

# start and goal states
START = [3, 0]
GOAL = [3, 11]


def step(state, action):
    i, j = state
    if action == ACTION_UP:
        next_state = [max(i - 1, 0), j]
    elif action == ACTION_LEFT:
        next_state = [i, max(j - 1, 0)]
    elif action == ACTION_RIGHT:
        next_state = [i, min(j + 1, WORLD_WIDTH - 1)]
    elif action == ACTION_DOWN:
        next_state = [min(i + 1, WORLD_HEIGHT - 1), j]
    else:
        assert False

    # stepping into the cliff costs -100 and sends the agent back to the start
    reward = -1
    if (action == ACTION_DOWN and i == 2 and 1 <= j <= 10) or (
            action == ACTION_RIGHT and state == START):
        reward = -100
        next_state = START

    return next_state, reward


# reward for each action in each state
# actionRewards = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))
# actionRewards[:, :, :] = -1.0
# actionRewards[2, 1:11, ACTION_DOWN] = -100.0
# actionRewards[3, 0, ACTION_RIGHT] = -100.0

# set up destinations for each action in each state
# actionDestination = []
# for i in range(0, WORLD_HEIGHT):
#     actionDestination.append([])
#     for j in range(0, WORLD_WIDTH):
#         destination = dict()
#         destination[ACTION_UP] = [max(i - 1, 0), j]
#         destination[ACTION_LEFT] = [i, max(j - 1, 0)]
#         destination[ACTION_RIGHT] = [i, min(j + 1, WORLD_WIDTH - 1)]
#         if i == 2 and 1 <= j <= 10:
#             destination[ACTION_DOWN] = START
#         else:
#             destination[ACTION_DOWN] = [min(i + 1, WORLD_HEIGHT - 1), j]
#         actionDestination[-1].append(destination)
# actionDestination[3][0][ACTION_RIGHT] = START


# choose an action based on epsilon greedy algorithm
def choose_action(state, q_value):
    if np.random.binomial(1, EPSILON) == 1:
        return np.random.choice(ACTIONS)
    else:
        values_ = q_value[state[0], state[1], :]
        return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])


# an episode with Sarsa
# @q_value: values for state action pair, will be updated
# @expected: if True, will use expected Sarsa algorithm
# @step_size: step size for updating
# @return: total rewards within this episode
def sarsa(q_value, expected=False, step_size=ALPHA):
    state = START
    action = choose_action(state, q_value)
    rewards = 0.0
    while state != GOAL:
        next_state, reward = step(state, action)
        next_action = choose_action(next_state, q_value)
        rewards += reward
        if not expected:
            target = q_value[next_state[0], next_state[1], next_action]
        else:
            # calculate the expected value of new state
            target = 0.0
            q_next = q_value[next_state[0], next_state[1], :]
            best_actions = np.argwhere(q_next == np.max(q_next))
            for action_ in ACTIONS:
                if action_ in best_actions:
                    target += ((1.0 - EPSILON) / len(best_actions) + EPSILON / len(ACTIONS)) * q_value[next_state[0], next_state[1], action_]
                else:
                    target += EPSILON / len(ACTIONS) * q_value[next_state[0], next_state[1], action_]
        target *= GAMMA
        q_value[state[0], state[1], action] += step_size * (
                reward + target - q_value[state[0], state[1], action])
        state = next_state
        action = next_action
    return rewards


# an episode with Q-Learning
# @q_value: values for state action pair, will be updated
# @step_size: step size for updating
# @return: total rewards within this episode
def q_learning(q_value, step_size=ALPHA):
    state = START
    rewards = 0.0
    while state != GOAL:
        action = choose_action(state, q_value)
        next_state, reward = step(state, action)
        rewards += reward
        # Q-Learning update
        q_value[state[0], state[1], action] += step_size * (
                reward + GAMMA * np.max(q_value[next_state[0], next_state[1], :]) -
                q_value[state[0], state[1], action])
        state = next_state
    return rewards


# print optimal policy
def print_optimal_policy(q_value):
    optimal_policy = []
    for i in range(0, WORLD_HEIGHT):
        optimal_policy.append([])
        for j in range(0, WORLD_WIDTH):
            if [i, j] == GOAL:
                optimal_policy[-1].append('G')
                continue
            bestAction = np.argmax(q_value[i, j, :])
            if bestAction == ACTION_UP:
                optimal_policy[-1].append('U')
            elif bestAction == ACTION_DOWN:
                optimal_policy[-1].append('D')
            elif bestAction == ACTION_LEFT:
                optimal_policy[-1].append('L')
            elif bestAction == ACTION_RIGHT:
                optimal_policy[-1].append('R')
    for row in optimal_policy:
        print(row)


# Use multiple runs instead of a single run and a sliding window
# With a single run I failed to present a smooth curve
# However the optimal policy converges well with a single run
# Sarsa converges to the safe path, while Q-Learning converges to the optimal path
def figure_6_4():
    # episodes of each run
    episodes = 500

    # perform 50 independent runs
    runs = 50

    rewards_sarsa = np.zeros(episodes)
    rewards_q_learning = np.zeros(episodes)
    for r in tqdm(range(runs)):
        q_sarsa = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))
        q_q_learning = np.copy(q_sarsa)
        for i in range(0, episodes):
            # cut off the value by -100 to draw the figure more elegantly
            # rewards_sarsa[i] += max(sarsa(q_sarsa), -100)
            # rewards_q_learning[i] += max(q_learning(q_q_learning), -100)
            rewards_sarsa[i] += sarsa(q_sarsa)
            rewards_q_learning[i] += q_learning(q_q_learning)

    # averaging over independent runs
    rewards_sarsa /= runs
    rewards_q_learning /= runs

    # draw reward curves
    plt.plot(rewards_sarsa, label='Sarsa')
    plt.plot(rewards_q_learning, label='Q-Learning')
    plt.xlabel('Episodes')
    plt.ylabel('Sum of rewards during episode')
    plt.ylim([-100, 0])
    plt.legend()

    plt.savefig('../images/figure_6_4.png')
    plt.close()

    # display optimal policy
    print('Sarsa Optimal Policy:')
    print_optimal_policy(q_sarsa)
    print('Q-Learning Optimal Policy:')
    print_optimal_policy(q_q_learning)


# Due to the limited computing capacity of my machine, I can't complete this experiment
# with 100,000 episodes and 50,000 runs to get the fully averaged performance
# However, even with only 1,000 episodes and 10 runs, the curves still look good.
def figure_6_6():
    step_sizes = np.arange(0.1, 1.1, 0.1)
    episodes = 1000
    runs = 10

    ASY_SARSA = 0
    ASY_EXPECTED_SARSA = 1
    ASY_QLEARNING = 2
    INT_SARSA = 3
    INT_EXPECTED_SARSA = 4
    INT_QLEARNING = 5
    methods = range(0, 6)

    performace = np.zeros((6, len(step_sizes)))
    for run in range(runs):
        for ind, step_size in tqdm(list(zip(range(0, len(step_sizes)), step_sizes))):
            q_sarsa = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))
            q_expected_sarsa = np.copy(q_sarsa)
            q_q_learning = np.copy(q_sarsa)
            for ep in range(episodes):
                sarsa_reward = sarsa(q_sarsa, expected=False, step_size=step_size)
                expected_sarsa_reward = sarsa(q_expected_sarsa, expected=True, step_size=step_size)
                q_learning_reward = q_learning(q_q_learning, step_size=step_size)
                performace[ASY_SARSA, ind] += sarsa_reward
                performace[ASY_EXPECTED_SARSA, ind] += expected_sarsa_reward
                performace[ASY_QLEARNING, ind] += q_learning_reward

                # interim performance only counts the first 100 episodes
                if ep < 100:
                    performace[INT_SARSA, ind] += sarsa_reward
                    performace[INT_EXPECTED_SARSA, ind] += expected_sarsa_reward
                    performace[INT_QLEARNING, ind] += q_learning_reward
    performace[:3, :] /= episodes * runs
    performace[3:, :] /= 100 * runs
    labels = ['Asymptotic Sarsa', 'Asymptotic Expected Sarsa', 'Asymptotic Q-Learning',
              'Interim Sarsa', 'Interim Expected Sarsa', 'Interim Q-Learning']

    for method, label in zip(methods, labels):
        plt.plot(step_sizes, performace[method, :], label=label)
    plt.xlabel('alpha')
    plt.ylabel('reward per episode')
    plt.legend()

    plt.savefig('../images/figure_6_6.png')
    plt.close()


if __name__ == '__main__':
    figure_6_4()
    figure_6_6()


================================================
FILE: chapter06/maximization_bias.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
import copy

# state A
STATE_A = 0

# state B
STATE_B = 1

# use one terminal state
STATE_TERMINAL = 2

# starts from state A
STATE_START = STATE_A

# possible actions in A
ACTION_A_RIGHT = 0
ACTION_A_LEFT = 1

# probability for exploration
EPSILON = 0.1

# step size
ALPHA = 0.1

# discount for max value
GAMMA = 1.0

# possible actions in B, maybe 10 actions
ACTIONS_B = range(0, 10)

# all possible actions
STATE_ACTIONS = [[ACTION_A_RIGHT, ACTION_A_LEFT], ACTIONS_B]

# state action pair values, if a state is a terminal state, then the value is always 0
INITIAL_Q = [np.zeros(2), np.zeros(len(ACTIONS_B)), np.zeros(1)]

# set up destination for each state and each action
TRANSITION = [[STATE_TERMINAL, STATE_B], [STATE_TERMINAL] * len(ACTIONS_B)]


# choose an action based on epsilon greedy algorithm
def choose_action(state, q_value):
    if np.random.binomial(1, EPSILON) == 1:
        return np.random.choice(STATE_ACTIONS[state])
    else:
        values_ = q_value[state]
        return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])


# take @action in @state, return the reward
def take_action(state, action):
    if state == STATE_A:
        return 0
    return np.random.normal(-0.1, 1)


# if there are two state action pair value arrays, use double Q-Learning
# otherwise use normal Q-Learning
def q_learning(q1, q2=None):
    state = STATE_START
    # track the # of action left in state A
    left_count = 0
    while state != STATE_TERMINAL:
        if q2 is None:
            action = choose_action(state, q1)
        else:
            # derive an action from Q1 and Q2
            action = choose_action(state, [item1 + item2 for item1, item2 in zip(q1, q2)])
        if state == STATE_A and action == ACTION_A_LEFT:
            left_count += 1
        reward = take_action(state, action)
        next_state = TRANSITION[state][action]
        if q2 is None:
            active_q = q1
            target = np.max(active_q[next_state])
        else:
            # update one table, using the other table to evaluate the greedy action
            if np.random.binomial(1, 0.5) == 1:
                active_q = q1
                target_q = q2
            else:
                active_q = q2
                target_q = q1
            best_action = np.random.choice([action_ for action_, value_ in enumerate(active_q[next_state]) if value_ == np.max(active_q[next_state])])
            target = target_q[next_state][best_action]

        # Q-Learning update
        active_q[state][action] += ALPHA * (
            reward + GAMMA * target - active_q[state][action])
        state = next_state
    return left_count


# Figure 6.7, 1,000 runs may be enough, # of actions in state B will also affect the curves
def figure_6_7():
    # each independent run has 300 episodes
    episodes = 300
    runs = 1000
    left_counts_q = np.zeros((runs, episodes))
    left_counts_double_q = np.zeros((runs, episodes))
    for run in tqdm(range(runs)):
        q = copy.deepcopy(INITIAL_Q)
        q1 = copy.deepcopy(INITIAL_Q)
        q2 = copy.deepcopy(INITIAL_Q)
        for ep in range(0, episodes):
            left_counts_q[run, ep] = q_learning(q)
            left_counts_double_q[run, ep] = q_learning(q1, q2)
    left_counts_q = left_counts_q.mean(axis=0)
    left_counts_double_q = left_counts_double_q.mean(axis=0)

    plt.plot(left_counts_q, label='Q-Learning')
    plt.plot(left_counts_double_q, label='Double Q-Learning')
    plt.plot(np.ones(episodes) * 0.05, label='Optimal')
    plt.xlabel('episodes')
    plt.ylabel('% left actions from A')
    plt.legend()

    plt.savefig('../images/figure_6_7.png')
    plt.close()


if __name__ == '__main__':
    figure_6_7()


================================================
FILE: chapter06/random_walk.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm

# 0 is the left terminal state
# 6 is the right terminal state
# 1 ... 5 represents A ... E
VALUES = np.zeros(7)
VALUES[1:6] = 0.5
# For convenience, we assume all rewards are 0
# and the left terminal state has value 0, the right terminal state has value 1
# This trick has been used in Gambler's Problem
VALUES[6] = 1

# set up true state values
TRUE_VALUE = np.zeros(7)
TRUE_VALUE[1:6] = np.arange(1, 6) / 6.0
TRUE_VALUE[6] = 1

ACTION_LEFT = 0
ACTION_RIGHT = 1


# @values: current states value, will be updated if @batch is False
# @alpha: step size
# @batch: whether to update @values
def temporal_difference(values, alpha=0.1, batch=False):
    state = 3
    trajectory = [state]
    rewards = [0]
    while True:
        old_state = state
        if np.random.binomial(1, 0.5) == ACTION_LEFT:
            state -= 1
        else:
            state += 1
        # Assume all rewards are 0
        reward = 0
        trajectory.append(state)
        # TD update
        if not batch:
            values[old_state] += alpha * (reward + values[state] - values[old_state])
        if state == 6 or state == 0:
            break
        rewards.append(reward)
    return trajectory, rewards


# @values: current states value, will be updated if @batch is False
# @alpha: step size
# @batch: whether to update @values
def monte_carlo(values, alpha=0.1, batch=False):
    state = 3
    trajectory = [state]

    # if end up with left terminal state, all returns are 0
    # if end up with right terminal state, all returns are 1
    while True:
        if np.random.binomial(1, 0.5) == ACTION_LEFT:
            state -= 1
        else:
            state += 1
        trajectory.append(state)
        if state == 6:
            returns = 1.0
            break
        elif state == 0:
            returns = 0.0
            break

    if not batch:
        for state_ in trajectory[:-1]:
            # MC update
            values[state_] += alpha * (returns - values[state_])
    return trajectory, [returns] * (len(trajectory) - 1)


# Example 6.2 left
def compute_state_value():
    episodes = [0, 1, 10, 100]
    current_values = np.copy(VALUES)
    plt.figure(1)
    for i in range(episodes[-1] + 1):
        if i in episodes:
            plt.plot(("A", "B", "C", "D", "E"), current_values[1:6], label=str(i) + ' episodes')
        temporal_difference(current_values)
    plt.plot(("A", "B", "C", "D", "E"), TRUE_VALUE[1:6], label='true values')
    plt.xlabel('State')
    plt.ylabel('Estimated Value')
    plt.legend()


# Example 6.2 right
def rms_error():
    # Same alpha value can appear in both arrays
    td_alphas = [0.15, 0.1, 0.05]
    mc_alphas = [0.01, 0.02, 0.03, 0.04]
    episodes = 100 + 1
    runs = 100
    for i, alpha in enumerate(td_alphas + mc_alphas):
        total_errors = np.zeros(episodes)
        if i < len(td_alphas):
            method = 'TD'
            linestyle = 'solid'
        else:
            method = 'MC'
            linestyle = 'dashdot'
        for r in tqdm(range(runs)):
            errors = []
            current_values = np.copy(VALUES)
            for i in range(0, episodes):
                errors.append(np.sqrt(np.sum(np.power(TRUE_VALUE - current_values, 2)) / 5.0))
                if method == 'TD':
                    temporal_difference(current_values, alpha=alpha)
                else:
                    monte_carlo(current_values, alpha=alpha)
            total_errors += np.asarray(errors)
        total_errors /= runs
        plt.plot(total_errors, linestyle=linestyle, label=method + ', $\\alpha$ = %.02f' % (alpha))
    plt.xlabel('Walks/Episodes')
    plt.ylabel('Empirical RMS error, averaged over states')
    plt.legend()


# Figure 6.2
# @method: 'TD' or 'MC'
def batch_updating(method, episodes, alpha=0.001):
    # perform 100 independent runs
    runs = 100
    total_errors = np.zeros(episodes)
    for r in tqdm(range(0, runs)):
        current_values = np.copy(VALUES)
        current_values[1:6] = -1
        errors = []
        # track shown trajectories and reward/return sequences
        trajectories = []
        rewards = []
        for ep in range(episodes):
            if method == 'TD':
                trajectory_, rewards_ = temporal_difference(current_values, batch=True)
            else:
                trajectory_, rewards_ = monte_carlo(current_values, batch=True)
            trajectories.append(trajectory_)
            rewards.append(rewards_)
            while True:
                # keep feeding our algorithm with trajectories seen so far until state value function converges
                updates = np.zeros(7)
                for trajectory_, rewards_ in zip(trajectories, rewards):
                    for i in range(0, len(trajectory_) - 1):
                        if method == 'TD':
                            updates[trajectory_[i]] += rewards_[i] + current_values[trajectory_[i + 1]] - current_values[trajectory_[i]]
                        else:
                            updates[trajectory_[i]] += rewards_[i] - current_values[trajectory_[i]]
                updates *= alpha
                if np.sum(np.abs(updates)) < 1e-3:
                    break
                # perform batch updating
                current_values += updates
            # calculate rms error
            errors.append(np.sqrt(np.sum(np.power(current_values - TRUE_VALUE, 2)) / 5.0))
        total_errors += np.asarray(errors)
    total_errors /= runs
    return total_errors


def example_6_2():
    plt.figure(figsize=(10, 20))
    plt.subplot(2, 1, 1)
    compute_state_value()

    plt.subplot(2, 1, 2)
    rms_error()
    plt.tight_layout()

    plt.savefig('../images/example_6_2.png')
    plt.close()


def figure_6_2():
    episodes = 100 + 1
    td_errors = batch_updating('TD', episodes)
    mc_errors = batch_updating('MC', episodes)

    plt.plot(td_errors, label='TD')
    plt.plot(mc_errors, label='MC')
    plt.title("Batch Training")
    plt.xlabel('Walks/Episodes')
    plt.ylabel('RMS error, averaged over states')
    plt.xlim(0, 100)
    plt.ylim(0, 0.25)
    plt.legend()

    plt.savefig('../images/figure_6_2.png')
    plt.close()


if __name__ == '__main__':
    example_6_2()
    figure_6_2()


================================================
FILE: chapter06/windy_grid_world.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# world height
WORLD_HEIGHT = 7

# world width
WORLD_WIDTH = 10

# wind strength for each column
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]

# possible actions
ACTION_UP = 0
ACTION_DOWN = 1
ACTION_LEFT = 2
ACTION_RIGHT = 3

# probability for exploration
EPSILON = 0.1

# Sarsa step size
ALPHA = 0.5

# reward for each step
REWARD = -1.0

START = [3, 0]
GOAL = [3, 7]
ACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]


def step(state, action):
    i, j = state
    if action == ACTION_UP:
        return [max(i - 1 - WIND[j], 0), j]
    elif action == ACTION_DOWN:
        return [max(min(i + 1 - WIND[j], WORLD_HEIGHT - 1), 0), j]
    elif action == ACTION_LEFT:
        return [max(i - WIND[j], 0), max(j - 1, 0)]
    elif action == ACTION_RIGHT:
        return [max(i - WIND[j], 0), min(j + 1, WORLD_WIDTH - 1)]
    else:
        assert False


# play for an episode
def episode(q_value):
    # track the total time steps in this episode
    time = 0

    # initialize state
    state = START

    # choose an action based on epsilon-greedy algorithm
    if np.random.binomial(1, EPSILON) == 1:
        action = np.random.choice(ACTIONS)
    else:
        values_ = q_value[state[0], state[1], :]
        action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])

    # keep going until get to the goal state
    while state != GOAL:
        next_state = step(state, action)
        if np.random.binomial(1, EPSILON) == 1:
            next_action = np.random.choice(ACTIONS)
        else:
            values_ = q_value[next_state[0], next_state[1], :]
            next_action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])

        # Sarsa update
        q_value[state[0], state[1], action] += \
            ALPHA * (REWARD + q_value[next_state[0], next_state[1], next_action] -
                     q_value[state[0], state[1], action])
        state = next_state
        action = next_action
        time += 1
    return time


def figure_6_3():
    q_value = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))
    episode_limit = 500

    steps = []
    ep = 0
    while ep < episode_limit:
        steps.append(episode(q_value))
        # time = episode(q_value)
        # episodes.extend([ep] * time)
        ep += 1

    steps = np.add.accumulate(steps)

    plt.plot(steps, np.arange(1, len(steps) + 1))
    plt.xlabel('Time steps')
    plt.ylabel('Episodes')

    plt.savefig('../images/figure_6_3.png')
    plt.close()

    # display the optimal policy
    optimal_policy = []
    for i in range(0, WORLD_HEIGHT):
        optimal_policy.append([])
        for j in range(0, WORLD_WIDTH):
            if [i, j] == GOAL:
                optimal_policy[-1].append('G')
                continue
            bestAction = np.argmax(q_value[i, j, :])
            if bestAction == ACTION_UP:
                optimal_policy[-1].append('U')
            elif bestAction == ACTION_DOWN:
                optimal_policy[-1].append('D')
            elif bestAction == ACTION_LEFT:
                optimal_policy[-1].append('L')
            elif bestAction == ACTION_RIGHT:
                optimal_policy[-1].append('R')
    print('Optimal policy is:')
    for row in optimal_policy:
        print(row)
    print('Wind strength for each column:\n{}'.format([str(w) for w in WIND]))


if __name__ == '__main__':
    figure_6_3()


================================================
FILE: chapter07/random_walk.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm

# all states
N_STATES = 19

# discount
GAMMA = 1

# all states but terminal states
STATES = np.arange(1, N_STATES + 1)

# start from the middle state
START_STATE = 10

# two terminal states
# an action leading to the left terminal state has reward -1
# an action leading to the right terminal state has reward 1
END_STATES = [0, N_STATES + 1]

# true state value from the Bellman equation
TRUE_VALUE = np.arange(-20, 22, 2) / 20.0
TRUE_VALUE[0] = TRUE_VALUE[-1] = 0


# n-step TD method
# @value: values for each state, will be updated
# @n: # of steps
# @alpha: step size
def temporal_difference(value, n, alpha):
    # initial starting state
    state = START_STATE

    # arrays to store states and rewards for an episode
    # space isn't a major consideration, so I didn't use the mod trick
    states = [state]
    rewards = [0]

    # track the time
    time = 0

    # the length of this episode
    T = float('inf')
    while True:
        # go to next time step
        time += 1

        if time < T:
            # choose an action randomly
            if np.random.binomial(1, 0.5) == 1:
                next_state = state + 1
            else:
                next_state = state - 1

            if next_state == 0:
                reward = -1
            elif next_state == 20:
                reward = 1
            else:
                reward = 0

            # store new state and new reward
            states.append(next_state)
            rewards.append(reward)

            if next_state in END_STATES:
                T = time

        # get the time of the state to update
        update_time = time - n
        if update_time >= 0:
            returns = 0.0
            # calculate corresponding rewards
            for t in range(update_time + 1, min(T, update_time + n) + 1):
                returns += pow(GAMMA, t - update_time - 1) * rewards[t]
            # add state value to the return
            if update_time + n <= T:
                returns += pow(GAMMA, n) * value[states[(update_time + n)]]
            state_to_update = states[update_time]
            # update the state value
            if not state_to_update in END_STATES:
                value[state_to_update] += alpha * (returns - value[state_to_update])
        if update_time == T - 1:
            break
        state = next_state


# Figure 7.2, it will take quite a while
def figure7_2():
    # all possible steps
    steps = np.power(2, np.arange(0, 10))

    # all possible alphas
    alphas = np.arange(0, 1.1, 0.1)

    # each run has 10 episodes
    episodes = 10

    # perform 100 independent runs
    runs = 100

    # track the errors for each (step, alpha) combination
    errors = np.zeros((len(steps), len(alphas)))
    for run in tqdm(range(0, runs)):
        for step_ind, step in enumerate(steps):
            for alpha_ind, alpha in enumerate(alphas):
                # print('run:', run, 'step:', step, 'alpha:', alpha)
                value = np.zeros(N_STATES + 2)
                for ep in range(0, episodes):
                    temporal_difference(value, step, alpha)
                    # calculate the RMS error
                    errors[step_ind, alpha_ind] += np.sqrt(np.sum(np.power(value - TRUE_VALUE, 2)) / N_STATES)
    # take average
    errors /= episodes * runs

    for i in range(0, len(steps)):
        plt.plot(alphas, errors[i, :], label='n = %d' % (steps[i]))
    plt.xlabel('alpha')
    plt.ylabel('RMS error')
    plt.ylim([0.25, 0.55])
    plt.legend()

    plt.savefig('../images/figure_7_2.png')
    plt.close()


if __name__ == '__main__':
    figure7_2()


================================================
FILE: chapter08/expectation_vs_sample.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)                  #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm


# for figure 8.7, run a simulation of 2 * @b steps
def b_steps(b):
    # set the value of the next b states
    # it is not clear how to set this
    distribution = np.random.randn(b)

    # true value of the current state
    true_v = np.mean(distribution)

    samples = []
    errors = []

    # sample 2b steps
    for t in range(2 * b):
        v = np.random.choice(distribution)
        samples.append(v)
        errors.append(np.abs(np.mean(samples) - true_v))

    return errors


def figure_8_7():
    runs = 100
    branch = [2, 10, 100, 1000]
    for b in branch:
        errors = np.zeros((runs, 2 * b))
        for r in tqdm(np.arange(runs)):
            errors[r] = b_steps(b)
        errors = errors.mean(axis=0)
        x_axis = (np.arange(len(errors)) + 1) / float(b)
        plt.plot(x_axis, errors, label='b = %d' % (b))

    plt.xlabel('number of computations')
    plt.xticks([0, 1.0, 2.0], ['0', 'b', '2b'])
    plt.ylabel('RMS error')
    plt.legend()

    plt.savefig('../images/figure_8_7.png')
    plt.close()


if __name__ == '__main__':
    figure_8_7()


================================================
FILE: chapter08/maze.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
import heapq
from copy import deepcopy


class PriorityQueue:
    def __init__(self):
        self.pq = []
        self.entry_finder = {}
        self.REMOVED = ''
        self.counter = 0

    def add_item(self, item, priority=0):
        if item in self.entry_finder:
            self.remove_item(item)
        entry = [priority, self.counter, item]
        self.counter += 1
        self.entry_finder[item] = entry
        heapq.heappush(self.pq, entry)

    def remove_item(self, item):
        # mark the entry as removed instead of searching the heap for it
        entry = self.entry_finder.pop(item)
        entry[-1] = self.REMOVED

    def pop_item(self):
        # skip entries that were marked as removed
        while self.pq:
            priority, count, item = heapq.heappop(self.pq)
            if item is not self.REMOVED:
                del self.entry_finder[item]
                return item, priority
        raise KeyError('pop from an empty priority queue')

    def empty(self):
        return not self.entry_finder


# A wrapper class for a maze, containing all the information about the maze.
# Basically it's initialized to DynaMaze by default, however it can be easily adapted
# to other mazes
class Maze:
    def __init__(self):
        # maze width
        self.WORLD_WIDTH = 9

        # maze height
        self.WORLD_HEIGHT = 6

        # all possible actions
        self.ACTION_UP = 0
        self.ACTION_DOWN = 1
        self.ACTION_LEFT = 2
        self.ACTION_RIGHT = 3
        self.actions = [self.ACTION_UP, self.ACTION_DOWN, self.ACTION_LEFT, self.ACTION_RIGHT]

        # start state
        self.START_STATE = [2, 0]

        # goal state
        self.GOAL_STATES = [[0, 8]]

        # all obstacles
        self.obstacles = [[1, 2], [2, 2], [3, 2], [0, 7], [1, 7], [2, 7], [4, 5]]
        self.old_obstacles = None
        self.new_obstacles = None

        # time to change obstacles
        self.obstacle_switch_time = None

        # initial state action pair values
        # self.stateActionValues = np.zeros((self.WORLD_HEIGHT, self.WORLD_WIDTH, len(self.actions)))

        # the size of q value
        self.q_size = (self.WORLD_HEIGHT, self.WORLD_WIDTH, len(self.actions))

        # max steps
        self.max_steps = float('inf')

        # track the resolution for this maze
        self.resolution = 1

    # extend a state to a higher resolution maze
    # @state: state in lower resolution maze
    # @factor: extension factor, one state will become factor^2 states after extension
    def extend_state(self, state, factor):
        new_state = [state[0] * factor, state[1] * factor]
        new_states = []
        for i in range(0, factor):
            for j in range(0, factor):
                new_states.append([new_state[0] + i, new_state[1] + j])
        return new_states

    # extend the maze to a higher resolution
    # one state in original maze will become @factor^2 states in @return new maze
    def extend_maze(self, factor):
        new_maze = Maze()
        new_maze.WORLD_WIDTH = self.WORLD_WIDTH * factor
        new_maze.WORLD_HEIGHT = self.WORLD_HEIGHT * factor
        new_maze.START_STATE = [self.START_STATE[0] * factor, self.START_STATE[1] * factor]
        new_maze.GOAL_STATES = self.extend_state(self.GOAL_STATES[0], factor)
        new_maze.obstacles = []
        for state in self.obstacles:
            new_maze.obstacles.extend(self.extend_state(state, factor))
        new_maze.q_size = (new_maze.WORLD_HEIGHT, new_maze.WORLD_WIDTH, len(new_maze.actions))
        # new_maze.stateActionValues = np.zeros((new_maze.WORLD_HEIGHT, new_maze.WORLD_WIDTH, len(new_maze.actions)))
        new_maze.resolution = factor
        return new_maze

    # take @action in @state
    # @return: [new state, reward]
    def step(self, state, action):
        x, y = state
        if action == self.ACTION_UP:
            x = max(x - 1, 0)
        elif action == self.ACTION_DOWN:
            x = min(x + 1, self.WORLD_HEIGHT - 1)
        elif action == self.ACTION_LEFT:
            y = max(y - 1, 0)
        elif action == self.ACTION_RIGHT:
            y = min(y + 1, self.WORLD_WIDTH - 1)
        # moving into an obstacle leaves the state unchanged
        if [x, y] in self.obstacles:
            x, y = state
        if [x, y] in self.GOAL_STATES:
            reward = 1.0
        else:
            reward = 0.0
        return [x, y], reward


# a wrapper class for parameters of dyna algorithms
class DynaParams:
    def __init__(self):
        # discount
        self.gamma = 0.95

        # probability for exploration
        self.epsilon = 0.1

        # step size
        self.alpha = 0.1

        # weight for elapsed time
        self.time_weight = 0

        # n-step planning
        self.planning_steps = 5

        # average over several independent runs
        self.runs = 10

        # algorithm names
        self.methods = ['Dyna-Q', 'Dyna-Q+']

        # threshold for priority queue
        self.theta = 0


# choose an action based on epsilon-greedy algorithm
def choose_action(state, q_value, maze, dyna_params):
    if np.random.binomial(1, dyna_params.epsilon) == 1:
        return np.random.choice(maze.actions)
    else:
        values = q_value[state[0], state[1], :]
        return np.random.choice([action for action, value in enumerate(values) if value == np.max(values)])


# Trivial model for planning in Dyna-Q
class TrivialModel:
    # @rand: an instance of np.random.RandomState for sampling
    def __init__(self, rand=np.random):
        self.model = dict()
        self.rand = rand

    # feed the model with previous experience
    def feed(self, state, action, next_state, reward):
        state = deepcopy(state)
        next_state = deepcopy(next_state)
        if tuple(state) not in self.model.keys():
            self.model[tuple(state)] = dict()
        self.model[tuple(state)][action] = [list(next_state), reward]

    # randomly sample from previous experience
    def sample(self):
        state_index = self.rand.choice(range(len(self.model.keys())))
        state = list(self.model)[state_index]
        action_index = self.rand.choice(range(len(self.model[state].keys())))
        action = list(self.model[state])[action_index]
        next_state, reward = self.model[state][action]
        state = deepcopy(state)
        next_state = deepcopy(next_state)
        return list(state), action, list(next_state), reward


# Time-based model for planning in Dyna-Q+
class TimeModel:
    # @maze: the maze instance. Indeed it's not very reasonable to give access to maze to the model.
    # @time_weight: also called kappa, the weight for elapsed time in sampling reward, it needs to be small
    # @rand: an instance of np.random.RandomState for sampling
    def __init__(self, maze, time_weight=1e-4, rand=np.random):
        self.rand = rand
        self.model = dict()

        # track the total time
        self.time = 0

        self.time_weight = time_weight
        self.maze = maze

    # feed the model with previous experience
    def feed(self, state, action, next_state, reward):
        state = deepcopy(state)
        next_state = deepcopy(next_state)
        self.time += 1
        if tuple(state) not in self.model.keys():
            self.model[tuple(state)] = dict()

            # Actions that had never been tried before from a state were allowed to be considered in the planning step
            for action_ in self.maze.actions:
                if action_ != action:
                    # Such actions would lead back to the same state with a reward of zero
                    # Notice that the minimum time stamp is 1 instead of 0
                    self.model[tuple(state)][action_] = [list(state), 0, 1]

        self.model[tuple(state)][action] = [list(next_state), reward, self.time]

    # randomly sample from previous experience
    def sample(self):
        state_index = self.rand.choice(range(len(self.model.keys())))
        state = list(self.model)[state_index]
        action_index = self.rand.choice(range(len(self.model[state].keys())))
        action = list(self.model[state])[action_index]
        next_state, reward, time = self.model[state][action]

        # adjust reward with elapsed time since last visit
        reward += self.time_weight * np.sqrt(self.time - time)

        state = deepcopy(state)
next_state = deepcopy(next_state) return list(state), action, list(next_state), reward # Model containing a priority queue for Prioritized Sweeping class PriorityModel(TrivialModel): def __init__(self, rand=np.random): TrivialModel.__init__(self, rand) # maintain a priority queue self.priority_queue = PriorityQueue() # track predecessors for every state self.predecessors = dict() # add a @state-@action pair into the priority queue with priority @priority def insert(self, priority, state, action): # note the priority queue is a minimum heap, so we use -priority self.priority_queue.add_item((tuple(state), action), -priority) # @return: whether the priority queue is empty def empty(self): return self.priority_queue.empty() # get the first item in the priority queue def sample(self): (state, action), priority = self.priority_queue.pop_item() next_state, reward = self.model[state][action] state = deepcopy(state) next_state = deepcopy(next_state) return -priority, list(state), action, list(next_state), reward # feed the model with previous experience def feed(self, state, action, next_state, reward): state = deepcopy(state) next_state = deepcopy(next_state) TrivialModel.feed(self, state, action, next_state, reward) if tuple(next_state) not in self.predecessors.keys(): self.predecessors[tuple(next_state)] = set() self.predecessors[tuple(next_state)].add((tuple(state), action)) # get all seen predecessors of a state @state def predecessor(self, state): if tuple(state) not in self.predecessors.keys(): return [] predecessors = [] for state_pre, action_pre in list(self.predecessors[tuple(state)]): predecessors.append([list(state_pre), action_pre, self.model[state_pre][action_pre][1]]) return predecessors # play for an episode for Dyna-Q algorithm # @q_value: state action pair values, will be updated # @model: model instance for planning # @maze: a maze instance containing all information about the environment # @dyna_params: several params for the algorithm def 
dyna_q(q_value, model, maze, dyna_params): state = maze.START_STATE steps = 0 while state not in maze.GOAL_STATES: # track the steps steps += 1 # get action action = choose_action(state, q_value, maze, dyna_params) # take action next_state, reward = maze.step(state, action) # Q-Learning update q_value[state[0], state[1], action] += \ dyna_params.alpha * (reward + dyna_params.gamma * np.max(q_value[next_state[0], next_state[1], :]) - q_value[state[0], state[1], action]) # feed the model with experience model.feed(state, action, next_state, reward) # sample experience from the model for t in range(0, dyna_params.planning_steps): state_, action_, next_state_, reward_ = model.sample() q_value[state_[0], state_[1], action_] += \ dyna_params.alpha * (reward_ + dyna_params.gamma * np.max(q_value[next_state_[0], next_state_[1], :]) - q_value[state_[0], state_[1], action_]) state = next_state # check whether it has exceeded the step limit if steps > maze.max_steps: break return steps # play for an episode for prioritized sweeping algorithm # @q_value: state action pair values, will be updated # @model: model instance for planning # @maze: a maze instance containing all information about the environment # @dyna_params: several params for the algorithm # @return: # of backups during this episode def prioritized_sweeping(q_value, model, maze, dyna_params): state = maze.START_STATE # track the steps in this episode steps = 0 # track the backups in planning phase backups = 0 while state not in maze.GOAL_STATES: steps += 1 # get action action = choose_action(state, q_value, maze, dyna_params) # take action next_state, reward = maze.step(state, action) # feed the model with experience model.feed(state, action, next_state, reward) # get the priority for current state action pair priority = np.abs(reward + dyna_params.gamma * np.max(q_value[next_state[0], next_state[1], :]) - q_value[state[0], state[1], action]) if priority > dyna_params.theta: model.insert(priority, state, action) 
# start planning planning_step = 0 # planning for several steps, # although keep planning until the priority queue becomes empty will converge much faster while planning_step < dyna_params.planning_steps and not model.empty(): # get a sample with highest priority from the model priority, state_, action_, next_state_, reward_ = model.sample() # update the state action value for the sample delta = reward_ + dyna_params.gamma * np.max(q_value[next_state_[0], next_state_[1], :]) - \ q_value[state_[0], state_[1], action_] q_value[state_[0], state_[1], action_] += dyna_params.alpha * delta # deal with all the predecessors of the sample state for state_pre, action_pre, reward_pre in model.predecessor(state_): priority = np.abs(reward_pre + dyna_params.gamma * np.max(q_value[state_[0], state_[1], :]) - q_value[state_pre[0], state_pre[1], action_pre]) if priority > dyna_params.theta: model.insert(priority, state_pre, action_pre) planning_step += 1 state = next_state # update the # of backups backups += planning_step + 1 return backups # Figure 8.2, DynaMaze, use 10 runs instead of 30 runs def figure_8_2(): # set up an instance for DynaMaze dyna_maze = Maze() dyna_params = DynaParams() runs = 10 episodes = 50 planning_steps = [0, 5, 50] steps = np.zeros((len(planning_steps), episodes)) for run in tqdm(range(runs)): for i, planning_step in enumerate(planning_steps): dyna_params.planning_steps = planning_step q_value = np.zeros(dyna_maze.q_size) # generate an instance of Dyna-Q model model = TrivialModel() for ep in range(episodes): # print('run:', run, 'planning step:', planning_step, 'episode:', ep) steps[i, ep] += dyna_q(q_value, model, dyna_maze, dyna_params) # averaging over runs steps /= runs for i in range(len(planning_steps)): plt.plot(steps[i, :], label='%d planning steps' % (planning_steps[i])) plt.xlabel('episodes') plt.ylabel('steps per episode') plt.legend() plt.savefig('../images/figure_8_2.png') plt.close() # wrapper function for changing maze # @maze: a maze 
instance # @dyna_params: several parameters for dyna algorithms def changing_maze(maze, dyna_params): # set up max steps max_steps = maze.max_steps # track the cumulative rewards rewards = np.zeros((dyna_params.runs, 2, max_steps)) for run in tqdm(range(dyna_params.runs)): # set up models models = [TrivialModel(), TimeModel(maze, time_weight=dyna_params.time_weight)] # initialize state action values q_values = [np.zeros(maze.q_size), np.zeros(maze.q_size)] for i in range(len(dyna_params.methods)): # print('run:', run, dyna_params.methods[i]) # set old obstacles for the maze maze.obstacles = maze.old_obstacles steps = 0 last_steps = steps while steps < max_steps: # play for an episode steps += dyna_q(q_values[i], models[i], maze, dyna_params) # update cumulative rewards rewards[run, i, last_steps: steps] = rewards[run, i, last_steps] rewards[run, i, min(steps, max_steps - 1)] = rewards[run, i, last_steps] + 1 last_steps = steps if steps > maze.obstacle_switch_time: # change the obstacles maze.obstacles = maze.new_obstacles # averaging over runs rewards = rewards.mean(axis=0) return rewards # Figure 8.4, BlockingMaze def figure_8_4(): # set up a blocking maze instance blocking_maze = Maze() blocking_maze.START_STATE = [5, 3] blocking_maze.GOAL_STATES = [[0, 8]] blocking_maze.old_obstacles = [[3, i] for i in range(0, 8)] # new obstacles will block the optimal path blocking_maze.new_obstacles = [[3, i] for i in range(1, 9)] # step limit blocking_maze.max_steps = 3000 # obstacles will change after 1000 steps # the exact step for changing will be different # However given that 1000 steps is long enough for both algorithms to converge, # the difference is guaranteed to be very small blocking_maze.obstacle_switch_time = 1000 # set up parameters dyna_params = DynaParams() dyna_params.alpha = 1.0 dyna_params.planning_steps = 10 dyna_params.runs = 20 # kappa must be small, as the reward for reaching the goal is only 1 dyna_params.time_weight = 1e-4 # play rewards =
changing_maze(blocking_maze, dyna_params) for i in range(len(dyna_params.methods)): plt.plot(rewards[i, :], label=dyna_params.methods[i]) plt.xlabel('time steps') plt.ylabel('cumulative reward') plt.legend() plt.savefig('../images/figure_8_4.png') plt.close() # Figure 8.5, ShortcutMaze def figure_8_5(): # set up a shortcut maze instance shortcut_maze = Maze() shortcut_maze.START_STATE = [5, 3] shortcut_maze.GOAL_STATES = [[0, 8]] shortcut_maze.old_obstacles = [[3, i] for i in range(1, 9)] # new obstacles will open a shorter path shortcut_maze.new_obstacles = [[3, i] for i in range(1, 8)] # step limit shortcut_maze.max_steps = 6000 # obstacles will change after 3000 steps # the exact step for changing will be different # However given that 3000 steps is long enough for both algorithms to converge, # the difference is guaranteed to be very small shortcut_maze.obstacle_switch_time = 3000 # set up parameters dyna_params = DynaParams() # 50-step planning dyna_params.planning_steps = 50 dyna_params.runs = 5 dyna_params.time_weight = 1e-3 dyna_params.alpha = 1.0 # play rewards = changing_maze(shortcut_maze, dyna_params) for i in range(len(dyna_params.methods)): plt.plot(rewards[i, :], label=dyna_params.methods[i]) plt.xlabel('time steps') plt.ylabel('cumulative reward') plt.legend() plt.savefig('../images/figure_8_5.png') plt.close() # Check whether state-action values are already optimal def check_path(q_values, maze): # get the length of optimal path # 14 is the length of optimal path of the original maze # 1.2 means it's a relaxed optimal path max_steps = 14 * maze.resolution * 1.2 state = maze.START_STATE steps = 0 while state not in maze.GOAL_STATES: action = np.argmax(q_values[state[0], state[1], :]) state, _ = maze.step(state, action) steps += 1 if steps > max_steps: return False return True # Example 8.4, mazes with different resolution def example_8_4(): # get the original 6 * 9 maze original_maze = Maze() # set up the parameters for each algorithm params_dyna
= DynaParams() params_dyna.planning_steps = 5 params_dyna.alpha = 0.5 params_dyna.gamma = 0.95 params_prioritized = DynaParams() params_prioritized.theta = 0.0001 params_prioritized.planning_steps = 5 params_prioritized.alpha = 0.5 params_prioritized.gamma = 0.95 params = [params_prioritized, params_dyna] # set up models for planning models = [PriorityModel, TrivialModel] method_names = ['Prioritized Sweeping', 'Dyna-Q'] # due to limitation of my machine, I can only perform experiments for 5 mazes # assuming the 1st maze has w * h states, then k-th maze has w * h * k * k states num_of_mazes = 5 # build all the mazes mazes = [original_maze.extend_maze(i) for i in range(1, num_of_mazes + 1)] methods = [prioritized_sweeping, dyna_q] # My machine cannot afford too many runs... runs = 5 # track the # of backups backups = np.zeros((runs, 2, num_of_mazes)) for run in range(0, runs): for i in range(0, len(method_names)): for mazeIndex, maze in zip(range(0, len(mazes)), mazes): print('run %d, %s, maze size %d' % (run, method_names[i], maze.WORLD_HEIGHT * maze.WORLD_WIDTH)) # initialize the state action values q_value = np.zeros(maze.q_size) # track steps / backups for each episode steps = [] # generate the model model = models[i]() # play for an episode while True: steps.append(methods[i](q_value, model, maze, params[i])) # print best actions w.r.t. 
current state-action values # printActions(currentStateActionValues, maze) # check whether the (relaxed) optimal path is found if check_path(q_value, maze): break # update the total steps / backups for this maze backups[run, i, mazeIndex] = np.sum(steps) backups = backups.mean(axis=0) # Dyna-Q performs several backups per step backups[1, :] *= params_dyna.planning_steps + 1 for i in range(0, len(method_names)): plt.plot(np.arange(1, num_of_mazes + 1), backups[i, :], label=method_names[i]) plt.xlabel('maze resolution factor') plt.ylabel('backups until optimal solution') plt.yscale('log') plt.legend() plt.savefig('../images/example_8_4.png') plt.close() if __name__ == '__main__': figure_8_2() figure_8_4() figure_8_5() example_8_4() ================================================ FILE: chapter08/trajectory_sampling.py ================================================ ####################################################################### # Copyright (C) # # 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) # # Permission given to modify the code as long as you keep this # # declaration at the top # ####################################################################### import numpy as np import matplotlib import matplotlib.pyplot as plt from tqdm import tqdm matplotlib.use('Agg') # 2 actions ACTIONS = [0, 1] # each transition has a probability to terminate with 0 TERMINATION_PROB = 0.1 # maximum expected updates MAX_STEPS = 20000 # epsilon greedy for behavior policy EPSILON = 0.1 # break tie randomly def argmax(value): max_q = np.max(value) return np.random.choice([a for a, q in enumerate(value) if q == max_q]) class Task: # @n_states: number of non-terminal states # @b: branch # Each episode starts with state 0, and state n_states is a terminal state def __init__(self, n_states, b): self.n_states = n_states self.b = b # transition matrix, each state-action pair leads to b possible states self.transition = np.random.randint(n_states, size=(n_states, len(ACTIONS), b)) 
# it is not clear how to set the reward, I use a unit normal distribution here # reward is determined by (s, a, s') self.reward = np.random.randn(n_states, len(ACTIONS), b) def step(self, state, action): if np.random.rand() < TERMINATION_PROB: return self.n_states, 0 next_ = np.random.randint(self.b) return self.transition[state, action, next_], self.reward[state, action, next_] # Evaluate the value of the start state for the greedy policy # derived from @q under the MDP @task def evaluate_pi(q, task): # use Monte Carlo method to estimate the state value runs = 1000 returns = [] for r in range(runs): rewards = 0 state = 0 while state < task.n_states: action = argmax(q[state]) state, r = task.step(state, action) rewards += r returns.append(rewards) return np.mean(returns) # perform expected update from a uniform state-action distribution of the MDP @task # evaluate the learned q value every @eval_interval steps def uniform(task, eval_interval): performance = [] q = np.zeros((task.n_states, 2)) for step in tqdm(range(MAX_STEPS)): state = step // len(ACTIONS) % task.n_states action = step % len(ACTIONS) next_states = task.transition[state, action] q[state, action] = (1 - TERMINATION_PROB) * np.mean( task.reward[state, action] + np.max(q[next_states, :], axis=1)) if step % eval_interval == 0: v_pi = evaluate_pi(q, task) performance.append([step, v_pi]) return zip(*performance) # perform expected update from an on-policy distribution of the MDP @task # evaluate the learned q value every @eval_interval steps def on_policy(task, eval_interval): performance = [] q = np.zeros((task.n_states, 2)) state = 0 for step in tqdm(range(MAX_STEPS)): if np.random.rand() < EPSILON: action = np.random.choice(ACTIONS) else: action = argmax(q[state]) next_state, _ = task.step(state, action) next_states = task.transition[state, action] q[state, action] = (1 - TERMINATION_PROB) * np.mean( task.reward[state, action] + np.max(q[next_states, :], axis=1)) if next_state == task.n_states: 
next_state = 0 state = next_state if step % eval_interval == 0: v_pi = evaluate_pi(q, task) performance.append([step, v_pi]) return zip(*performance) def figure_8_8(): num_states = [1000, 10000] branch = [1, 3, 10] methods = [on_policy, uniform] # average across 30 tasks n_tasks = 30 # number of evaluation points x_ticks = 100 plt.figure(figsize=(10, 20)) for i, n in enumerate(num_states): plt.subplot(2, 1, i+1) for b in branch: tasks = [Task(n, b) for _ in range(n_tasks)] for method in methods: steps = None value = [] for task in tasks: steps, v = method(task, MAX_STEPS / x_ticks) value.append(v) value = np.mean(np.asarray(value), axis=0) plt.plot(steps, value, label=f'b = {b}, {method.__name__}') plt.title(f'{n} states') plt.ylabel('value of start state') plt.legend() plt.subplot(2, 1, 2) plt.xlabel('computation time, in expected updates') plt.savefig('../images/figure_8_8.png') plt.close() if __name__ == '__main__': figure_8_8() ================================================ FILE: chapter09/random_walk.py ================================================ ####################################################################### # Copyright (C) # # 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) # # 2016 Kenta Shimada(hyperkentakun@gmail.com) # # Permission given to modify the code as long as you keep this # # declaration at the top # ####################################################################### import numpy as np import matplotlib matplotlib.use('Agg') import matplotlib.pyplot as plt from tqdm import tqdm # # of states except for terminal states N_STATES = 1000 # all states STATES = np.arange(1, N_STATES + 1) # start from a central state START_STATE = 500 # terminal states END_STATES = [0, N_STATES + 1] # possible actions ACTION_LEFT = -1 ACTION_RIGHT = 1 ACTIONS = [ACTION_LEFT, ACTION_RIGHT] # maximum stride for an action STEP_RANGE = 100 def compute_true_value(): # true state value, just a promising guess true_value = np.arange(-1001, 1003, 2) 
/ 1001.0 # Dynamic programming to find the true state values, based on the promising guess above # Assume all rewards are 0, given that we have already given value -1 and 1 to terminal states while True: old_value = np.copy(true_value) for state in STATES: true_value[state] = 0 for action in ACTIONS: for step in range(1, STEP_RANGE + 1): step *= action next_state = state + step next_state = max(min(next_state, N_STATES + 1), 0) # asynchronous update for faster convergence true_value[state] += 1.0 / (2 * STEP_RANGE) * true_value[next_state] error = np.sum(np.abs(old_value - true_value)) if error < 1e-2: break # correct the state value for terminal states to 0 true_value[0] = true_value[-1] = 0 return true_value # take an @action at @state, return new state and reward for this transition def step(state, action): step = np.random.randint(1, STEP_RANGE + 1) step *= action state += step state = max(min(state, N_STATES + 1), 0) if state == 0: reward = -1 elif state == N_STATES + 1: reward = 1 else: reward = 0 return state, reward # get an action, following random policy def get_action(): if np.random.binomial(1, 0.5) == 1: return 1 return -1 # a wrapper class for aggregation value function class ValueFunction: # @num_of_groups: # of aggregations def __init__(self, num_of_groups): self.num_of_groups = num_of_groups self.group_size = N_STATES // num_of_groups # thetas self.params = np.zeros(num_of_groups) # get the value of @state def value(self, state): if state in END_STATES: return 0 group_index = (state - 1) // self.group_size return self.params[group_index] # update parameters # @delta: step size * (target - old estimation) # @state: state of current sample def update(self, delta, state): group_index = (state - 1) // self.group_size self.params[group_index] += delta # a wrapper class for tile coding value function class TilingsValueFunction: # @num_of_tilings: # of tilings # @tileWidth: each tiling has several tiles, this parameter specifies the width of each tile # 
@tilingOffset: specifies how tilings are put together def __init__(self, numOfTilings, tileWidth, tilingOffset): self.numOfTilings = numOfTilings self.tileWidth = tileWidth self.tilingOffset = tilingOffset # To make sure that each state is covered by the same number of tiles, # we need one more tile for each tiling self.tilingSize = N_STATES // tileWidth + 1 # weight for each tile self.params = np.zeros((self.numOfTilings, self.tilingSize)) # For performance, only track the starting position for each tiling # As we have one more tile for each tiling, the starting position will be negative self.tilings = np.arange(-tileWidth + 1, 0, tilingOffset) # get the value of @state def value(self, state): stateValue = 0.0 # go through all the tilings for tilingIndex in range(0, len(self.tilings)): # find the active tile in current tiling tileIndex = (state - self.tilings[tilingIndex]) // self.tileWidth stateValue += self.params[tilingIndex, tileIndex] return stateValue # update parameters # @delta: step size * (target - old estimation) # @state: state of current sample def update(self, delta, state): # each state is covered by the same number of tilings # so the delta should be divided equally into each tiling (tile) delta /= self.numOfTilings # go through all the tilings for tilingIndex in range(0, len(self.tilings)): # find the active tile in current tiling tileIndex = (state - self.tilings[tilingIndex]) // self.tileWidth self.params[tilingIndex, tileIndex] += delta # a wrapper class for polynomial / Fourier-based value function POLYNOMIAL_BASES = 0 FOURIER_BASES = 1 class BasesValueFunction: # @order: # of bases, each function also has one more constant parameter (called bias in machine learning) # @type: polynomial bases or Fourier bases def __init__(self, order, type): self.order = order self.weights = np.zeros(order + 1) # set up bases function self.bases = [] if type == POLYNOMIAL_BASES: for i in range(0, order + 1): self.bases.append(lambda s, i=i: pow(s, i)) elif type ==
FOURIER_BASES: for i in range(0, order + 1): self.bases.append(lambda s, i=i: np.cos(i * np.pi * s)) # get the value of @state def value(self, state): # map the state space into [0, 1] state /= float(N_STATES) # get the feature vector feature = np.asarray([func(state) for func in self.bases]) return np.dot(self.weights, feature) def update(self, delta, state): # map the state space into [0, 1] state /= float(N_STATES) # get derivative value derivative_value = np.asarray([func(state) for func in self.bases]) self.weights += delta * derivative_value # gradient Monte Carlo algorithm # @value_function: an instance of class ValueFunction # @alpha: step size # @distribution: array to store the distribution statistics def gradient_monte_carlo(value_function, alpha, distribution=None): state = START_STATE trajectory = [state] # We assume gamma = 1, so return is just the same as the latest reward reward = 0.0 while state not in END_STATES: action = get_action() next_state, reward = step(state, action) trajectory.append(next_state) state = next_state # Gradient update for each state in this trajectory for state in trajectory[:-1]: delta = alpha * (reward - value_function.value(state)) value_function.update(delta, state) if distribution is not None: distribution[state] += 1 # semi-gradient n-step TD algorithm # @valueFunction: an instance of class ValueFunction # @n: # of steps # @alpha: step size def semi_gradient_temporal_difference(value_function, n, alpha): # initial starting state state = START_STATE # arrays to store states and rewards for an episode # space isn't a major consideration, so I didn't use the mod trick states = [state] rewards = [0] # track the time time = 0 # the length of this episode T = float('inf') while True: # go to next time step time += 1 if time < T: # choose an action randomly action = get_action() next_state, reward = step(state, action) # store new state and new reward states.append(next_state) rewards.append(reward) if next_state in 
END_STATES: T = time # get the time of the state to update update_time = time - n if update_time >= 0: returns = 0.0 # calculate corresponding rewards for t in range(update_time + 1, min(T, update_time + n) + 1): returns += rewards[t] # add state value to the return if update_time + n <= T: returns += value_function.value(states[update_time + n]) state_to_update = states[update_time] # update the value function if not state_to_update in END_STATES: delta = alpha * (returns - value_function.value(state_to_update)) value_function.update(delta, state_to_update) if update_time == T - 1: break state = next_state # Figure 9.1, gradient Monte Carlo algorithm def figure_9_1(true_value): episodes = int(1e5) alpha = 2e-5 # we have 10 aggregations in this example, each has 100 states value_function = ValueFunction(10) distribution = np.zeros(N_STATES + 2) for ep in tqdm(range(episodes)): gradient_monte_carlo(value_function, alpha, distribution) distribution /= np.sum(distribution) state_values = [value_function.value(i) for i in STATES] plt.figure(figsize=(10, 20)) plt.subplot(2, 1, 1) plt.plot(STATES, state_values, label='Approximate MC value') plt.plot(STATES, true_value[1: -1], label='True value') plt.xlabel('State') plt.ylabel('Value') plt.legend() plt.subplot(2, 1, 2) plt.plot(STATES, distribution[1: -1], label='State distribution') plt.xlabel('State') plt.ylabel('Distribution') plt.legend() plt.savefig('../images/figure_9_1.png') plt.close() # semi-gradient TD on 1000-state random walk def figure_9_2_left(true_value): episodes = int(1e5) alpha = 2e-4 value_function = ValueFunction(10) for ep in tqdm(range(episodes)): semi_gradient_temporal_difference(value_function, 1, alpha) stateValues = [value_function.value(i) for i in STATES] plt.plot(STATES, stateValues, label='Approximate TD value') plt.plot(STATES, true_value[1: -1], label='True value') plt.xlabel('State') plt.ylabel('Value') plt.legend() # different alphas and steps for semi-gradient TD def 
figure_9_2_right(true_value): # all possible steps steps = np.power(2, np.arange(0, 10)) # all possible alphas alphas = np.arange(0, 1.1, 0.1) # each run has 10 episodes episodes = 10 # perform 100 independent runs runs = 100 # track the errors for each (step, alpha) combination errors = np.zeros((len(steps), len(alphas))) for run in tqdm(range(runs)): for step_ind, step in zip(range(len(steps)), steps): for alpha_ind, alpha in zip(range(len(alphas)), alphas): # we have 20 aggregations in this example value_function = ValueFunction(20) for ep in range(0, episodes): semi_gradient_temporal_difference(value_function, step, alpha) # calculate the RMS error state_value = np.asarray([value_function.value(i) for i in STATES]) errors[step_ind, alpha_ind] += np.sqrt(np.sum(np.power(state_value - true_value[1: -1], 2)) / N_STATES) # take average errors /= episodes * runs # truncate the error for i in range(len(steps)): plt.plot(alphas, errors[i, :], label='n = ' + str(steps[i])) plt.xlabel('alpha') plt.ylabel('RMS error') plt.ylim([0.25, 0.55]) plt.legend() def figure_9_2(true_value): plt.figure(figsize=(10, 20)) plt.subplot(2, 1, 1) figure_9_2_left(true_value) plt.subplot(2, 1, 2) figure_9_2_right(true_value) plt.savefig('../images/figure_9_2.png') plt.close() # Figure 9.5, Fourier basis and polynomials def figure_9_5(true_value): # my machine can only afford 1 run runs = 1 episodes = 5000 # # of bases orders = [5, 10, 20] alphas = [1e-4, 5e-5] labels = [['polynomial basis'] * 3, ['fourier basis'] * 3] # track errors for each episode errors = np.zeros((len(alphas), len(orders), episodes)) for run in range(runs): for i in range(len(orders)): value_functions = [BasesValueFunction(orders[i], POLYNOMIAL_BASES), BasesValueFunction(orders[i], FOURIER_BASES)] for j in range(len(value_functions)): for episode in tqdm(range(episodes)): # gradient Monte Carlo algorithm gradient_monte_carlo(value_functions[j], alphas[j]) # get state values under current value function state_values = 
[value_functions[j].value(state) for state in STATES] # get the root-mean-squared error errors[j, i, episode] += np.sqrt(np.mean(np.power(true_value[1: -1] - state_values, 2))) # average over independent runs errors /= runs for i in range(len(alphas)): for j in range(len(orders)): plt.plot(errors[i, j, :], label='%s order = %d' % (labels[i][j], orders[j])) plt.xlabel('Episodes') # The book plots RMSVE, which is RMSE weighted by a state distribution plt.ylabel('RMSE') plt.legend() plt.savefig('../images/figure_9_5.png') plt.close() # Figure 9.10, it will take quite a while def figure_9_10(true_value): # My machine can only afford one run, thus the curve isn't so smooth runs = 1 # number of episodes episodes = 5000 num_of_tilings = 50 # each tile will cover 200 states tile_width = 200 # how to put so many tilings tiling_offset = 4 labels = ['tile coding (50 tilings)', 'state aggregation (one tiling)'] # track errors for each episode errors = np.zeros((len(labels), episodes)) for run in range(runs): # initialize value functions for multiple tilings and single tiling value_functions = [TilingsValueFunction(num_of_tilings, tile_width, tiling_offset), ValueFunction(N_STATES // tile_width)] for i in range(len(value_functions)): for episode in tqdm(range(episodes)): # I use a changing alpha according to the episode instead of a small fixed alpha # With a small fixed alpha, I don't think 5000 episodes is enough for so many # parameters in multiple tilings. 
# The asymptotic performance for single tiling stays unchanged under a changing alpha, # however the asymptotic performance for multiple tilings improves significantly alpha = 1.0 / (episode + 1) # gradient Monte Carlo algorithm gradient_monte_carlo(value_functions[i], alpha) # get state values under current value function state_values = [value_functions[i].value(state) for state in STATES] # get the root-mean-squared error errors[i][episode] += np.sqrt(np.mean(np.power(true_value[1: -1] - state_values, 2))) # average over independent runs errors /= runs for i in range(0, len(labels)): plt.plot(errors[i], label=labels[i]) plt.xlabel('Episodes') # The book plots RMSVE, which is RMSE weighted by a state distribution plt.ylabel('RMSE') plt.legend() plt.savefig('../images/figure_9_10.png') plt.close() if __name__ == '__main__': true_value = compute_true_value() figure_9_1(true_value) figure_9_2(true_value) figure_9_5(true_value) figure_9_10(true_value) ================================================ FILE: chapter09/square_wave.py ================================================ ####################################################################### # Copyright (C) # # 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) # # 2016 Kenta Shimada(hyperkentakun@gmail.com) # # Permission given to modify the code as long as you keep this # # declaration at the top # ####################################################################### import numpy as np import matplotlib matplotlib.use('Agg') import matplotlib.pyplot as plt from tqdm import tqdm # wrapper class for an interval # readability is more important than efficiency, so I won't use many tricks class Interval: # [@left, @right) def __init__(self, left, right): self.left = left self.right = right # whether a point is in this interval def contain(self, x): return self.left <= x < self.right # length of this interval def size(self): return self.right - self.left # domain of the square wave, [0, 2) DOMAIN = 
Interval(0.0, 2.0) # square wave function def square_wave(x): if 0.5 < x < 1.5: return 1 return 0 # get @n samples randomly from the square wave def sample(n): samples = [] for i in range(0, n): x = np.random.uniform(DOMAIN.left, DOMAIN.right) y = square_wave(x) samples.append([x, y]) return samples # wrapper class for value function class ValueFunction: # @domain: domain of this function, an instance of Interval # @alpha: basic step size for one update def __init__(self, feature_width, domain=DOMAIN, alpha=0.2, num_of_features=50): self.feature_width = feature_width self.num_of_features = num_of_features self.features = [] self.alpha = alpha self.domain = domain # there are many ways to place those feature windows, # following is just one possible way step = (domain.size() - feature_width) / (num_of_features - 1) left = domain.left for i in range(0, num_of_features - 1): self.features.append(Interval(left, left + feature_width)) left += step self.features.append(Interval(left, domain.right)) # initialize weight for each feature self.weights = np.zeros(num_of_features) # for point @x, return the indices of corresponding feature windows def get_active_features(self, x): active_features = [] for i in range(0, len(self.features)): if self.features[i].contain(x): active_features.append(i) return active_features # estimate the value for point @x def value(self, x): active_features = self.get_active_features(x) return np.sum(self.weights[active_features]) # update weights given sample of point @x # @delta: y - value(x), the prediction error at @x def update(self, delta, x): active_features = self.get_active_features(x) delta *= self.alpha / len(active_features) for index in active_features: self.weights[index] += delta # train @value_function with a set of samples @samples def approximate(samples, value_function): for x, y in samples: delta = y - value_function.value(x) value_function.update(delta, x) # Figure 9.8 def figure_9_8(): num_of_samples = [10, 40, 160, 640, 2560, 10240] feature_widths = [0.2,
0.4, 1.0] plt.figure(figsize=(30, 20)) axis_x = np.arange(DOMAIN.left, DOMAIN.right, 0.02) for index, num_of_sample in enumerate(num_of_samples): print(num_of_sample, 'samples') samples = sample(num_of_sample) value_functions = [ValueFunction(feature_width) for feature_width in feature_widths] plt.subplot(2, 3, index + 1) plt.title('%d samples' % (num_of_sample)) for value_function in value_functions: approximate(samples, value_function) values = [value_function.value(x) for x in axis_x] plt.plot(axis_x, values, label='feature width %.01f' % (value_function.feature_width)) plt.legend() plt.savefig('../images/figure_9_8.png') plt.close() if __name__ == '__main__': figure_9_8() ================================================ FILE: chapter10/access_control.py ================================================ ####################################################################### # Copyright (C) # # 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) # # 2016 Kenta Shimada(hyperkentakun@gmail.com) # # Permission given to modify the code as long as you keep this # # declaration at the top # ####################################################################### import numpy as np import matplotlib matplotlib.use('Agg') import matplotlib.pyplot as plt from tqdm import tqdm from mpl_toolkits.mplot3d.axes3d import Axes3D from math import floor import seaborn as sns ####################################################################### # Following are some utilities for tile coding from Rich. 
# To make each file self-contained, I copied them from
# http://incompleteideas.net/tiles/tiles3.py-remove
# with some naming convention changes
#
# Tile coding starts
class IHT:
    "Structure to handle collisions"
    def __init__(self, size_val):
        self.size = size_val
        self.overfull_count = 0
        self.dictionary = {}

    def count(self):
        return len(self.dictionary)

    def full(self):
        return len(self.dictionary) >= self.size

    def get_index(self, obj, read_only=False):
        d = self.dictionary
        if obj in d:
            return d[obj]
        elif read_only:
            return None
        size = self.size
        count = self.count()
        if count >= size:
            if self.overfull_count == 0:
                print('IHT full, starting to allow collisions')
            self.overfull_count += 1
            return hash(obj) % self.size
        else:
            d[obj] = count
            return count

def hash_coords(coordinates, m, read_only=False):
    if isinstance(m, IHT):
        return m.get_index(tuple(coordinates), read_only)
    if isinstance(m, int):
        return hash(tuple(coordinates)) % m
    if m is None:
        return coordinates

def tiles(iht_or_size, num_tilings, floats, ints=None, read_only=False):
    """returns num-tilings tile indices corresponding to the floats and ints"""
    if ints is None:
        ints = []
    qfloats = [floor(f * num_tilings) for f in floats]
    tiles = []
    for tiling in range(num_tilings):
        tilingX2 = tiling * 2
        coords = [tiling]
        b = tiling
        for q in qfloats:
            coords.append((q + b) // num_tilings)
            b += tilingX2
        coords.extend(ints)
        tiles.append(hash_coords(coords, iht_or_size, read_only))
    return tiles
# Tile coding ends
#######################################################################

# possible priorities
PRIORITIES = np.arange(0, 4)
# reward for each priority
REWARDS = np.power(2, np.arange(0, 4))

# possible actions
REJECT = 0
ACCEPT = 1
ACTIONS = [REJECT, ACCEPT]

# total number of servers
NUM_OF_SERVERS = 10

# at each time step, a busy server will be free w.p. 0.06
PROBABILITY_FREE = 0.06

# step size for learning state-action value
ALPHA = 0.01

# step size for learning average reward
BETA = 0.01

# probability for exploration
EPSILON = 0.1

# a wrapper class for the differential semi-gradient Sarsa state-action value function
class ValueFunction:
    # In this example I use the tiling software instead of implementing standard tiling by myself
    # One important thing is that tiling is only a map from (state, action) to a series of indices
    # The indices themselves need not be meaningful; it only matters that this map satisfies some properties
    # View the following webpage for more information
    # http://incompleteideas.net/sutton/tiles/tiles3.html
    # @alpha: step size for learning state-action value
    # @beta: step size for learning average reward
    def __init__(self, num_of_tilings, alpha=ALPHA, beta=BETA):
        self.num_of_tilings = num_of_tilings
        self.max_size = 2048
        self.hash_table = IHT(self.max_size)
        self.weights = np.zeros(self.max_size)

        # state features need scaling to satisfy the tile software
        self.server_scale = self.num_of_tilings / float(NUM_OF_SERVERS)
        self.priority_scale = self.num_of_tilings / float(len(PRIORITIES) - 1)

        self.average_reward = 0.0

        # divide step size equally to each tiling
        self.alpha = alpha / self.num_of_tilings

        self.beta = beta

    # get indices of active tiles for given state and action
    def get_active_tiles(self, free_servers, priority, action):
        active_tiles = tiles(self.hash_table, self.num_of_tilings,
                             [self.server_scale * free_servers, self.priority_scale * priority],
                             [action])
        return active_tiles

    # estimate the value of given state and action without subtracting average
    def value(self, free_servers, priority, action):
        active_tiles = self.get_active_tiles(free_servers, priority, action)
        return np.sum(self.weights[active_tiles])

    # estimate the value of given state without subtracting average
    def state_value(self, free_servers, priority):
        values = [self.value(free_servers, priority, action) for action in ACTIONS]
        # if no free server, can't accept
        if free_servers == 0:
            return values[REJECT]
        return np.max(values)

    # learn with given sequence
    def learn(self, free_servers, priority, action, new_free_servers, new_priority, new_action, reward):
        active_tiles = self.get_active_tiles(free_servers, priority, action)
        estimation = np.sum(self.weights[active_tiles])
        # differential TD error: reward minus average reward, plus the change in estimated value
        delta = reward - self.average_reward + self.value(new_free_servers, new_priority, new_action) - estimation
        # update average reward
        self.average_reward += self.beta * delta
        delta *= self.alpha
        for active_tile in active_tiles:
            self.weights[active_tile] += delta

# get action based on epsilon greedy policy and @value_function
def get_action(free_servers, priority, value_function):
    # if no free server, can't accept
    if free_servers == 0:
        return REJECT
    if np.random.binomial(1, EPSILON) == 1:
        return np.random.choice(ACTIONS)
    values = [value_function.value(free_servers, priority, action) for action in ACTIONS]
    # break ties among maximal actions at random
    return np.random.choice([action_ for action_, value_ in enumerate(values) if value_ == np.max(values)])

# take an action
# @return: new number of free servers, priority of the next customer, reward
def take_action(free_servers, priority, action):
    if free_servers > 0 and action == ACCEPT:
        free_servers -= 1
    reward = REWARDS[priority] * action
    # some busy servers may become free
    busy_servers = NUM_OF_SERVERS - free_servers
    free_servers += np.random.binomial(busy_servers, PROBABILITY_FREE)
    return free_servers, np.random.choice(PRIORITIES), reward

# differential semi-gradient Sarsa
# @value_function: state-action value function to learn
# @max_steps: step limit in the continuing task
def differential_semi_gradient_sarsa(value_function, max_steps):
    current_free_servers = NUM_OF_SERVERS
    current_priority = np.random.choice(PRIORITIES)
    current_action = get_action(current_free_servers, current_priority, value_function)
    # track the hit for each number of free servers
    freq = np.zeros(NUM_OF_SERVERS + 1)

    for _ in tqdm(range(max_steps)):
        freq[current_free_servers] += 1
        new_free_servers, new_priority, reward = take_action(current_free_servers, current_priority, current_action)
        new_action = get_action(new_free_servers, new_priority, value_function)

        value_function.learn(current_free_servers, current_priority, current_action,
                             new_free_servers, new_priority, new_action, reward)

        current_free_servers = new_free_servers
        current_priority = new_priority
        current_action = new_action
    print('Frequency of number of free servers:')
    print(freq / max_steps)

# Figure 10.5, Differential semi-gradient Sarsa on the access-control queuing task
def figure_10_5():
    max_steps = int(1e6)
    # use tile coding with 8 tilings
    num_of_tilings = 8
    value_function = ValueFunction(num_of_tilings)
    differential_semi_gradient_sarsa(value_function, max_steps)
    values = np.zeros((len(PRIORITIES), NUM_OF_SERVERS + 1))
    for priority in PRIORITIES:
        for free_servers in range(NUM_OF_SERVERS + 1):
            values[priority, free_servers] = value_function.state_value(free_servers, priority)

    fig = plt.figure(figsize=(10, 20))
    plt.subplot(2, 1, 1)
    for priority in PRIORITIES:
        plt.plot(range(NUM_OF_SERVERS + 1), values[priority, :], label='priority %d' % (REWARDS[priority]))
    plt.xlabel('Number of free servers')
    plt.ylabel('Differential value of best action')
    plt.legend()

    ax = fig.add_subplot(2, 1, 2)
    policy = np.zeros((len(PRIORITIES), NUM_OF_SERVERS + 1))
    for priority in PRIORITIES:
        for free_servers in range(NUM_OF_SERVERS + 1):
            values = [value_function.value(free_servers, priority, action) for action in ACTIONS]
            if free_servers == 0:
                policy[priority, free_servers] = REJECT
            else:
                policy[priority, free_servers] = np.argmax(values)

    fig = sns.heatmap(policy, cmap="YlGnBu", ax=ax, xticklabels=range(NUM_OF_SERVERS + 1), yticklabels=PRIORITIES)
    fig.set_title('Policy (0 Reject, 1 Accept)')
    fig.set_xlabel('Number of free servers')
    fig.set_ylabel('Priority')

    plt.savefig('../images/figure_10_5.png')
    plt.close()

if __name__ == '__main__':
    figure_10_5()

================================================
FILE: chapter10/mountain_car.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
from mpl_toolkits.mplot3d.axes3d import Axes3D
from math import floor

#######################################################################
# Following are some utilities for tile coding from Rich.
# To make each file self-contained, I copied them from
# http://incompleteideas.net/tiles/tiles3.py-remove
# with some naming convention changes
#
# Tile coding starts
class IHT:
    "Structure to handle collisions"
    def __init__(self, size_val):
        self.size = size_val
        self.overfull_count = 0
        self.dictionary = {}

    def count(self):
        return len(self.dictionary)

    def full(self):
        return len(self.dictionary) >= self.size

    def get_index(self, obj, read_only=False):
        d = self.dictionary
        if obj in d:
            return d[obj]
        elif read_only:
            return None
        size = self.size
        count = self.count()
        if count >= size:
            if self.overfull_count == 0:
                print('IHT full, starting to allow collisions')
            self.overfull_count += 1
            return hash(obj) % self.size
        else:
            d[obj] = count
            return count

def hash_coords(coordinates, m, read_only=False):
    if isinstance(m, IHT):
        return m.get_index(tuple(coordinates), read_only)
    if isinstance(m, int):
        return hash(tuple(coordinates)) % m
    if m is None:
        return coordinates

def tiles(iht_or_size, num_tilings, floats, ints=None, read_only=False):
    """returns num-tilings tile indices corresponding to the floats and ints"""
    if ints is None:
        ints = []
    qfloats = [floor(f * num_tilings) for f in floats]
    tiles = []
    for tiling in range(num_tilings):
        tilingX2 = tiling * 2
        coords = [tiling]
        b = tiling
        for q in qfloats:
            coords.append((q + b) // num_tilings)
            b += tilingX2
        coords.extend(ints)
        tiles.append(hash_coords(coords, iht_or_size, read_only))
    return tiles
# Tile coding ends
#######################################################################

# all possible actions
ACTION_REVERSE = -1
ACTION_ZERO = 0
ACTION_FORWARD = 1
# order is important
ACTIONS = [ACTION_REVERSE, ACTION_ZERO, ACTION_FORWARD]

# bound for position and velocity
POSITION_MIN = -1.2
POSITION_MAX = 0.5
VELOCITY_MIN = -0.07
VELOCITY_MAX = 0.07

# use optimistic initial value, so it's ok to set epsilon to 0
EPSILON = 0

# take an @action at @position and @velocity
# @return: new position, new velocity, reward (always -1)
def step(position, velocity, action):
    new_velocity = velocity + 0.001 * action - 0.0025 * np.cos(3 * position)
    new_velocity = min(max(VELOCITY_MIN, new_velocity), VELOCITY_MAX)
    new_position = position + new_velocity
    new_position = min(max(POSITION_MIN, new_position), POSITION_MAX)
    reward = -1.0
    # hitting the left bound resets the velocity to zero
    if new_position == POSITION_MIN:
        new_velocity = 0.0
    return new_position, new_velocity, reward

# wrapper class for state action value function
class ValueFunction:
    # In this example I use the tiling software instead of implementing standard tiling by myself
    # One important thing is that tiling is only a map from (state, action) to a series of indices
    # The indices themselves need not be meaningful; it only matters that this map satisfies some properties
    # View the following webpage for more information
    # http://incompleteideas.net/sutton/tiles/tiles3.html
    # @max_size: the maximum # of indices
    def __init__(self, step_size, num_of_tilings=8, max_size=2048):
        self.max_size = max_size
        self.num_of_tilings = num_of_tilings

        # divide step size equally to each tiling
        self.step_size = step_size / num_of_tilings

        self.hash_table = IHT(max_size)

        # weight for each tile
        self.weights = np.zeros(max_size)

        # position and velocity need scaling to satisfy the tile software
        self.position_scale = self.num_of_tilings / (POSITION_MAX - POSITION_MIN)
        self.velocity_scale = self.num_of_tilings / (VELOCITY_MAX - VELOCITY_MIN)

    # get indices of active tiles for given state and action
    def get_active_tiles(self, position, velocity, action):
        # I think position_scale * (position - POSITION_MIN) would be a good normalization.
        # However position_scale * POSITION_MIN is a constant, so it's ok to ignore it.
        active_tiles = tiles(self.hash_table, self.num_of_tilings,
                             [self.position_scale * position, self.velocity_scale * velocity],
                             [action])
        return active_tiles

    # estimate the value of given state and action
    def value(self, position, velocity, action):
        # the goal state is terminal, so its value is 0
        if position == POSITION_MAX:
            return 0.0
        active_tiles = self.get_active_tiles(position, velocity, action)
        return np.sum(self.weights[active_tiles])

    # learn with given state, action and target
    def learn(self, position, velocity, action, target):
        active_tiles = self.get_active_tiles(position, velocity, action)
        estimation = np.sum(self.weights[active_tiles])
        delta = self.step_size * (target - estimation)
        for active_tile in active_tiles:
            self.weights[active_tile] += delta

    # estimate the # of steps to reach the goal under the current state-action value function
    def cost_to_go(self, position, velocity):
        costs = []
        for action in ACTIONS:
            costs.append(self.value(position, velocity, action))
        return -np.max(costs)

# get action at @position and @velocity based on epsilon greedy policy and @value_function
def get_action(position, velocity, value_function):
    if np.random.binomial(1, EPSILON) == 1:
        return np.random.choice(ACTIONS)
    values = []
    for action in ACTIONS:
        values.append(value_function.value(position, velocity, action))
    # break ties at random; subtracting 1 maps the index 0, 1, 2 to the action -1, 0, 1
    return np.random.choice([action_ for action_, value_ in enumerate(values) if value_ == np.max(values)]) - 1

# semi-gradient n-step Sarsa
# @value_function: state-action value function to learn
# @n: # of steps
def semi_gradient_n_step_sarsa(value_function, n=1):
    # start at a random position around the bottom of the valley
    current_position = np.random.uniform(-0.6, -0.4)
    # initial velocity is 0
    current_velocity = 0.0
    # get initial action
    current_action = get_action(current_position, current_velocity, value_function)

    # track previous position, velocity, action and reward
    positions = [current_position]
    velocities = [current_velocity]
    actions = [current_action]
    rewards = [0.0]

    # track the time
    time = 0

    # the length of this episode
    T = float('inf')
    while True:
        # go to next time step
        time += 1

        if time < T:
            # take current action and go to the new state
            new_position, new_velocity, reward = step(current_position, current_velocity, current_action)
            # choose new action
            new_action = get_action(new_position, new_velocity, value_function)

            # track new state and action
            positions.append(new_position)
            velocities.append(new_velocity)
            actions.append(new_action)
            rewards.append(reward)

            if new_position == POSITION_MAX:
                T = time

        # get the time of the state to update
        update_time = time - n
        if update_time >= 0:
            returns = 0.0
            # calculate corresponding rewards
            for t in range(update_time + 1, min(T, update_time + n) + 1):
                returns += rewards[t]
            # add estimated state action value to the return
            if update_time + n <= T:
                returns += value_function.value(positions[update_time + n],
                                                velocities[update_time + n],
                                                actions[update_time + n])
            # update the state-action value function
            if positions[update_time] != POSITION_MAX:
                value_function.learn(positions[update_time], velocities[update_time], actions[update_time], returns)
        if update_time == T - 1:
            break
        current_position = new_position
        current_velocity = new_velocity
        current_action = new_action

    return time

# print learned cost to go
def print_cost(value_function, episode, ax):
    grid_size = 40
    positions = np.linspace(POSITION_MIN, POSITION_MAX, grid_size)
    # positionStep = (POSITION_MAX - POSITION_MIN) / grid_size
    # positions = np.arange(POSITION_MIN, POSITION_MAX + positionStep, positionStep)
    # velocityStep = (VELOCITY_MAX - VELOCITY_MIN) / grid_size
    # velocities = np.arange(VELOCITY_MIN, VELOCITY_MAX + velocityStep, velocityStep)
    velocities = np.linspace(VELOCITY_MIN, VELOCITY_MAX, grid_size)
    axis_x = []
    axis_y = []
    axis_z = []
    for position in positions:
        for velocity in velocities:
            axis_x.append(position)
            axis_y.append(velocity)
            axis_z.append(value_function.cost_to_go(position, velocity))

    ax.scatter(axis_x, axis_y, axis_z)
    ax.set_xlabel('Position')
    ax.set_ylabel('Velocity')
    ax.set_zlabel('Cost to go')
    ax.set_title('Episode %d' % (episode + 1))

# Figure 10.1, cost to go in a single run
def figure_10_1():
    episodes = 9000
    plot_episodes = [0, 99, episodes - 1]
    fig = plt.figure(figsize=(40, 10))
    axes = [fig.add_subplot(1, len(plot_episodes), i+1, projection='3d') for i in range(len(plot_episodes))]
    num_of_tilings = 8
    alpha = 0.3
    value_function = ValueFunction(alpha, num_of_tilings)
    for ep in tqdm(range(episodes)):
        semi_gradient_n_step_sarsa(value_function)
        if ep in plot_episodes:
            print_cost(value_function, ep, axes[plot_episodes.index(ep)])

    plt.savefig('../images/figure_10_1.png')
    plt.close()

# Figure 10.2, semi-gradient Sarsa with different alphas
def figure_10_2():
    runs = 10
    episodes = 500
    num_of_tilings = 8
    alphas = [0.1, 0.2, 0.5]

    steps = np.zeros((len(alphas), episodes))
    for run in range(runs):
        value_functions = [ValueFunction(alpha, num_of_tilings) for alpha in alphas]
        for index in range(len(value_functions)):
            for episode in tqdm(range(episodes)):
                step = semi_gradient_n_step_sarsa(value_functions[index])
                steps[index, episode] += step

    # average over independent runs
    steps /= runs

    for i in range(0, len(alphas)):
        plt.plot(steps[i], label='alpha = '+str(alphas[i])+'/'+str(num_of_tilings))
    plt.xlabel('Episode')
    plt.ylabel('Steps per episode')
    plt.yscale('log')
    plt.legend()

    plt.savefig('../images/figure_10_2.png')
    plt.close()

# Figure 10.3, one-step semi-gradient Sarsa vs multi-step semi-gradient Sarsa
def figure_10_3():
    runs = 10
    episodes = 500
    num_of_tilings = 8
    alphas = [0.5, 0.3]
    n_steps = [1, 8]

    steps = np.zeros((len(alphas), episodes))
    for run in range(runs):
        value_functions = [ValueFunction(alpha, num_of_tilings) for alpha in alphas]
        for index in range(len(value_functions)):
            for episode in tqdm(range(episodes)):
                step = semi_gradient_n_step_sarsa(value_functions[index], n_steps[index])
                steps[index, episode] += step

    # average over independent runs
    steps /= runs

    for i in range(0, len(alphas)):
        plt.plot(steps[i], label='n = %d' % (n_steps[i]))
    plt.xlabel('Episode')
    plt.ylabel('Steps per episode')
    plt.yscale('log')
    plt.legend()

    plt.savefig('../images/figure_10_3.png')
    plt.close()

# Figure 10.4, effect of alpha and n on multi-step semi-gradient Sarsa
def figure_10_4():
    alphas = np.arange(0.25, 1.75, 0.25)
    n_steps = np.power(2, np.arange(0, 5))
    episodes = 50
    runs = 5

    max_steps = 300
    steps = np.zeros((len(n_steps), len(alphas)))
    for run in range(runs):
        for n_step_index, n_step in enumerate(n_steps):
            for alpha_index, alpha in enumerate(alphas):
                if (n_step == 8 and alpha > 1) or \
                        (n_step == 16 and alpha > 0.75):
                    # In these cases it won't converge, so ignore them
                    steps[n_step_index, alpha_index] += max_steps * episodes
                    continue
                value_function = ValueFunction(alpha)
                for episode in tqdm(range(episodes)):
                    step = semi_gradient_n_step_sarsa(value_function, n_step)
                    steps[n_step_index, alpha_index] += step

    # average over independent runs and episodes
    steps /= runs * episodes

    for i in range(0, len(n_steps)):
        plt.plot(alphas, steps[i, :], label='n = '+str(n_steps[i]))
    plt.xlabel('alpha * number of tilings(8)')
    plt.ylabel('Steps per episode')
    plt.ylim([220, max_steps])
    plt.legend()

    plt.savefig('../images/figure_10_4.png')
    plt.close()

if __name__ == '__main__':
    figure_10_1()
    figure_10_2()
    figure_10_3()
    figure_10_4()

================================================
FILE: chapter11/counterexample.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016 - 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)           #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm
from mpl_toolkits.mplot3d.axes3d import Axes3D

# all states: state 0-5 are upper states
STATES = np.arange(0, 7)
# state 6 is lower state
LOWER_STATE = 6
# discount factor
DISCOUNT = 0.99

# each state is represented by a vector of length 8
FEATURE_SIZE = 8
FEATURES = np.zeros((len(STATES), FEATURE_SIZE))
for i in range(LOWER_STATE):
    FEATURES[i, i] = 2
    FEATURES[i, 7] = 1
FEATURES[LOWER_STATE, 6] = 1
FEATURES[LOWER_STATE, 7] = 2

# all possible actions
DASHED = 0
SOLID = 1
ACTIONS = [DASHED, SOLID]

# reward is always zero
REWARD = 0

# take @action at @state, return the new state
def step(state, action):
    if action == SOLID:
        return LOWER_STATE
    return np.random.choice(STATES[: LOWER_STATE])

# target policy
def target_policy(state):
    return SOLID

# state distribution for the behavior policy
STATE_DISTRIBUTION = np.ones(len(STATES)) / 7
STATE_DISTRIBUTION_MAT = np.matrix(np.diag(STATE_DISTRIBUTION))
# projection matrix for minimizing MSVE
PROJECTION_MAT = np.matrix(FEATURES) * \
                 np.linalg.pinv(np.matrix(FEATURES.T) * STATE_DISTRIBUTION_MAT * np.matrix(FEATURES)) * \
                 np.matrix(FEATURES.T) * \
                 STATE_DISTRIBUTION_MAT

# behavior policy
BEHAVIOR_SOLID_PROBABILITY = 1.0 / 7
def behavior_policy(state):
    if np.random.binomial(1, BEHAVIOR_SOLID_PROBABILITY) == 1:
        return SOLID
    return DASHED

# Semi-gradient off-policy temporal difference
# @state: current state
# @theta: weight for each component of the feature vector
# @alpha: step size
# @return: next state
def semi_gradient_off_policy_TD(state, theta, alpha):
    action = behavior_policy(state)
    next_state = step(state, action)
    # get the importance ratio
    if action == DASHED:
        rho = 0.0
    else:
        rho = 1.0 / BEHAVIOR_SOLID_PROBABILITY
    delta = REWARD + DISCOUNT * np.dot(FEATURES[next_state, :], theta) - \
            np.dot(FEATURES[state, :], theta)
    delta *= rho * alpha
    # derivatives happen to be the same matrix due to the linearity
    theta += FEATURES[state, :] * delta
    return next_state

# Semi-gradient DP
# @theta: weight for each component of the feature vector
# @alpha: step size
def semi_gradient_DP(theta, alpha):
    delta = 0.0
    # go through all the states
    for state in STATES:
        expected_return = 0.0
        # compute bellman error for each state
        for next_state in STATES:
            if next_state == LOWER_STATE:
                expected_return += REWARD + DISCOUNT * np.dot(theta, FEATURES[next_state, :])
        bellmanError = expected_return - np.dot(theta, FEATURES[state, :])
        # accumulate gradients
        delta += bellmanError * FEATURES[state, :]
    # derivatives happen to be the same matrix due to the linearity
    theta += alpha / len(STATES) * delta

# temporal difference with gradient correction
# @state: current state
# @theta: weight of each component of the feature vector
# @weight: auxiliary trace for gradient correction
# @alpha: step size of @theta
# @beta: step size of @weight
def TDC(state, theta, weight, alpha, beta):
    action = behavior_policy(state)
    next_state = step(state, action)
    # get the importance ratio
    if action == DASHED:
        rho = 0.0
    else:
        rho = 1.0 / BEHAVIOR_SOLID_PROBABILITY
    delta = REWARD + DISCOUNT * np.dot(FEATURES[next_state, :], theta) - \
            np.dot(FEATURES[state, :], theta)
    theta += alpha * rho * (delta * FEATURES[state, :] - DISCOUNT * FEATURES[next_state, :] * np.dot(FEATURES[state, :], weight))
    weight += beta * rho * (delta - np.dot(FEATURES[state, :], weight)) * FEATURES[state, :]
    return next_state

# expected temporal difference with gradient correction
# @theta: weight of each component of the feature vector
# @weight: auxiliary trace for gradient correction
# @alpha: step size of @theta
# @beta: step size of @weight
def expected_TDC(theta, weight, alpha, beta):
    for state in STATES:
        # When computing expected update target, if next state is not lower state, importance ratio will be 0,
        # so we can safely ignore this case and assume next state is always lower state
        delta = REWARD + DISCOUNT * np.dot(FEATURES[LOWER_STATE, :], theta) - np.dot(FEATURES[state, :], theta)
        rho = 1 / BEHAVIOR_SOLID_PROBABILITY
        # Under behavior policy, state distribution is uniform, so the probability for each state is 1.0 / len(STATES)
        expected_update_theta = 1.0 / len(STATES) * BEHAVIOR_SOLID_PROBABILITY * rho * (
            delta * FEATURES[state, :] - DISCOUNT * FEATURES[LOWER_STATE, :] * np.dot(weight, FEATURES[state, :]))
        theta += alpha * expected_update_theta
        expected_update_weight = 1.0 / len(STATES) * BEHAVIOR_SOLID_PROBABILITY * rho * (
            delta - np.dot(weight, FEATURES[state, :])) * FEATURES[state, :]
        weight += beta * expected_update_weight

    # if we *accumulate* the expected updates in the loop and only apply them here, the update is synchronous
    # theta += alpha * expected_update_theta
    # weight += beta * expected_update_weight

# interest is 1 for every state
INTEREST = 1

# expected update of ETD
# @theta: weight of each component of the feature vector
# @emphasis: current emphasis
# @alpha: step size of @theta
# @return: expected next emphasis
def expected_emphatic_TD(theta, emphasis, alpha):
    # we perform synchronous update for both theta and emphasis
    expected_update = 0
    expected_next_emphasis = 0.0
    # go through all the states
    for state in STATES:
        # compute rho(t-1)
        if state == LOWER_STATE:
            rho = 1.0 / BEHAVIOR_SOLID_PROBABILITY
        else:
            rho = 0
        # update emphasis
        next_emphasis = DISCOUNT * rho * emphasis + INTEREST
        expected_next_emphasis += next_emphasis
        # When computing expected update target, if next state is not lower state, importance ratio will be 0,
        # so we can safely ignore this case and assume next state is always lower state
        next_state = LOWER_STATE
        delta = REWARD + DISCOUNT * np.dot(FEATURES[next_state, :], theta) - np.dot(FEATURES[state, :], theta)
        expected_update += 1.0 / len(STATES) * BEHAVIOR_SOLID_PROBABILITY * next_emphasis * 1 / BEHAVIOR_SOLID_PROBABILITY * delta * FEATURES[state, :]
    theta += alpha * expected_update
    return expected_next_emphasis / len(STATES)

# compute RMSVE for a value function parameterized by @theta
# true value function is always 0 in this example
def compute_RMSVE(theta):
    return np.sqrt(np.dot(np.power(np.dot(FEATURES, theta), 2), STATE_DISTRIBUTION))

# compute RMSPBE for a value function parameterized by @theta
# true value function is always 0 in this example
def compute_RMSPBE(theta):
    bellman_error = np.zeros(len(STATES))
    for state in STATES:
        for next_state in STATES:
            if next_state == LOWER_STATE:
                bellman_error[state] += REWARD + DISCOUNT * np.dot(theta, FEATURES[next_state, :]) - np.dot(theta, FEATURES[state, :])
    bellman_error = np.dot(np.asarray(PROJECTION_MAT), bellman_error)
    return np.sqrt(np.dot(np.power(bellman_error, 2), STATE_DISTRIBUTION))

figureIndex = 0

# Figure 11.2(left), semi-gradient off-policy TD
def figure_11_2_left():
    # Initialize the theta
    theta = np.ones(FEATURE_SIZE)
    theta[6] = 10

    alpha = 0.01

    steps = 1000
    thetas = np.zeros((FEATURE_SIZE, steps))
    state = np.random.choice(STATES)
    for step in tqdm(range(steps)):
        state = semi_gradient_off_policy_TD(state, theta, alpha)
        thetas[:, step] = theta

    for i in range(FEATURE_SIZE):
        plt.plot(thetas[i, :], label='theta' + str(i + 1))
    plt.xlabel('Steps')
    plt.ylabel('Theta value')
    plt.title('semi-gradient off-policy TD')
    plt.legend()

# Figure 11.2(right), semi-gradient DP
def figure_11_2_right():
    # Initialize the theta
    theta = np.ones(FEATURE_SIZE)
    theta[6] = 10

    alpha = 0.01

    sweeps = 1000
    thetas = np.zeros((FEATURE_SIZE, sweeps))
    for sweep in tqdm(range(sweeps)):
        semi_gradient_DP(theta, alpha)
        thetas[:, sweep] = theta

    for i in range(FEATURE_SIZE):
        plt.plot(thetas[i, :], label='theta' + str(i + 1))
    plt.xlabel('Sweeps')
    plt.ylabel('Theta value')
    plt.title('semi-gradient DP')
    plt.legend()

def figure_11_2():
    plt.figure(figsize=(10, 20))
    plt.subplot(2, 1, 1)
    figure_11_2_left()
    plt.subplot(2, 1, 2)
    figure_11_2_right()

    plt.savefig('../images/figure_11_2.png')
    plt.close()

# Figure 11.6(left), temporal difference with gradient correction
def figure_11_6_left(): # Initialize the theta theta = np.ones(FEATURE_SIZE) theta[6] = 10 weight = np.zeros(FEATURE_SIZE) alpha = 0.005 beta = 0.05 steps = 1000 thetas = np.zeros((FEATURE_SIZE, steps)) RMSVE = np.zeros(steps) RMSPBE = np.zeros(steps) state = np.random.choice(STATES) for step in tqdm(range(steps)): state = TDC(state, theta, weight, alpha, beta) thetas[:, step] = theta RMSVE[step] = compute_RMSVE(theta) RMSPBE[step] = compute_RMSPBE(theta) for i in range(FEATURE_SIZE): plt.plot(thetas[i, :], label='theta' + str(i + 1)) plt.plot(RMSVE, label='RMSVE') plt.plot(RMSPBE, label='RMSPBE') plt.xlabel('Steps') plt.title('TDC') plt.legend() # Figure 11.6(right), expected temporal difference with gradient correction def figure_11_6_right(): # Initialize the theta theta = np.ones(FEATURE_SIZE) theta[6] = 10 weight = np.zeros(FEATURE_SIZE) alpha = 0.005 beta = 0.05 sweeps = 1000 thetas = np.zeros((FEATURE_SIZE, sweeps)) RMSVE = np.zeros(sweeps) RMSPBE = np.zeros(sweeps) for sweep in tqdm(range(sweeps)): expected_TDC(theta, weight, alpha, beta) thetas[:, sweep] = theta RMSVE[sweep] = compute_RMSVE(theta) RMSPBE[sweep] = compute_RMSPBE(theta) for i in range(FEATURE_SIZE): plt.plot(thetas[i, :], label='theta' + str(i + 1)) plt.plot(RMSVE, label='RMSVE') plt.plot(RMSPBE, label='RMSPBE') plt.xlabel('Sweeps') plt.title('Expected TDC') plt.legend() def figure_11_6(): plt.figure(figsize=(10, 20)) plt.subplot(2, 1, 1) figure_11_6_left() plt.subplot(2, 1, 2) figure_11_6_right() plt.savefig('../images/figure_11_6.png') plt.close() # Figure 11.7, expected ETD def figure_11_7(): # Initialize the theta theta = np.ones(FEATURE_SIZE) theta[6] = 10 alpha = 0.03 sweeps = 1000 thetas = np.zeros((FEATURE_SIZE, sweeps)) RMSVE = np.zeros(sweeps) emphasis = 0.0 for sweep in tqdm(range(sweeps)): emphasis = expected_emphatic_TD(theta, emphasis, alpha) thetas[:, sweep] = theta RMSVE[sweep] = compute_RMSVE(theta) for i in range(FEATURE_SIZE): plt.plot(thetas[i, :], label='theta' + str(i + 
1)) plt.plot(RMSVE, label='RMSVE') plt.xlabel('Sweeps') plt.title('emphatic TD') plt.legend() plt.savefig('../images/figure_11_7.png') plt.close() if __name__ == '__main__': figure_11_2() figure_11_6() figure_11_7() ================================================ FILE: chapter12/lambda_effect.py ================================================ ####################################################################### # Copyright (C) # # 2021 Johann Huber (huber.joh@hotmail.fr) # # Permission given to modify the code as long as you keep this # # declaration at the top # ####################################################################### """ Description: This script is meant to reproduce Figure 12.14 of Sutton and Barto's book. This example shows the effect of λ on 4 reinforcement learning tasks. Credits: The "Cart and Pole" environment's code has been taken from openai gym source code. Link : https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py#L7 The tile coding software has been taken from Sutton's website. Link : http://www.incompleteideas.net/tiles/tiles3.html Remark: - The optimum step-size parameters search have been omitted to avoid an even longer code. This problem has already been met several times in the chapter. Structure: 1. Utils 1.1. Tiling utils 1.2. Eligibility traces utils 2. Random walk 3. Mountain Car 4. Cart and Pole 5. Results 5.1. Getting plot data 5.2. Reproducing figure 12.14 5.3. Main """; import math import numpy as np import pandas as pd from tqdm import tqdm import matplotlib.pyplot as plt import seaborn as sns; sns.set_theme() ############################################################################################# # 1. Utils # ############################################################################################# #-------------------# # 1.1. 
Tiling utils # #-------------------# # Credit : http://www.incompleteideas.net/tiles/tiles3.html basehash = hash class IHT: """Structure to handle collisions.""" def __init__(self, sizeval): self.size = sizeval self.overfullCount = 0 self.dictionary = {} def __str__(self): """Prepares a string for printing whenever this object is printed.""" return "Collision table:" + \ " size:" + str(self.size) + \ " overfullCount:" + str(self.overfullCount) + \ " dictionary:" + str(len(self.dictionary)) + " items" def count(self): return len(self.dictionary) def fullp(self): return len(self.dictionary) >= self.size def getindex(self, obj, readonly=False): d = self.dictionary if obj in d: return d[obj] elif readonly: return None size = self.size count = self.count() if count >= size: if self.overfullCount == 0: print('IHT full, starting to allow collisions') assert self.overfullCount != 0 self.overfullCount += 1 return basehash(obj) % self.size else: d[obj] = count return count def hashcoords(coordinates, m, readonly=False): if type(m) == IHT: return m.getindex(tuple(coordinates), readonly) if type(m) == int: return basehash(tuple(coordinates)) % m if m == None: return coordinates from math import floor, log from itertools import zip_longest def tiles(ihtORsize, numtilings, floats, ints=[], readonly=False): """Returns num-tilings tile indices corresponding to the floats and ints""" qfloats = [floor(f * numtilings) for f in floats] Tiles = [] for tiling in range(numtilings): tilingX2 = tiling * 2 coords = [tiling] b = tiling for q in qfloats: coords.append((q + b) // numtilings) b += tilingX2 coords.extend(ints) Tiles.append(hashcoords(coords, ihtORsize, readonly)) return Tiles def tileswrap(ihtORsize, numtilings, floats, wrapwidths, ints=[], readonly=False): """Returns num-tilings tile indices corresponding to the floats and ints, wrapping some floats""" qfloats = [floor(f * numtilings) for f in floats] Tiles = [] for tiling in range(numtilings): tilingX2 = tiling * 2 coords = 
[tiling] b = tiling for q, width in zip_longest(qfloats, wrapwidths): c = (q + b % numtilings) // numtilings coords.append(c % width if width else c) b += tilingX2 coords.extend(ints) Tiles.append(hashcoords(coords, ihtORsize, readonly)) return Tiles class IndexHashTable: def __init__(self, iht_size, num_tilings, tiling_size, obs_bounds): # Index Hash Table size self._iht = IHT(iht_size) # Number of tilings self._num_tilings = num_tilings # Tiling size self._tiling_size = tiling_size # Observation boundaries # (format : [[min_1, max_1], ..., [min_i, max_i], ... ] for i in state's components) self._obs_bounds = obs_bounds def get_tiles(self, state, action): """Get the encoded state_action using Sutton's grid tiling software.""" # List of float numbers to be tiled floats = [s * self._tiling_size/(obs_max - obs_min) for (s, (obs_min, obs_max)) in zip(state, self._obs_bounds)] return tiles(self._iht, self._num_tilings, floats, [action]) #-------------------------------# # 1.2. Eligibility traces utils # #-------------------------------# def update_trace_vector(agent, method, state, action=None): """Updates agent's trace vector (z) with the current state (or state-action pair) using the given method. Returns the updated vector.""" assert method in ['replace', 'replace_reset', 'accumulating'], 'Invalid trace update method.' # Trace step z = agent._γ * agent._λ * agent._z # Update last observations components if action is not None: x_ids = agent.get_active_features(state, action) # x(s,a) else: x_ids = agent.get_active_features(state) # x(s) if method == 'replace_reset': for a in agent._all_actions: if a != action: x_ids2clear = agent.get_active_features(state, a) # always x(s,a) for id_w in x_ids2clear: z[id_w] = 0 for id_w in x_ids: if (method == 'replace') or (method == 'replace_reset'): z[id_w] = 1 elif method == 'accumulating': z[id_w] += 1 return z ############################################################################################# # 2.
Random walk # ############################################################################################# class RandomWalkEnvironment: def __init__(self): # Number of states n_states = 21 # [term=0] [1, ... , 19] [term=20] # Transition rewards self._rewards = {key:0 for key in range(n_states)} self._rewards[0] = -1 self._rewards[n_states-1] = 1 # Id terminal states self._terminal_states = [0, n_states - 1] def step(self, state, action): next_state = state + action reward = self._rewards[next_state] return next_state, reward class RandomWalkAgent: def __init__(self, lmbda, alpha): # Number of states self._n_states = 21 # [term=0] [1, ... , 19] [term=20] # Weight vector self._w = np.zeros(self._n_states) # Eligibility trace self._z = np.zeros(self._n_states) # Id initial state (center of the 19 non-terminal states, whose true value is 0) self._init_state = self._n_states // 2 # Id terminal states self._terminal_states = [0, self._n_states - 1] # Action space self._all_actions = [-1, 1] # Learning step-size self._α = alpha # Discount factor self._γ = 1. # Exponential weighting decrease self._λ = lmbda # True values (to compute RMS error) self._target_values = np.array([i/10 for i in range(-9,10)]) # RMS error computed for each episode self._error_hist = [] @property def error_hist(self): return self._error_hist def get_all_v_hat(self): all_v_hats = np.array([self.v_hat(s) for s in range(self._n_states)]) return all_v_hats[1:-1] # discard terminal states def policy(self, state): """Action selection : uniform distribution. State argument is given for consistency.""" return np.random.choice(self._all_actions) def v_hat(self, state): """Returns the approximated value for state, w.r.t. the weight vector.""" if state in self._terminal_states: return 0. # by convention : the value of a terminal state is 0 value = self._w[state] return value def grad_v_hat(self, state): """Compute the gradient of the state value w.r.t. 
the weight vector.""" grad_v_hat = np.zeros_like(self._z) grad_v_hat[state] = 1 return grad_v_hat def get_active_features(self, state): """Get an array containing the id of the current active feature.""" return [np.where(self.grad_v_hat(state) == 1)[0][0]] def run_td_lambda(self, env, n_episodes, method): """Method described p293 of the book. :param env: environment to interact with. :param n_episodes: number of episodes to train on. :param method: specify the TD(λ) method : * 'accumulating' : With accumulating traces ; * 'replace' : With replacing traces ; :return: None """ assert method in ['replace', 'accumulating'], 'Invalid method' for n_ep in range(n_episodes): curr_state = self._init_state self._z = np.zeros(self._n_states) running = True while running: state = curr_state action = self.policy(state) next_state, reward = env.step(state, action) self._z = update_trace_vector(agent=self, method=method, state=state) # Moment-by-moment TD error δ = reward + self._γ * self.v_hat(next_state) - self.v_hat(state) # Weight vector update self._w += self._α * δ * self._z if next_state in self._terminal_states: running = False else: curr_state = next_state rms_err = np.sqrt(np.array((self._target_values - self.get_all_v_hat()) ** 2).mean()) self._error_hist.append(rms_err) class RandomWalk: def __init__(self, lmbda, alpha): self._env = RandomWalkEnvironment() self._agent = RandomWalkAgent(lmbda=lmbda, alpha=alpha) @property def error_hist(self): return self._agent.error_hist def train(self, n_episodes, method): assert method in ['replace', 'accumulating'], 'Invalid method' self._agent.run_td_lambda(self._env, n_episodes=n_episodes, method=method) ############################################################################################# # 3. 
Mountain Car # ############################################################################################# class MountainCarEnvironment: def __init__(self): # Action space self._all_actions = [-1, 0, 1] # Position bounds self._pos_lims = [-1.2, 0.5] # Speed bounds self._vel_lims = [-0.07, 0.07] # Terminal state position self._pos_terminal = self._pos_lims[1] # Terminal state reward self._terminal_reward = 0 # Non-terminal state reward self.step_reward = -1 def step(self, state, action): x, x_dot = state x_dot_next = x_dot + 0.001 * action - 0.0025 * np.cos(3 * x) x_dot_next = np.clip(x_dot_next, a_min=self._vel_lims[0], a_max=self._vel_lims[1]) x_next = x + x_dot_next x_next = np.clip(x_next, a_min=self._pos_lims[0], a_max=self._pos_lims[1]) if x_next == self._pos_lims[0]: x_dot_next = 0. # left border : reset speed next_state = (x_next, x_dot_next) reward = self._terminal_reward if (x_next == self._pos_terminal) else self.step_reward return next_state, reward class MountainCarAgent: def __init__(self, alpha, lmbda, iht_args): # Index Hash Table for position encoding self._iht = IndexHashTable(**iht_args) # Number of tilings self._num_tilings = iht_args['num_tilings'] # Weight vector init_w_val = -20. # optimistic initial values to make the agent explore self._w = np.full(iht_args['iht_size'], init_w_val) # Eligibility trace self._z = np.zeros(iht_args['iht_size']) # Maximum number of step within an episode (avoid infinite episode) self.max_n_step = 4000 # Minimum cumulated reward (means that q values have diverged). self.default_min_reward = -4000 # Action space self._all_actions = [-1, 0, 1] # Position bounds self._pos_lims = [-1.2, 0.5] # Speed bounds self._vel_lims = [-0.07, 0.07] # Terminal state position self._pos_terminal = self._pos_lims[1] # Learning step-size self._α = alpha # Exponential weighting decrease self._λ = lmbda # Discount factor self._γ = 1. 
# Number of steps before termination, computed for each episode self._n_step_hist = [] @property def n_step_hist(self): return self._n_step_hist def policy(self, state): """Apply a greedy policy (ties broken at random) to choose an action from state.""" # Always greedy : exploration is assured by optimistic initial values q_sa_next = np.array([self.q_hat(state, a) for a in self._all_actions]) greedy_action_inds = np.where(q_sa_next == q_sa_next.max())[0] ind_action = np.random.choice(greedy_action_inds) # randomly choose between maximum q values action = self._all_actions[ind_action] return action def get_init_state(self): """Get a random starting position in the interval [-0.6, -0.4).""" x = np.random.uniform(low=-0.6, high=-0.4) x_dot = 0. return x, x_dot def is_terminal_state(self, state): return state[0] == self._pos_terminal def q_hat(self, state, action): """Compute the q value for the current state-action pair.""" x, x_dot = state if x == self._pos_terminal: return 0 x_s_a = self._iht.get_tiles(state, action) q = np.array([self._w[id_w] for id_w in x_s_a]).sum() return q def get_active_features(self, state, action): """Get an array containing the ids of the current active features.""" return self._iht.get_tiles(state, action) def run_sarsa_lambda(self, env, n_episodes, method): """Apply Sarsa(λ) algorithm. (p.305) :param env: environment to interact with. :param n_episodes: number of episodes to train on. :param method: specify the Sarsa(λ) method : * 'accumulating' : With accumulating traces ; * 'replace' : With replacing traces ; :return: None """ assert method in ['accumulating', 'replace'], 'Invalid method arg.' 
overflow_flag = False for i_ep in range(n_episodes): if overflow_flag: # Training diverged : set default worse value for all the remaining epochs self._n_step_hist.append(self.max_n_step) continue n_it = 0 state = self.get_init_state() action = self.policy(state) self._z = np.zeros(self._w.shape) running = True while(running): try: next_state, reward = env.step(state, action) n_it += 1 δ = reward δ -= self.q_hat(state, action) # q_hat(s) : implicit sum over F(s,a) (see book) self._z = update_trace_vector(agent=self, method=method, state=state, action=action) if self.is_terminal_state(next_state) or (n_it == self.max_n_step): self._w += (self._α/self._num_tilings) * δ * self._z running = False continue # go to next episode next_action = self.policy(next_state) δ += self._γ * self.q_hat(next_state, next_action) # q_hat(s') : implicit sum over F(s',a') (see book) self._w += (self._α/self._num_tilings) * δ * self._z state = next_state action = next_action except ValueError: overflow_msg = 'λ>0.9 : expected behavior !' if (self._λ > .9) else 'Training diverged, try a lower α.' print(f'Warning : Value overflow.| λ={self._λ} , α*num_tile={self._α} | ' + overflow_msg) # Training data lists will be fed with default worse values for all the remaining epochs. overflow_flag = True running = False continue if overflow_flag: n_it = self.max_n_step self._n_step_hist.append(n_it) class MountainCar: def __init__(self, lmbda, alpha): # Environment initialization self._env = MountainCarEnvironment() # Observation boundaries # (format : [[min_1, max_1], ..., [min_i, max_i], ... 
] for i in state's components # state = (x, x_dot)) obs_bounds = [[-1.2, 0.5], [-0.07, 0.07]] # Tiling parameters self._iht_args = {'iht_size': 2 ** 12, 'num_tilings': 10, 'tiling_size': 9, 'obs_bounds': obs_bounds} # Agent parameters mc_agent_args = {'iht_args': self._iht_args, 'alpha': alpha, 'lmbda': lmbda} # Agent initialization self._agent = MountainCarAgent(**mc_agent_args) @property def n_step_hist(self): return self._agent.n_step_hist def train(self, n_episodes, method): assert method in ['accumulating', 'replace'], 'Invalid method arg.' self._agent.run_sarsa_lambda(self._env, n_episodes=n_episodes, method=method) ############################################################################################# # 4. Cart and Pole # ############################################################################################# class CartPoleEnvironment: """Credit : https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py#L7""" def __init__(self): self.gravity = 9.8 self.masscart = 1.0 self.masspole = 0.1 self.total_mass = (self.masspole + self.masscart) self.length = 0.5 # actually half the pole's length self.polemass_length = (self.masspole * self.length) self.force_mag = 10.0 self.tau = 0.02 # seconds between state updates self.kinematics_integrator = 'euler' # Angle at which to fail the episode self.theta_threshold_radians = 12 * 2 * math.pi / 360 # Position at which to fail the episode self.x_threshold = 2.4 # Action space self._all_actions = [0, 1] # left, right def is_state_valid(self, state): x, _, theta, _ = state # Velocities aren't bounded, therefore cannot be checked. 
is_state_invalid = bool( x < -4.8 or x > 4.8 or theta < -0.418 or theta > 0.418 ) return not is_state_invalid def step(self, state, action): x, x_dot, theta, theta_dot = state force = self.force_mag if action == 1 else -self.force_mag costheta = math.cos(theta) sintheta = math.sin(theta) # For the interested reader: # https://coneural.org/florian/papers/05_cart_pole.pdf temp = (force + self.polemass_length * theta_dot ** 2 * sintheta) / self.total_mass thetaacc = (self.gravity * sintheta - costheta * temp) / (self.length * (4.0 / 3.0 - self.masspole * costheta ** 2 / self.total_mass)) xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass if self.kinematics_integrator == 'euler': x = x + self.tau * x_dot x_dot = x_dot + self.tau * xacc theta = theta + self.tau * theta_dot theta_dot = theta_dot + self.tau * thetaacc else: # semi-implicit euler x_dot = x_dot + self.tau * xacc x = x + self.tau * x_dot theta_dot = theta_dot + self.tau * thetaacc theta = theta + self.tau * theta_dot next_state = (x, x_dot, theta, theta_dot) reward = 1.0 return next_state, reward class CartPoleAgent: def __init__(self, iht_args, alpha, lmbda): # Index Hash Table for position encoding self._iht = IndexHashTable(**iht_args) # Weight vector self._w = np.zeros(iht_args['iht_size']) # Number of tilings self._num_tilings = iht_args['num_tilings'] # Eligibility trace self._z = np.zeros(self._w.shape) # Exponential weighting decrease self._λ = lmbda # Max number of failures (default worst n_failures) self.max_n_failures = 100000 # Action space self._all_actions = [0, 1] # Learning step-size self._α = alpha # Discount factor self._γ = 0.99 # Exploration ratio self._ε = 0.1 # Angle at which to fail the episode (12°) self.theta_threshold_radians = 12 * 2 * math.pi / 360 # Position at which to fail the episode self.x_threshold = 2.4 # Number of failures (updated while running Sarsa(λ)) self._n_failures = 0 @property def n_failures(self): return self._n_failures def 
policy(self, state): """Apply a ε-greedy policy to choose an action from state.""" if np.random.random_sample() < self._ε: action = self._all_actions[np.random.choice(range(len(self._all_actions)))] return action q_sa_next = np.array([self.q_hat(state, a) for a in self._all_actions]) greedy_action_inds = np.where(q_sa_next == q_sa_next.max())[0] ind_action = np.random.choice(greedy_action_inds) action = self._all_actions[ind_action] return action def is_state_valid(self, state): x, _, theta, _ = state is_state_invalid = bool( x < -4.8 or x > 4.8 or theta < -0.418 or theta > 0.418 ) return not is_state_invalid def get_init_state(self): """Get a random starting position.""" state = np.random.uniform(low=-0.05, high=0.05, size=(4,)) return state def is_state_over_bounds(self, state): """Returns True if the current state is out of bounds, i.e. the current run is over. Returns False otherwise.""" x, x_dot, theta, theta_dot = state return bool( x < -self.x_threshold or x > self.x_threshold or theta < -self.theta_threshold_radians or theta > self.theta_threshold_radians ) def q_hat(self, state, action): """Compute the q value for the current state-action pair.""" if self.is_state_over_bounds(state): return 0. x_s_a = self._iht.get_tiles(state, action) q = np.array([self._w[id_w] for id_w in x_s_a]).sum() return q def get_active_features(self, state, action): """Get an array containing the ids of the current active features.""" return self._iht.get_tiles(state, action) def run_sarsa_lambda(self, env, n_step_max, method): """Apply Sarsa(λ) algorithm. (p.305) :param env: environment to interact with. :param n_step_max: number of steps to train on. :param method: specify the Sarsa(λ) method : * 'accumulating' : With accumulating traces ; :return: None """ assert method in ['accumulating'], 'Invalid method arg.' 
n_step = 0 # number of steps across episodes n_ep = 0 # number of episode while n_step < n_step_max: n_ep += 1 n_step_try = 0 # number of steps in the current episode state = self.get_init_state() action = self.policy(state) self._z = np.zeros_like(self._w) running = True while running: try: next_state, reward = env.step(state, action) n_step_try += 1 n_step += 1 δ = reward δ -= self.q_hat(state, action) # q_hat(s) : implicit sum over F(s,a) (see book) self._z = update_trace_vector(agent=self, method=method, state=state, action=action) # End of run if n_step == n_step_max: running = False # Failed trial if self.is_state_over_bounds(next_state) : self._w += (self._α/self._num_tilings) * δ * self._z self._n_failures += 1 running = False continue next_action = self.policy(next_state) δ += self._γ * self.q_hat(next_state, next_action) # q_hat(s') : implicit sum over F(s',a') (see book) self._w += (self._α/self._num_tilings) * δ * self._z state = next_state action = next_action except ValueError: overflow_msg = 'λ>0.9 : expected behavior !' if (self._λ > .9) else 'Training diverged, try a lower α.' print(f'Warning : Value overflow.| λ={self._λ} , α*num_tile={self._α} | ' + overflow_msg) # Training metric is set with the default worse value. self._n_failures = self.max_n_failures running = False n_step = n_step_max continue #print('Running over. n_ep =', n_ep) class CartPole: def __init__(self, lmbda, alpha): # Environment initialization self._env = CartPoleEnvironment() # Observation boundaries # (format : [[min_1, max_1], ..., [min_i, max_i], ... ] for i in state's components. # state = (x, x_dot, theta, theta_dot) # "Fake" bounds have been set for velocity components to ease tiling.) 
obs_bounds = [[-4.8, 4.8], [-3., 3.], [-0.25, 0.25], [-3., 3.]] # Tiling parameters self._iht_args = {'iht_size': 2 ** 11, 'num_tilings': 2, 'tiling_size': 4, 'obs_bounds': obs_bounds} # Agent parameters pw_agent_args = {'iht_args': self._iht_args, 'alpha': alpha, 'lmbda': lmbda} # Agent initialization self._agent = CartPoleAgent(**pw_agent_args) @property def n_failures(self): return self._agent.n_failures def train(self, n_step_max, method): assert method in ['accumulating'], 'Invalid method' self._agent.run_sarsa_lambda(self._env, n_step_max=n_step_max, method=method) ############################################################################################# # 5. Puddle World # ############################################################################################# class PuddleWorldGrid: def __init__(self): # Grid dimensions self._h, self._w = (1, 1) # Distance to the top left corner that define the goal area self._goal_len = 0.01 # Position of puddle centers # format : ((i_center_a, j_center_a), (i_center_b, j_center_b)) self._pos_centers_puddles = [((.25, .1), (.25, .45)), ((.2, .45), (.6, .45))] # Puddle radius self._puddle_radius = 0.1 # Figure dimension for plotting self._fig_size = (10, 10) @property def height(self): return self._h @property def width(self): return self._w def is_state_goal(self, state): i,j = state g_i, g_j = (0., 1.) dist2goal = np.sqrt((i - g_i) ** 2 + (j - g_j) ** 2) return dist2goal <= self._goal_len def get_dist2puddle(self, state): """Get state's distance (float) to the nearest puddle's border. Returns a float corresponding to the state's distance to the nearest puddle border. Return -1 if the state to evaluate is far enough from puddles to be not affected by the cost penalty. 
""" i, j = state max_dist = -1 # puddle cost is defined by the maximal distance to border # Unpack puddle pos (p_horiz_ij_1, p_horiz_ij_2), (p_verti_ij_1, p_verti_ij_2) = self._pos_centers_puddles p_horiz_i_1, p_horiz_j_1 = p_horiz_ij_1 p_horiz_i_2, p_horiz_j_2 = p_horiz_ij_2 p_verti_i_1, p_verti_j_1 = p_verti_ij_1 p_verti_i_2, p_verti_j_2 = p_verti_ij_2 dist2centers = [np.sqrt((i - p_horiz_i_1) ** 2 + (j - p_horiz_j_1) ** 2), np.sqrt((i - p_horiz_i_2) ** 2 + (j - p_horiz_j_2) ** 2), np.sqrt((i - p_verti_i_1) ** 2 + (j - p_verti_j_1) ** 2), np.sqrt((i - p_verti_i_2) ** 2 + (j - p_verti_j_2) ** 2)] min_dist2centers = np.array(dist2centers).min() if min_dist2centers <= self._puddle_radius: dist2border = self._puddle_radius - min_dist2centers if max_dist < dist2border: max_dist = dist2border # Horizontal puddle axis if (j >= p_horiz_j_1) and (j <= p_horiz_j_2): dist2horiz_axis_p = np.abs(i - p_horiz_i_1) if (dist2horiz_axis_p <= self._puddle_radius): dist2border = self._puddle_radius - dist2horiz_axis_p if max_dist < dist2border: max_dist = dist2border # Vertical puddle axis if (i >= p_verti_i_1) and (i <= p_verti_i_2): dist2verti_axis_p = np.abs(j - p_verti_j_1) if dist2verti_axis_p <= self._puddle_radius: dist2border = self._puddle_radius - dist2verti_axis_p if max_dist < dist2border: max_dist = dist2border dist2puddle = max_dist return dist2puddle def cvt_ij2xy(self, pos_ij): return pos_ij[1], self._h - pos_ij[0] def draw(self): fig, ax = plt.subplots(1, 1, figsize=self._fig_size) # Goal corner goal_area_ij = [[0, self._w], [0, self._w - self._goal_len], [self._goal_len, self._w]] goal_area_xy = [self.cvt_ij2xy(pos_ij) for pos_ij in goal_area_ij] goal_triangle = plt.Polygon(goal_area_xy, color='tab:green') ax.add_patch(goal_triangle) for i in tqdm(np.arange(0., 1., 0.005), desc='Creating map'): for j in np.arange(0., 1., 0.005): dist = self.get_dist2puddle(state=(i,j)) if dist == -1: continue # far from puddles # Grayscale : min=0.25 , max=0.75 color_intensity = 
(dist / self._puddle_radius) * (0.75 - 0.25) + 0.25 x, y = self.cvt_ij2xy((i, j)) dot = plt.Circle((x, y), 0.002, color=str(color_intensity)) ax.add_patch(dot) ax.set_xlim(0, self._w) ax.set_ylim(0, self._h) ax.set_title('PUDDLE WORLD', fontsize=18) # plt.waitforbuttonpress() # Export export_name = 'puddleworld_map' plt.savefig(export_name) print(f'Puddle world map exported as : {export_name}.png') class PuddleWorldEnvironment: def __init__(self, grid): # Grid object self._grid = grid # Action space self._all_actions = [(i, j) for i in range(-1, 2) for j in range(-1, 2) if abs(i) != abs(j)] # Step size when taking an action in a certain direction self._step_range = 0.05 # Transition cost self._step_cost = -1 # Transition cost when the agent walks on puddles self._puddle_cost = -400 def step(self, state, action): # Random gaussian noise (std=0.01) on each move move_noise = np.random.normal(0, 0.01, len(state)) # Move next_state = np.array(state) + self._step_range*np.array(action) + move_noise next_state = (np.clip(next_state[0], a_min=0, a_max=self._grid.height), np.clip(next_state[1], a_min=0, a_max=self._grid.width)) # Cost dist2puddle = self._grid.get_dist2puddle(next_state) is_far_from_puddles = dist2puddle == -1 reward = self._step_cost if is_far_from_puddles else self._puddle_cost * dist2puddle return next_state, reward class PuddleWorldAgent: def __init__(self, grid, alpha, lmbda, iht_args): # Index Hash Table for position encoding self._iht = IndexHashTable(**iht_args) # Weight vector self._w = np.zeros(iht_args['iht_size']) # Number of tilings self._num_tilings = iht_args['num_tilings'] # Action space self._all_actions = [(i, j) for i in range(-1, 2) for j in range(-1, 2) if abs(i) != abs(j)] # Grid object (uses metadata to check state validity) self._grid = grid # Learning step-size self._α = alpha # Exponential weighting decrease self._λ = lmbda # Discount factor self._γ = 1. 
# Exploration ratio self._ε = 0.1 # Cost computed for each episode self._cost_per_ep_hist = [] @property def cost_per_ep_hist(self): return self._cost_per_ep_hist def policy(self, state): """Apply a ε-greedy policy to choose an action from state.""" if np.random.random_sample() < self._ε: action = self._all_actions[np.random.choice(range(len(self._all_actions)))] return action q_hat = np.array([self.q_hat(state, a) for a in self._all_actions]) greedy_action_inds = np.where(q_hat == q_hat.max())[0] ind_action = np.random.choice(greedy_action_inds) action = self._all_actions[ind_action] return action def get_start_pos(self): """Randomly pick a non-goal state as starting position.""" i_pos, j_pos = -1, -1 is_init_pos_found = False while not is_init_pos_found: # The grid is continuous (height = width = 1) : sample real-valued coordinates. # (randint(low=0, high=1) would always return 0, i.e. a fixed start at (0, 0).) i_pos = np.random.uniform(low=0., high=self._grid.height) j_pos = np.random.uniform(low=0., high=self._grid.width) if not self._grid.is_state_goal((i_pos, j_pos)): is_init_pos_found = True assert (i_pos != -1) and (j_pos != -1), 'Error while looking for an init position.' return i_pos, j_pos def is_terminal_state(self, state): return self._grid.is_state_goal(state) def q_hat(self, state, action): """Compute the q value for the current state-action pair.""" if self.is_terminal_state(state): return 0. x_s_a = self._iht.get_tiles(state, action) q = np.array([self._w[id_w] for id_w in x_s_a]).sum() return q def get_active_features(self, state, action): """Get an array containing the ids of the current active features.""" return self._iht.get_tiles(state, action) def run_sarsa_lambda(self, env, n_episodes, method): """Apply Sarsa(λ) algorithm. (p.305) :param env: environment to interact with. :param n_episodes: number of episodes to train on. :param method: specify the Sarsa(λ) method : * 'accumulating' : With accumulating traces ; * 'replace' : With replacing traces ; * 'replace_reset' : With replacing traces, and clearing the traces of other actions. 
:return: None """ assert method in ['replace_reset'], 'Invalid method arg.' for i_ep in range(n_episodes): cumu_reward = 0 # cumulated reward state = self.get_start_pos() action = self.policy(state) self._z = np.zeros(self._w.shape) running = True while(running): next_state, reward = env.step(state, action) cumu_reward += reward δ = reward δ -= self.q_hat(state, action) # q_hat(s) : implicit sum over F(s,a) (see book) self._z = update_trace_vector(agent=self, method=method, state=state, action=action) # The episode ends as soon as the next state is terminal if self.is_terminal_state(next_state): self._w += (self._α / self._num_tilings) * δ * self._z running = False continue # go to next episode next_action = self.policy(next_state) δ += self._γ * self.q_hat(next_state, next_action) # q_hat(s') : implicit sum over F(s',a') (see book) self._w += (self._α / self._num_tilings) * δ * self._z state = next_state action = next_action self._cost_per_ep_hist.append(-cumu_reward) class PuddleWorld: def __init__(self, lmbda, alpha): # Grid initialization self._grid = PuddleWorldGrid() # Environment initialization self._env = PuddleWorldEnvironment(grid=self._grid) # Observation boundaries # (format : [[min_1, max_1], ..., [min_i, max_i], ... ] for i in state's components # state = (i, j)) obs_bounds = [[0., 1.], [0., 1.]] # Tiling parameters iht_args = {'iht_size': 2**10, 'num_tilings': 5, 'tiling_size': 5, 'obs_bounds': obs_bounds} # Agent parameters pw_agent_args = {'grid': self._grid, 'alpha': alpha, 'lmbda': lmbda, 'iht_args': iht_args} # Agent initialization self._agent = PuddleWorldAgent(**pw_agent_args) @property def cost_per_ep_hist(self): return self._agent.cost_per_ep_hist def draw(self): self._grid.draw() def train(self, n_episodes, method): assert method in ['replace_reset'], 'Invalid method arg.' self._agent.run_sarsa_lambda(self._env, n_episodes=n_episodes, method=method) def get_puddle_world_map(): """Creates the puddle world map and saves the figure in the local folder as a .png file.""" pw = PuddleWorld(0., 0.) 
# dummy args pw.draw() ############################################################################################# # 5. Results # ############################################################################################# #---------------------------# # 5.1. Getting plot data # #---------------------------# def get_random_walk_plot_data(): n_episodes = 10 n_runs = 1000 lambda_range = {'replace': [0., 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.975, 0.99, 1.], 'accumulating': [0., 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.975, 0.99, 1.]} # Optimal alpha coefficient for each lambda alpha_range = {'replace': [0.8, 0.8, 0.8, 0.6, 0.6, 0.4, 0.4, 0.4, 0.3, 0.3], 'accumulating': [0.8, 0.8, 0.8, 0.6, 0.3, 0.2, 0.1, 0.05, 0.03, 0.01]} all_df_vis = {} for method in ['replace', 'accumulating']: # 'accumulating' # replace rms_per_ep = [] for n_run in tqdm(range(n_runs), desc=f'RANDOM WALK | method={method}'): for λ, α in zip(lambda_range[method], alpha_range[method]): np.random.seed(n_run) # Make sure that each runs of both algorithms experiences the same trial random_walk = RandomWalk(lmbda=λ, alpha=α) random_walk.train(n_episodes=n_episodes, method=method) rms_per_ep.append([λ, np.array(random_walk.error_hist).mean()]) df_str_key = f'{method}' all_df_vis[df_str_key] = pd.DataFrame(np.array(rms_per_ep), columns=['lambda', 'rms']) all_df_vis[df_str_key]['method'] = np.full(len(rms_per_ep), method) # No error bar on the book's RandomWalk figure df_vis = pd.concat(all_df_vis.values(), ignore_index=True) df_vis = df_vis.groupby(['lambda', 'method'])['rms'].mean().reset_index() return df_vis def get_mountain_car_plot_data(): n_episodes = 20 n_runs = 30 lambda_range = {'replace': [0., 0.4, 0.7, 0.8, 0.9, 0.95, 0.99, 1.], 'accumulating': [0., 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.]} # Optimal alpha coefficient for each lambda alpha_range = {'replace': [1.4, 1.7, 1.5, 1.7, 1.7, 1.2, 0.6, 0.4], 'accumulating': [1.4, 1.3, 1.0, 0.8, 0.6, 0.5, 0.3, 0.15, 0.03, 0.01]} all_df_vis = {} for method in 
['replace', 'accumulating']: n_step_per_ep = [] for n_run in tqdm(range(n_runs), desc=f'MOUNTAIN CAR | method={method}'): for λ, α in zip(lambda_range[method], alpha_range[method]): np.random.seed(n_run) mountain_car = MountainCar(lmbda=λ, alpha=α) mountain_car.train(n_episodes=n_episodes, method=method) n_step_per_ep.append([λ, np.array(mountain_car.n_step_hist).mean()]) df_str_key = f'{method}' all_df_vis[df_str_key] = pd.DataFrame(np.array(n_step_per_ep), columns=['lambda', 'steps']) all_df_vis[df_str_key]['method'] = np.full(len(n_step_per_ep), method) df_vis = pd.concat(all_df_vis.values(), ignore_index=True) return df_vis def get_cart_pole_plot_data(): n_step_max = 100000 n_runs = 30 method = 'accumulating' lambda_range = [0., 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.99] # Optimal alpha coefficient for each lambda alpha_range = [.3, .3, .3, .3, .1, .1, .1, .1] all_df_vis = {} n_fail_lambdas = [] for n_run in tqdm(range(n_runs), desc=f'CART POLE | method={method}'): for λ, α in zip(lambda_range, alpha_range): np.random.seed(n_run) # Make sure that each runs of both algorithms experiences the same trial cart_pole = CartPole(lmbda=λ, alpha=α) cart_pole.train(n_step_max=n_step_max, method=method) n_fail_lambdas.append([λ, cart_pole.n_failures]) df_str_key = f'{n_runs}' all_df_vis[df_str_key] = pd.DataFrame(np.array(n_fail_lambdas), columns=['lambda', 'n_fails']) all_df_vis[df_str_key]['method'] = np.full(len(n_fail_lambdas), method) df_vis = pd.concat(all_df_vis.values()) return df_vis def get_puddle_world_plot_data(): n_episodes = 40 n_runs = 30 method = 'replace_reset' lambda_range = [0., 0.5, 0.8, 0.9, 0.95, 0.98, 0.99, 1.] 
    # Optimal alpha coefficient for each lambda
    alpha_range = [.7, .7, .5, .5, .5, .5, .5, .3]
    all_df_vis = {}
    rand_seed = 0
    np.random.seed(rand_seed)
    cost_per_ep = []
    for n_run in tqdm(range(n_runs), desc=f'PUDDLE WORLD | method={method}'):
        for λ, α in zip(lambda_range, alpha_range):
            puddle_world = PuddleWorld(lmbda=λ, alpha=α)
            puddle_world.train(n_episodes=n_episodes, method=method)
            cost_per_ep.append([λ, np.array(puddle_world.cost_per_ep_hist).mean()])
    df_str_key = f'{rand_seed}'
    all_df_vis[df_str_key] = pd.DataFrame(np.array(cost_per_ep), columns=['lambda', 'cost'])
    all_df_vis[df_str_key]['method'] = np.full(len(cost_per_ep), method)
    df_vis = pd.concat(all_df_vis.values())
    return df_vis


#----------------------------------#
#  5.2. Reproducing figure 12.14   #
#----------------------------------#

def figure_12_14():
    # Get plot data for each task
    df_rw = get_random_walk_plot_data()
    df_mc = get_mountain_car_plot_data()
    df_cp = get_cart_pole_plot_data()
    df_pw = get_puddle_world_plot_data()

    fig, axes = plt.subplots(2, 2, figsize=(12, 12))

    # Mountain Car
    sns.lineplot(data=df_mc, x='lambda', y='steps', hue='method', style='method',
                 ax=axes[0, 0], marker='o', ci=68, err_style="bars")
    axes[0, 0].set_xlabel('λ')
    axes[0, 0].set_ylabel('Steps per episode')
    axes[0, 0].set_title(f"MOUNTAIN CAR", fontsize=15)
    axes[0, 0].set_ylim(0, 500)

    # Random walk
    sns.lineplot(data=df_rw, x='lambda', y='rms', hue='method', style='method',
                 ax=axes[0, 1], marker='o')
    axes[0, 1].set_xlabel('λ')
    axes[0, 1].set_ylabel('RMS error')
    axes[0, 1].set_title(f"RANDOM WALK", fontsize=15)
    axes[0, 1].set_ylim(0.2, 0.6)

    # Puddle world
    sns.lineplot(data=df_pw, x='lambda', y='cost', hue='method',
                 ax=axes[1, 0], marker='o', ci=68, err_style="bars")
    axes[1, 0].set_xlabel('λ')
    axes[1, 0].set_ylabel('Cost per episode')
    axes[1, 0].set_title(f"PUDDLE WORLD", fontsize=15)

    # Cart and pole
    sns.lineplot(data=df_cp, x='lambda', y='n_fails', hue='method',
                 ax=axes[1, 1], marker='o', ci=68, err_style="bars")
    axes[1, 1].set_xlabel('λ')
    axes[1, 1].set_ylabel('Failures per 100 000 steps')
    axes[1, 1].set_title(f"CART AND POLE", fontsize=15)

    plt.savefig('combined_fig_test')
    #plt.waitforbuttonpress()


#--------------#
#  5.3. Main   #
#--------------#

if __name__ == '__main__':
    figure_12_14()  # ~2h on colab
    #get_puddle_world_map()

================================================
FILE: chapter12/mountain_car.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2017-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from math import floor
from tqdm import tqdm

#######################################################################
# Following are some utilities for tile coding from Rich.
# To make each file self-contained, I copied them from
# http://incompleteideas.net/tiles/tiles3.py-remove
# with some naming convention changes
#
# Tile coding starts
class IHT:
    "Structure to handle collisions"
    def __init__(self, size_val):
        self.size = size_val
        self.overfull_count = 0
        self.dictionary = {}

    def count(self):
        return len(self.dictionary)

    def full(self):
        return len(self.dictionary) >= self.size

    def get_index(self, obj, read_only=False):
        d = self.dictionary
        if obj in d:
            return d[obj]
        elif read_only:
            return None
        size = self.size
        count = self.count()
        if count >= size:
            if self.overfull_count == 0:
                print('IHT full, starting to allow collisions')
            self.overfull_count += 1
            return hash(obj) % self.size
        else:
            d[obj] = count
            return count

def hash_coords(coordinates, m, read_only=False):
    if isinstance(m, IHT):
        return m.get_index(tuple(coordinates), read_only)
    if isinstance(m, int):
        return hash(tuple(coordinates)) % m
    if m is None:
        return coordinates

def tiles(iht_or_size, num_tilings, floats, ints=None, read_only=False):
    """returns num-tilings tile indices corresponding to the floats and ints"""
    if ints is None:
        ints = []
    qfloats = [floor(f * num_tilings) for f in floats]
    tiles = []
    for tiling in range(num_tilings):
        tilingX2 = tiling * 2
        coords = [tiling]
        b = tiling
        for q in qfloats:
            coords.append((q + b) // num_tilings)
            b += tilingX2
        coords.extend(ints)
        tiles.append(hash_coords(coords, iht_or_size, read_only))
    return tiles
# Tile coding ends
#######################################################################

# all possible actions
ACTION_REVERSE = -1
ACTION_ZERO = 0
ACTION_FORWARD = 1
# order is important
ACTIONS = [ACTION_REVERSE, ACTION_ZERO, ACTION_FORWARD]

# bound for position and velocity
POSITION_MIN = -1.2
POSITION_MAX = 0.5
VELOCITY_MIN = -0.07
VELOCITY_MAX = 0.07

# discount is always 1.0 in these experiments
DISCOUNT = 1.0

# use optimistic initial value, so it's ok to set epsilon to 0
EPSILON = 0

# maximum steps per episode
STEP_LIMIT = 5000

# take an @action at @position and @velocity
# @return: new position, new velocity, reward (always -1)
def step(position, velocity, action):
    new_velocity = velocity + 0.001 * action - 0.0025 * np.cos(3 * position)
    new_velocity = min(max(VELOCITY_MIN, new_velocity), VELOCITY_MAX)
    new_position = position + new_velocity
    new_position = min(max(POSITION_MIN, new_position), POSITION_MAX)
    reward = -1.0
    if new_position == POSITION_MIN:
        new_velocity = 0.0
    return new_position, new_velocity, reward

# accumulating trace update rule
# @trace: old trace (will be modified)
# @active_tiles: current active tile indices
# @lam: lambda
# @return: new trace for convenience
def accumulating_trace(trace, active_tiles, lam):
    trace *= lam * DISCOUNT
    trace[active_tiles] += 1
    return trace

# replacing trace update rule
# @trace: old trace (will be modified)
# @active_tiles: current active tile indices
# @lam: lambda
# @return: new trace for convenience
def replacing_trace(trace, active_tiles, lam):
    # boolean mask over all tiles; np.isin replaces the deprecated np.in1d
    active = np.isin(np.arange(len(trace)), active_tiles)
    trace[active] = 1
    trace[~active] *= lam * DISCOUNT
    return trace

# replacing trace update rule, 'clearing' means set all tiles corresponding to non-selected actions to 0
# @trace: old trace (will be modified)
# @active_tiles: current active tile indices
# @lam: lambda
# @clearing_tiles: tiles to be cleared
# @return: new trace for convenience
def replacing_trace_with_clearing(trace, active_tiles, lam, clearing_tiles):
    active = np.isin(np.arange(len(trace)), active_tiles)
    trace[~active] *= lam * DISCOUNT
    trace[clearing_tiles] = 0
    trace[active] = 1
    return trace

# dutch trace update rule
# @trace: old trace (will be modified)
# @active_tiles: current active tile indices
# @lam: lambda
# @alpha: step size for all tiles
# @return: new trace for convenience
def dutch_trace(trace, active_tiles, lam, alpha):
    coef = 1 - alpha * DISCOUNT * lam * np.sum(trace[active_tiles])
    trace *= DISCOUNT * lam
    trace[active_tiles] += coef
    return trace

# wrapper class for Sarsa(lambda)
class Sarsa:
    # In this example I use the tiling software instead of implementing standard tiling myself
    # One important thing is that tiling is only a map from (state, action) to a series of indices
    # It doesn't matter whether the indices have meaning, only that this map satisfies some properties
    # View the following webpage for more information
    # http://incompleteideas.net/sutton/tiles/tiles3.html
    # @max_size: the maximum # of indices
    def __init__(self, step_size, lam, trace_update=accumulating_trace, num_of_tilings=8, max_size=2048):
        self.max_size = max_size
        self.num_of_tilings = num_of_tilings
        self.trace_update = trace_update
        self.lam = lam

        # divide step size equally to each tiling
        self.step_size = step_size / num_of_tilings

        self.hash_table = IHT(max_size)

        # weight for each tile
        self.weights = np.zeros(max_size)

        # trace for each tile
        self.trace = np.zeros(max_size)

        # position and velocity need scaling to satisfy the tile software
        self.position_scale = self.num_of_tilings / (POSITION_MAX - POSITION_MIN)
        self.velocity_scale = self.num_of_tilings / (VELOCITY_MAX - VELOCITY_MIN)

    # get indices of active tiles for given state and action
    def get_active_tiles(self, position, velocity, action):
        # I think position_scale * (position - POSITION_MIN) would be a good normalization.
        # However position_scale * POSITION_MIN is a constant, so it's ok to ignore it.
        active_tiles = tiles(self.hash_table, self.num_of_tilings,
                             [self.position_scale * position, self.velocity_scale * velocity],
                             [action])
        return active_tiles

    # estimate the value of given state and action
    def value(self, position, velocity, action):
        if position == POSITION_MAX:
            return 0.0
        active_tiles = self.get_active_tiles(position, velocity, action)
        return np.sum(self.weights[active_tiles])

    # learn with given state, action and target
    def learn(self, position, velocity, action, target):
        active_tiles = self.get_active_tiles(position, velocity, action)
        estimation = np.sum(self.weights[active_tiles])
        delta = target - estimation
        if self.trace_update == accumulating_trace or self.trace_update == replacing_trace:
            self.trace_update(self.trace, active_tiles, self.lam)
        elif self.trace_update == dutch_trace:
            self.trace_update(self.trace, active_tiles, self.lam, self.step_size)
        elif self.trace_update == replacing_trace_with_clearing:
            clearing_tiles = []
            for act in ACTIONS:
                if act != action:
                    clearing_tiles.extend(self.get_active_tiles(position, velocity, act))
            self.trace_update(self.trace, active_tiles, self.lam, clearing_tiles)
        else:
            raise Exception('Unexpected Trace Type')
        self.weights += self.step_size * delta * self.trace

    # get # of steps to reach the goal under current state value function
    def cost_to_go(self, position, velocity):
        costs = []
        for action in ACTIONS:
            costs.append(self.value(position, velocity, action))
        return -np.max(costs)

# get action at @position and @velocity based on epsilon greedy policy and @valueFunction
def get_action(position, velocity, valueFunction):
    if np.random.binomial(1, EPSILON) == 1:
        return np.random.choice(ACTIONS)
    values = []
    for action in ACTIONS:
        values.append(valueFunction.value(position, velocity, action))
    # ACTIONS is [-1, 0, 1], so shift the argmax index by -1 to get the action
    return np.argmax(values) - 1

# play Mountain Car for one episode based on given method @evaluator
# @return: total steps in this episode
def play(evaluator):
    position = np.random.uniform(-0.6, -0.4)
    velocity = 0.0
    action = get_action(position, velocity, evaluator)
    steps = 0
    while True:
        next_position, next_velocity, reward = step(position, velocity, action)
        next_action = get_action(next_position, next_velocity, evaluator)
        steps += 1
        target = reward + DISCOUNT * evaluator.value(next_position, next_velocity, next_action)
        evaluator.learn(position, velocity, action, target)
        position = next_position
        velocity = next_velocity
        action = next_action
        if next_position == POSITION_MAX:
            break
        if steps >= STEP_LIMIT:
            print('Step Limit Exceeded!')
            break
    return steps

# figure 12.10, effect of the lambda and alpha on early performance of Sarsa(lambda)
def figure_12_10():
    runs = 30
    episodes = 50
    alphas = np.arange(1, 8) / 4.0
    lams = [0.99, 0.95, 0.5, 0]

    steps = np.zeros((len(lams), len(alphas), runs, episodes))
    for lamInd, lam in enumerate(lams):
        for alphaInd, alpha in enumerate(alphas):
            for run in tqdm(range(runs)):
                evaluator = Sarsa(alpha, lam, replacing_trace)
                for ep in range(episodes):
                    step = play(evaluator)
                    steps[lamInd, alphaInd, run, ep] = step

    # average over episodes
    steps = np.mean(steps, axis=3)

    # average over runs
    steps = np.mean(steps, axis=2)

    for lamInd, lam in enumerate(lams):
        plt.plot(alphas, steps[lamInd, :], label='lambda = %s' % (str(lam)))
    plt.xlabel('alpha * # of tilings (8)')
    plt.ylabel('averaged steps per episode')
    plt.ylim([180, 300])
    plt.legend()

    plt.savefig('../images/figure_12_10.png')
    plt.close()

# figure 12.11, summary comparison of Sarsa(lambda) algorithms
# I use 8 tilings rather than 10 tilings
def figure_12_11():
    traceTypes = [dutch_trace, replacing_trace, replacing_trace_with_clearing, accumulating_trace]
    alphas = np.arange(0.2, 2.2, 0.2)
    episodes = 20
    runs = 30
    lam = 0.9
    rewards = np.zeros((len(traceTypes), len(alphas), runs, episodes))

    for traceInd, trace in enumerate(traceTypes):
        for alphaInd, alpha in enumerate(alphas):
            for run in tqdm(range(runs)):
                evaluator = Sarsa(alpha, lam, trace)
                for ep in range(episodes):
                    if trace == accumulating_trace and alpha > 0.6:
                        # skip the episode and record the step limit for this setting
                        steps = STEP_LIMIT
                    else:
                        steps = play(evaluator)
                    rewards[traceInd, alphaInd, run, ep] = -steps

    # average over episodes
    rewards = np.mean(rewards, axis=3)

    # average over runs
    rewards = np.mean(rewards, axis=2)

    for traceInd, trace in enumerate(traceTypes):
        plt.plot(alphas, rewards[traceInd, :], label=trace.__name__)
    plt.xlabel('alpha * # of tilings (8)')
    plt.ylabel('averaged rewards per episode')
    plt.ylim([-550, -150])
    plt.legend()

    plt.savefig('../images/figure_12_11.png')
    plt.close()

if __name__ == '__main__':
    figure_12_10()
    figure_12_11()

================================================
FILE: chapter12/random_walk.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm

# all states
N_STATES = 19

# all states but terminal states
STATES = np.arange(1, N_STATES + 1)

# start from the middle state
START_STATE = 10

# two terminal states
# an action leading to the left terminal state has reward -1
# an action leading to the right terminal state has reward 1
END_STATES = [0, N_STATES + 1]

# true state values from Bellman equation
TRUE_VALUE = np.arange(-20, 22, 2) / 20.0
TRUE_VALUE[0] = TRUE_VALUE[N_STATES + 1] = 0.0

# base class for lambda-based algorithms in this chapter
# In this example, we use the simplest linear feature function, state aggregation.
# And we use exactly 19 groups, so the weight for each group is exactly the value of that state
class ValueFunction:
    # @rate: lambda; as it's a keyword in Python, I call it rate
    # @step_size: alpha, step size for update
    def __init__(self, rate, step_size):
        self.rate = rate
        self.step_size = step_size
        self.weights = np.zeros(N_STATES + 2)

    # the state value is just the weight
    def value(self, state):
        return self.weights[state]

    # feed the algorithm with new observation
    # derived class should override this function
    def learn(self, state, reward):
        return

    # initialize some variables at the beginning of each episode
    # must be called at the very beginning of each episode
    # derived class should override this function
    def new_episode(self):
        return

# Off-line lambda-return algorithm
class OffLineLambdaReturn(ValueFunction):
    def __init__(self, rate, step_size):
        ValueFunction.__init__(self, rate, step_size)
        # To accelerate learning, set a truncate value for power of lambda
        self.rate_truncate = 1e-3

    def new_episode(self):
        # initialize the trajectory
        self.trajectory = [START_STATE]
        # only need to track the last reward in one episode, as all others are 0
        self.reward = 0.0

    def learn(self, state, reward):
        # add the new state to the trajectory
        self.trajectory.append(state)
        if state in END_STATES:
            # start off-line learning once the episode ends
            self.reward = reward
            self.T = len(self.trajectory) - 1
            self.off_line_learn()

    # get the n-step return from the given time
    def n_step_return_from_time(self, n, time):
        # gamma is always 1 and rewards are zero except for the last reward
        # the formula can be simplified
        end_time = min(time + n, self.T)
        returns = self.value(self.trajectory[end_time])
        if end_time == self.T:
            returns += self.reward
        return returns

    # get the lambda-return from the given time
    def lambda_return_from_time(self, time):
        returns = 0.0
        lambda_power = 1
        for n in range(1, self.T - time):
            returns += lambda_power * self.n_step_return_from_time(n, time)
            lambda_power *= self.rate
            if lambda_power < self.rate_truncate:
                # If the power of lambda has become too small, discard all the following terms
                break
        returns *= 1 - self.rate
        if lambda_power >= self.rate_truncate:
            returns += lambda_power * self.reward
        return returns

    # perform off-line learning at the end of an episode
    def off_line_learn(self):
        for time in range(self.T):
            # update for each state in the trajectory
            state = self.trajectory[time]
            delta = self.lambda_return_from_time(time) - self.value(state)
            delta *= self.step_size
            self.weights[state] += delta

# TD(lambda) algorithm
class TemporalDifferenceLambda(ValueFunction):
    def __init__(self, rate, step_size):
        ValueFunction.__init__(self, rate, step_size)
        self.new_episode()

    def new_episode(self):
        # initialize the eligibility trace
        self.eligibility = np.zeros(N_STATES + 2)
        # initialize the beginning state
        self.last_state = START_STATE

    def learn(self, state, reward):
        # update the eligibility trace and weights
        self.eligibility *= self.rate
        self.eligibility[self.last_state] += 1
        delta = reward + self.value(state) - self.value(self.last_state)
        delta *= self.step_size
        self.weights += delta * self.eligibility
        self.last_state = state

# True online TD(lambda) algorithm
class TrueOnlineTemporalDifferenceLambda(ValueFunction):
    def __init__(self, rate, step_size):
        ValueFunction.__init__(self, rate, step_size)

    def new_episode(self):
        # initialize the eligibility trace
        self.eligibility = np.zeros(N_STATES + 2)
        # initialize the beginning state
        self.last_state = START_STATE
        # initialize the old state value
        self.old_state_value = 0.0

    def learn(self, state, reward):
        # update the eligibility trace and weights
        last_state_value = self.value(self.last_state)
        state_value = self.value(state)
        dutch = 1 - self.step_size * self.rate * self.eligibility[self.last_state]
        self.eligibility *= self.rate
        self.eligibility[self.last_state] += dutch
        delta = reward + state_value - last_state_value
        self.weights += self.step_size * (delta + last_state_value - self.old_state_value) * self.eligibility
        self.weights[self.last_state] -= self.step_size * (last_state_value - self.old_state_value)
        self.old_state_value = state_value
        self.last_state = state

# 19-state random walk
def random_walk(value_function):
    value_function.new_episode()
    state = START_STATE
    while state not in END_STATES:
        next_state = state + np.random.choice([-1, 1])
        if next_state == 0:
            reward = -1
        elif next_state == N_STATES + 1:
            reward = 1
        else:
            reward = 0
        value_function.learn(next_state, reward)
        state = next_state

# general plot framework
# @value_function_generator: generate an instance of value function
# @runs: specify the number of independent runs
# @lambdas: a series of different lambda values
# @alphas: sequences of step size for each lambda
def parameter_sweep(value_function_generator, runs, lambdas, alphas):
    # play for 10 episodes for each run
    episodes = 10
    # track the rms errors
    errors = [np.zeros(len(alphas_)) for alphas_ in alphas]
    for run in tqdm(range(runs)):
        for lambdaIndex, rate in enumerate(lambdas):
            for alphaIndex, alpha in enumerate(alphas[lambdaIndex]):
                valueFunction = value_function_generator(rate, alpha)
                for episode in range(episodes):
                    random_walk(valueFunction)
                    stateValues = [valueFunction.value(state) for state in STATES]
                    errors[lambdaIndex][alphaIndex] += np.sqrt(np.mean(np.power(stateValues - TRUE_VALUE[1: -1], 2)))

    # average over runs and episodes
    for error in errors:
        error /= episodes * runs

    for i in range(len(lambdas)):
        plt.plot(alphas[i], errors[i], label='lambda = ' + str(lambdas[i]))
    plt.xlabel('alpha')
    plt.ylabel('RMS error')
    plt.legend()

# Figure 12.3: Off-line lambda-return algorithm
def figure_12_3():
    lambdas = [0.0, 0.4, 0.8, 0.9, 0.95, 0.975, 0.99, 1]
    alphas = [np.arange(0, 1.1, 0.1),
              np.arange(0, 1.1, 0.1),
              np.arange(0, 1.1, 0.1),
              np.arange(0, 1.1, 0.1),
              np.arange(0, 1.1, 0.1),
              np.arange(0, 0.55, 0.05),
              np.arange(0, 0.22, 0.02),
              np.arange(0, 0.11, 0.01)]
    parameter_sweep(OffLineLambdaReturn, 50, lambdas, alphas)
    plt.savefig('../images/figure_12_3.png')
    plt.close()

# Figure 12.6: TD(lambda) algorithm
def figure_12_6():
    lambdas = [0.0, 0.4, 0.8, 0.9, 0.95, 0.975, 0.99, 1]
    alphas = [np.arange(0, 1.1, 0.1),
              np.arange(0, 1.1, 0.1),
              np.arange(0, 0.99, 0.09),
              np.arange(0, 0.55, 0.05),
              np.arange(0, 0.33, 0.03),
              np.arange(0, 0.22, 0.02),
              np.arange(0, 0.11, 0.01),
              np.arange(0, 0.044, 0.004)]
    parameter_sweep(TemporalDifferenceLambda, 50, lambdas, alphas)

    plt.savefig('../images/figure_12_6.png')
    plt.close()

# Figure 12.8: True online TD(lambda) algorithm
def figure_12_8():
    lambdas = [0.0, 0.4, 0.8, 0.9, 0.95, 0.975, 0.99, 1]
    alphas = [np.arange(0, 1.1, 0.1),
              np.arange(0, 1.1, 0.1),
              np.arange(0, 1.1, 0.1),
              np.arange(0, 1.1, 0.1),
              np.arange(0, 1.1, 0.1),
              np.arange(0, 0.88, 0.08),
              np.arange(0, 0.44, 0.04),
              np.arange(0, 0.11, 0.01)]
    parameter_sweep(TrueOnlineTemporalDifferenceLambda, 50, lambdas, alphas)

    plt.savefig('../images/figure_12_8.png')
    plt.close()

if __name__ == '__main__':
    figure_12_3()
    figure_12_6()
    figure_12_8()

================================================
FILE: chapter13/short_corridor.py
================================================
#######################################################################
# Copyright (C)                                                       #
# 2018 Sergii Bondariev (sergeybondarev@gmail.com)                    #
# 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)                  #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from tqdm import tqdm

def true_value(p):
    """ True value of the first state
    Args:
        p (float): probability of the action 'right'.
    Returns:
        True value of the first state.
        The expression is obtained by manually solving the easy linear system
        of Bellman equations using known dynamics.
    """
    return (2 * p - 4) / (p * (1 - p))

class ShortCorridor:
    """
    Short corridor environment, see Example 13.1
    """
    def __init__(self):
        self.reset()

    def reset(self):
        self.state = 0

    def step(self, go_right):
        """
        Args:
            go_right (bool): chosen action
        Returns:
            tuple of (reward, episode terminated?)
        """
        if self.state == 0 or self.state == 2:
            if go_right:
                self.state += 1
            else:
                self.state = max(0, self.state - 1)
        else:
            # state 1: the effect of the two actions is reversed
            if go_right:
                self.state -= 1
            else:
                self.state += 1

        if self.state == 3:
            # terminal state
            return 0, True
        else:
            return -1, False

def softmax(x):
    t = np.exp(x - np.max(x))
    return t / np.sum(t)

class ReinforceAgent:
    """
    ReinforceAgent that follows algorithm
    'REINFORCE Monte-Carlo Policy-Gradient Control (episodic)'
    """
    def __init__(self, alpha, gamma):
        # set values such that initial conditions correspond to left-epsilon greedy
        self.theta = np.array([-1.47, 1.47])
        self.alpha = alpha
        self.gamma = gamma
        # first column - left, second - right
        self.x = np.array([[0, 1],
                           [1, 0]])
        self.rewards = []
        self.actions = []

    def get_pi(self):
        h = np.dot(self.theta, self.x)
        t = np.exp(h - np.max(h))
        pmf = t / np.sum(t)
        # never become deterministic,
        # guarantees episode finish
        imin = np.argmin(pmf)
        epsilon = 0.05

        if pmf[imin] < epsilon:
            pmf[:] = 1 - epsilon
            pmf[imin] = epsilon

        return pmf

    def get_p_right(self):
        return self.get_pi()[1]

    def choose_action(self, reward):
        if reward is not None:
            self.rewards.append(reward)

        pmf = self.get_pi()
        go_right = np.random.uniform() <= pmf[1]
        self.actions.append(go_right)

        return go_right

    def episode_end(self, last_reward):
        self.rewards.append(last_reward)

        # learn theta
        G = np.zeros(len(self.rewards))
        G[-1] = self.rewards[-1]

        for i in range(2, len(G) + 1):
            G[-i] = self.gamma * G[-i + 1] + self.rewards[-i]

        gamma_pow = 1

        for i in range(len(G)):
            j = 1 if self.actions[i] else 0
            pmf = self.get_pi()
            grad_ln_pi = self.x[:, j] - np.dot(self.x, pmf)
            update = self.alpha * gamma_pow * G[i] * grad_ln_pi

            self.theta += update
            gamma_pow *= self.gamma

        self.rewards = []
        self.actions = []

class ReinforceBaselineAgent(ReinforceAgent):
    def __init__(self, alpha, gamma, alpha_w):
        super(ReinforceBaselineAgent, self).__init__(alpha, gamma)
        self.alpha_w = alpha_w
        self.w = 0

    def episode_end(self, last_reward):
        self.rewards.append(last_reward)

        # learn theta
        G = np.zeros(len(self.rewards))
        G[-1] = self.rewards[-1]

        for i in range(2, len(G) + 1):
            G[-i] = self.gamma * G[-i + 1] + self.rewards[-i]

        gamma_pow = 1

        for i in range(len(G)):
            self.w += self.alpha_w * gamma_pow * (G[i] - self.w)

            j = 1 if self.actions[i] else 0
            pmf = self.get_pi()
            grad_ln_pi = self.x[:, j] - np.dot(self.x, pmf)
            update = self.alpha * gamma_pow * (G[i] - self.w) * grad_ln_pi

            self.theta += update
            gamma_pow *= self.gamma

        self.rewards = []
        self.actions = []

def trial(num_episodes, agent_generator):
    env = ShortCorridor()
    agent = agent_generator()

    rewards = np.zeros(num_episodes)
    for episode_idx in range(num_episodes):
        rewards_sum = 0
        reward = None
        env.reset()

        while True:
            go_right = agent.choose_action(reward)
            reward, episode_end = env.step(go_right)
            rewards_sum += reward

            if episode_end:
                agent.episode_end(reward)
                break

        rewards[episode_idx] = rewards_sum

    return rewards

def example_13_1():
    epsilon = 0.05
    fig, ax = plt.subplots(1, 1)

    # Plot a graph
    p = np.linspace(0.01, 0.99, 100)
    y = true_value(p)
    ax.plot(p, y, color='red')

    # Find a maximum point, can also be done analytically by taking a derivative
    imax = np.argmax(y)
    pmax = p[imax]
    ymax = y[imax]
    ax.plot(pmax, ymax, color='green', marker="*", label="optimal point: f({0:.2f}) = {1:.2f}".format(pmax, ymax))

    # Plot points of two epsilon-greedy policies
    ax.plot(epsilon, true_value(epsilon), color='magenta', marker="o", label="epsilon-greedy left")
    ax.plot(1 - epsilon, true_value(1 - epsilon), color='blue', marker="o", label="epsilon-greedy right")

    ax.set_ylabel("Value of the first state")
    ax.set_xlabel("Probability of the action 'right'")
    ax.set_title("Short corridor with switched actions")
    ax.set_ylim(ymin=-105.0, ymax=5)
    ax.legend()

    plt.savefig('../images/example_13_1.png')
    plt.close()

def figure_13_1():
    num_trials = 100
    num_episodes = 1000
    gamma = 1
    agent_generators = [lambda: ReinforceAgent(alpha=2e-4, gamma=gamma),
                        lambda: ReinforceAgent(alpha=2e-5, gamma=gamma),
                        lambda: ReinforceAgent(alpha=2e-3, gamma=gamma)]
    labels = ['alpha = 2e-4',
              'alpha = 2e-5',
              'alpha = 2e-3']

    rewards = np.zeros((len(agent_generators), num_trials, num_episodes))

    for agent_index, agent_generator in enumerate(agent_generators):
        for i in tqdm(range(num_trials)):
            reward = trial(num_episodes, agent_generator)
            rewards[agent_index, i, :] = reward

    plt.plot(np.arange(num_episodes) + 1, -11.6 * np.ones(num_episodes), ls='dashed', color='red', label='-11.6')
    for i, label in enumerate(labels):
        plt.plot(np.arange(num_episodes) + 1, rewards[i].mean(axis=0), label=label)
    plt.ylabel('total reward on episode')
    plt.xlabel('episode')
    plt.legend(loc='lower right')

    plt.savefig('../images/figure_13_1.png')
    plt.close()

def figure_13_2():
    num_trials = 100
    num_episodes = 1000
    alpha = 2e-4
    gamma = 1
    agent_generators = [lambda: ReinforceAgent(alpha=alpha, gamma=gamma),
                        lambda: ReinforceBaselineAgent(alpha=alpha*10, gamma=gamma, alpha_w=alpha*100)]
    labels = ['Reinforce without baseline',
              'Reinforce with baseline']

    rewards = np.zeros((len(agent_generators), num_trials, num_episodes))

    for agent_index, agent_generator in enumerate(agent_generators):
        for i in tqdm(range(num_trials)):
            reward = trial(num_episodes, agent_generator)
            rewards[agent_index, i, :] = reward

    plt.plot(np.arange(num_episodes) + 1, -11.6 * np.ones(num_episodes), ls='dashed', color='red', label='-11.6')
    for i, label in enumerate(labels):
        plt.plot(np.arange(num_episodes) + 1, rewards[i].mean(axis=0), label=label)
    plt.ylabel('total reward on episode')
    plt.xlabel('episode')
    plt.legend(loc='lower right')

    plt.savefig('../images/figure_13_2.png')
    plt.close()

if __name__ == '__main__':
    example_13_1()
    figure_13_1()
    figure_13_2()

================================================
FILE: requirements.txt
================================================
numpy
matplotlib
seaborn
tqdm
scipy