[
  {
    "path": ".gitignore",
    "content": ".idea\n*.pyc\nlatex\n*.bin\nextra\n.DS_Store\n.vscode/\n"
  },
  {
    "path": ".travis.yml",
    "content": "language: python\npython:\n  - \"3.6\"\ninstall:\n  - pip install -r requirements.txt\nscript:\n  - ls chapter*/*.py | xargs -n 1 -P 1 python -m py_compile\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2019 Shangtong Zhang\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# Reinforcement Learning: An Introduction\n\n[![Build Status](https://travis-ci.org/ShangtongZhang/reinforcement-learning-an-introduction.svg?branch=master)](https://travis-ci.org/ShangtongZhang/reinforcement-learning-an-introduction)\n\nPython replication for Sutton & Barto's book [*Reinforcement Learning: An Introduction (2nd Edition)*](http://incompleteideas.net/book/the-book-2nd.html)\n\n> If you have any confusion about the code or want to report a bug, please open an issue instead of emailing me directly, and unfortunately I do not have exercise answers for the book.\n\n# Contents \n\n### Chapter 1\n1. Tic-Tac-Toe\n\n### Chapter 2\n1. [Figure 2.1: An exemplary bandit problem from the 10-armed testbed](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_1.png)\n2. [Figure 2.2: Average performance of epsilon-greedy action-value methods on the 10-armed testbed](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_2.png)\n3. [Figure 2.3: Optimistic initial action-value estimates](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_3.png)\n4. [Figure 2.4: Average performance of UCB action selection on the 10-armed testbed](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_4.png)\n5. [Figure 2.5: Average performance of the gradient bandit algorithm](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_5.png)\n6. [Figure 2.6: A parameter study of the various bandit algorithms](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_2_6.png)\n\n### Chapter 3\n1. 
[Figure 3.2: Grid example with random policy](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_3_2.png)\n2. [Figure 3.5: Optimal solutions to the gridworld example](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_3_5.png)\n\n### Chapter 4\n1. [Figure 4.1: Convergence of iterative policy evaluation on a small gridworld](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_4_1.png)\n2. [Figure 4.2: Jack’s car rental problem](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_4_2.png)\n3. [Figure 4.3: The solution to the gambler’s problem](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_4_3.png)\n\n### Chapter 5\n1. [Figure 5.1: Approximate state-value functions for the blackjack policy](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_1.png)\n2. [Figure 5.2: The optimal policy and state-value function for blackjack found by Monte Carlo ES](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_2.png)\n3. [Figure 5.3: Weighted importance sampling](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_3.png)\n4. [Figure 5.4: Ordinary importance sampling with surprisingly unstable estimates](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_5_4.png)\n\n### Chapter 6\n1. [Example 6.2: Random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/example_6_2.png)\n2. 
[Figure 6.2: Batch updating](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_2.png)\n3. [Figure 6.3: Sarsa applied to windy grid world](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_3.png)\n4. [Figure 6.4: The cliff-walking task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_4.png)\n5. [Figure 6.6: Interim and asymptotic performance of TD control methods](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_6.png)\n6. [Figure 6.7: Comparison of Q-learning and Double Q-learning](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_6_7.png)\n\n### Chapter 7\n1. [Figure 7.2: Performance of n-step TD methods on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_7_2.png)\n\n### Chapter 8\n1. [Figure 8.2: Average learning curves for Dyna-Q agents varying in their number of planning steps](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_2.png)\n2. [Figure 8.4: Average performance of Dyna agents on a blocking task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_4.png)\n3. [Figure 8.5: Average performance of Dyna agents on a shortcut task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_5.png)\n4. [Example 8.4: Prioritized sweeping significantly shortens learning time on the Dyna maze task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/example_8_4.png)\n5. 
[Figure 8.7: Comparison of efficiency of expected and sample updates](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_7.png)\n6. [Figure 8.8: Relative efficiency of different update distributions](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_8_8.png)\n\n### Chapter 9\n1. [Figure 9.1: Gradient Monte Carlo algorithm on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_1.png)\n2. [Figure 9.2: Semi-gradient n-steps TD algorithm on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_2.png)\n3. [Figure 9.5: Fourier basis vs polynomials on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_5.png)\n4. [Figure 9.8: Example of feature width’s effect on initial generalization and asymptotic accuracy](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_8.png)\n5. [Figure 9.10: Single tiling and multiple tilings on the 1000-state random walk task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_9_10.png)\n\n### Chapter 10\n1. [Figure 10.1: The cost-to-go function for Mountain Car task in one run](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_1.png)\n2. [Figure 10.2: Learning curves for semi-gradient Sarsa on Mountain Car task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_2.png)\n3. 
[Figure 10.3: One-step vs multi-step performance of semi-gradient Sarsa on the Mountain Car task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_3.png)\n4. [Figure 10.4: Effect of the alpha and n on early performance of n-step semi-gradient Sarsa](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_4.png)\n5. [Figure 10.5: Differential semi-gradient Sarsa on the access-control queuing task](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_10_5.png)\n\n### Chapter 11\n1. [Figure 11.2: Baird's Counterexample](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_11_2.png)\n2. [Figure 11.6: The behavior of the TDC algorithm on Baird’s counterexample](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_11_6.png)\n3. [Figure 11.7: The behavior of the ETD algorithm in expectation on Baird’s counterexample](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_11_7.png)\n\n### Chapter 12\n1. [Figure 12.3: Off-line λ-return algorithm on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_3.png)\n2. [Figure 12.6: TD(λ) algorithm on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_6.png)\n3. [Figure 12.8: True online TD(λ) algorithm on 19-state random walk](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_8.png)\n4. [Figure 12.10: Sarsa(λ) with replacing traces on Mountain Car](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_10.png)\n5. 
[Figure 12.11: Summary comparison of Sarsa(λ) algorithms on Mountain Car](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_12_11.png)\n\n### Chapter 13\n1. [Example 13.1: Short corridor with switched actions](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/example_13_1.png)\n2. [Figure 13.1: REINFORCE on the short-corridor grid world](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_13_1.png)\n3. [Figure 13.2: REINFORCE with baseline on the short-corridor grid-world](https://raw.githubusercontent.com/ShangtongZhang/reinforcement-learning-an-introduction/master/images/figure_13_2.png)\n\n\n# Environment\n* Python 3.6\n* numpy\n* matplotlib\n* [seaborn](https://seaborn.pydata.org/index.html)\n* [tqdm](https://pypi.org/project/tqdm/)\n\n# Usage\n> All files are self-contained\n```commandline\npython any_file_you_want.py\n```\n\n# Contribution\nIf you want to contribute some missing examples or fix some bugs, feel free to open an issue or make a pull request.\n"
  },
  {
    "path": "chapter01/tic_tac_toe.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016 - 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)           #\n# 2016 Jan Hakenberg(jan.hakenberg@gmail.com)                         #\n# 2016 Tian Jun(tianjun.cpp@gmail.com)                                #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport pickle\n\nBOARD_ROWS = 3\nBOARD_COLS = 3\nBOARD_SIZE = BOARD_ROWS * BOARD_COLS\n\n\nclass State:\n    def __init__(self):\n        # the board is represented by an n * n array,\n        # 1 represents a chessman of the player who moves first,\n        # -1 represents a chessman of another player\n        # 0 represents an empty position\n        self.data = np.zeros((BOARD_ROWS, BOARD_COLS))\n        self.winner = None\n        self.hash_val = None\n        self.end = None\n\n    # compute the hash value for one state, it's unique\n    def hash(self):\n        if self.hash_val is None:\n            self.hash_val = 0\n            for i in np.nditer(self.data):\n                self.hash_val = self.hash_val * 3 + i + 1\n        return self.hash_val\n\n    # check whether a player has won the game, or it's a tie\n    def is_end(self):\n        if self.end is not None:\n            return self.end\n        results = []\n        # check row\n        for i in range(BOARD_ROWS):\n            results.append(np.sum(self.data[i, :]))\n        # check columns\n        for i in range(BOARD_COLS):\n            results.append(np.sum(self.data[:, i]))\n\n        # check diagonals\n        trace = 0\n        reverse_trace = 0\n        for i in range(BOARD_ROWS):\n            trace += 
self.data[i, i]\n            reverse_trace += self.data[i, BOARD_ROWS - 1 - i]\n        results.append(trace)\n        results.append(reverse_trace)\n\n        for result in results:\n            if result == 3:\n                self.winner = 1\n                self.end = True\n                return self.end\n            if result == -3:\n                self.winner = -1\n                self.end = True\n                return self.end\n\n        # whether it's a tie\n        sum_values = np.sum(np.abs(self.data))\n        if sum_values == BOARD_SIZE:\n            self.winner = 0\n            self.end = True\n            return self.end\n\n        # game is still going on\n        self.end = False\n        return self.end\n\n    # @symbol: 1 or -1\n    # put chessman symbol in position (i, j)\n    def next_state(self, i, j, symbol):\n        new_state = State()\n        new_state.data = np.copy(self.data)\n        new_state.data[i, j] = symbol\n        return new_state\n\n    # print the board\n    def print_state(self):\n        for i in range(BOARD_ROWS):\n            print('-------------')\n            out = '| '\n            for j in range(BOARD_COLS):\n                if self.data[i, j] == 1:\n                    token = '*'\n                elif self.data[i, j] == -1:\n                    token = 'x'\n                else:\n                    token = '0'\n                out += token + ' | '\n            print(out)\n        print('-------------')\n\n\ndef get_all_states_impl(current_state, current_symbol, all_states):\n    for i in range(BOARD_ROWS):\n        for j in range(BOARD_COLS):\n            if current_state.data[i][j] == 0:\n                new_state = current_state.next_state(i, j, current_symbol)\n                new_hash = new_state.hash()\n                if new_hash not in all_states:\n                    is_end = new_state.is_end()\n                    all_states[new_hash] = (new_state, is_end)\n                    if not is_end:\n            
            get_all_states_impl(new_state, -current_symbol, all_states)\n\n\ndef get_all_states():\n    current_symbol = 1\n    current_state = State()\n    all_states = dict()\n    all_states[current_state.hash()] = (current_state, current_state.is_end())\n    get_all_states_impl(current_state, current_symbol, all_states)\n    return all_states\n\n\n# all possible board configurations\nall_states = get_all_states()\n\n\nclass Judger:\n    # @player1: the player who will move first, its chessman will be 1\n    # @player2: another player with a chessman -1\n    def __init__(self, player1, player2):\n        self.p1 = player1\n        self.p2 = player2\n        self.current_player = None\n        self.p1_symbol = 1\n        self.p2_symbol = -1\n        self.p1.set_symbol(self.p1_symbol)\n        self.p2.set_symbol(self.p2_symbol)\n        self.current_state = State()\n\n    def reset(self):\n        self.p1.reset()\n        self.p2.reset()\n\n    def alternate(self):\n        while True:\n            yield self.p1\n            yield self.p2\n\n    # @print_state: if True, print each board during the game\n    def play(self, print_state=False):\n        alternator = self.alternate()\n        self.reset()\n        current_state = State()\n        self.p1.set_state(current_state)\n        self.p2.set_state(current_state)\n        if print_state:\n            current_state.print_state()\n        while True:\n            player = next(alternator)\n            i, j, symbol = player.act()\n            next_state_hash = current_state.next_state(i, j, symbol).hash()\n            current_state, is_end = all_states[next_state_hash]\n            self.p1.set_state(current_state)\n            self.p2.set_state(current_state)\n            if print_state:\n                current_state.print_state()\n            if is_end:\n                return current_state.winner\n\n\n# AI player\nclass Player:\n    # @step_size: the step size to update estimations\n    # @epsilon: the 
probability to explore\n    def __init__(self, step_size=0.1, epsilon=0.1):\n        self.estimations = dict()\n        self.step_size = step_size\n        self.epsilon = epsilon\n        self.states = []\n        self.greedy = []\n        self.symbol = 0\n\n    def reset(self):\n        self.states = []\n        self.greedy = []\n\n    def set_state(self, state):\n        self.states.append(state)\n        self.greedy.append(True)\n\n    def set_symbol(self, symbol):\n        self.symbol = symbol\n        for hash_val in all_states:\n            state, is_end = all_states[hash_val]\n            if is_end:\n                if state.winner == self.symbol:\n                    self.estimations[hash_val] = 1.0\n                elif state.winner == 0:\n                    # we need to distinguish between a tie and a loss\n                    self.estimations[hash_val] = 0.5\n                else:\n                    self.estimations[hash_val] = 0\n            else:\n                self.estimations[hash_val] = 0.5\n\n    # update value estimation\n    def backup(self):\n        states = [state.hash() for state in self.states]\n\n        for i in reversed(range(len(states) - 1)):\n            state = states[i]\n            td_error = self.greedy[i] * (\n                self.estimations[states[i + 1]] - self.estimations[state]\n            )\n            self.estimations[state] += self.step_size * td_error\n\n    # choose an action based on the state\n    def act(self):\n        state = self.states[-1]\n        next_states = []\n        next_positions = []\n        for i in range(BOARD_ROWS):\n            for j in range(BOARD_COLS):\n                if state.data[i, j] == 0:\n                    next_positions.append([i, j])\n                    next_states.append(state.next_state(\n                        i, j, self.symbol).hash())\n\n        if np.random.rand() < self.epsilon:\n            action = next_positions[np.random.randint(len(next_positions))]\n            
action.append(self.symbol)\n            self.greedy[-1] = False\n            return action\n\n        values = []\n        for hash_val, pos in zip(next_states, next_positions):\n            values.append((self.estimations[hash_val], pos))\n        # shuffle first so that ties between equal-valued actions are broken at random, since Python's sort is stable\n        np.random.shuffle(values)\n        values.sort(key=lambda x: x[0], reverse=True)\n        action = values[0][1]\n        action.append(self.symbol)\n        return action\n\n    def save_policy(self):\n        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:\n            pickle.dump(self.estimations, f)\n\n    def load_policy(self):\n        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:\n            self.estimations = pickle.load(f)\n\n\n# human interface\n# press a key to place a chessman\n# | q | w | e |\n# | a | s | d |\n# | z | x | c |\nclass HumanPlayer:\n    def __init__(self, **kwargs):\n        self.symbol = None\n        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']\n        self.state = None\n\n    def reset(self):\n        pass\n\n    def set_state(self, state):\n        self.state = state\n\n    def set_symbol(self, symbol):\n        self.symbol = symbol\n\n    def act(self):\n        self.state.print_state()\n        key = input(\"Input your position:\")\n        data = self.keys.index(key)\n        i = data // BOARD_COLS\n        j = data % BOARD_COLS\n        return i, j, self.symbol\n\n\ndef train(epochs, print_every_n=500):\n    player1 = Player(epsilon=0.01)\n    player2 = Player(epsilon=0.01)\n    judger = Judger(player1, player2)\n    player1_win = 0.0\n    player2_win = 0.0\n    for i in range(1, epochs + 1):\n        winner = judger.play(print_state=False)\n        if winner == 1:\n            player1_win += 1\n        if winner == -1:\n            player2_win += 1\n        if i % print_every_n == 0:\n            
print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))\n        player1.backup()\n        player2.backup()\n        judger.reset()\n    player1.save_policy()\n    player2.save_policy()\n\n\ndef compete(turns):\n    player1 = Player(epsilon=0)\n    player2 = Player(epsilon=0)\n    judger = Judger(player1, player2)\n    player1.load_policy()\n    player2.load_policy()\n    player1_win = 0.0\n    player2_win = 0.0\n    for _ in range(turns):\n        winner = judger.play()\n        if winner == 1:\n            player1_win += 1\n        if winner == -1:\n            player2_win += 1\n        judger.reset()\n    print('%d turns, player 1 win %.02f, player 2 win %.02f' % (turns, player1_win / turns, player2_win / turns))\n\n\n# The game is a zero sum game. If both players are playing with an optimal strategy, every game will end in a tie.\n# So we test whether the AI can guarantee at least a tie if it goes second.\ndef play():\n    while True:\n        player1 = HumanPlayer()\n        player2 = Player(epsilon=0)\n        judger = Judger(player1, player2)\n        player2.load_policy()\n        winner = judger.play()\n        if winner == player2.symbol:\n            print(\"You lose!\")\n        elif winner == player1.symbol:\n            print(\"You win!\")\n        else:\n            print(\"It is a tie!\")\n\n\nif __name__ == '__main__':\n    train(int(1e5))\n    compete(int(1e3))\n    play()\n"
  },
  {
    "path": "chapter02/ten_armed_testbed.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Tian Jun(tianjun.cpp@gmail.com)                                #\n# 2016 Artem Oboturov(oboturov@gmail.com)                             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport matplotlib\nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom tqdm import trange\n\nmatplotlib.use('Agg')\n\n\nclass Bandit:\n    # @k_arm: # of arms\n    # @epsilon: probability for exploration in epsilon-greedy algorithm\n    # @initial: initial estimation for each action\n    # @step_size: constant step size for updating estimations\n    # @sample_averages: if True, use sample averages to update estimations instead of constant step size\n    # @UCB_param: if not None, use UCB algorithm to select action\n    # @gradient: if True, use gradient based bandit algorithm\n    # @gradient_baseline: if True, use average reward as baseline for gradient based bandit algorithm\n    def __init__(self, k_arm=10, epsilon=0., initial=0., step_size=0.1, sample_averages=False, UCB_param=None,\n                 gradient=False, gradient_baseline=False, true_reward=0.):\n        self.k = k_arm\n        self.step_size = step_size\n        self.sample_averages = sample_averages\n        self.indices = np.arange(self.k)\n        self.time = 0\n        self.UCB_param = UCB_param\n        self.gradient = gradient\n        self.gradient_baseline = gradient_baseline\n        self.average_reward = 0\n        self.true_reward = true_reward\n        self.epsilon = epsilon\n        self.initial = initial\n\n    def 
reset(self):\n        # real reward for each action\n        self.q_true = np.random.randn(self.k) + self.true_reward\n\n        # estimation for each action\n        self.q_estimation = np.zeros(self.k) + self.initial\n\n        # # of chosen times for each action\n        self.action_count = np.zeros(self.k)\n\n        self.best_action = np.argmax(self.q_true)\n\n        self.time = 0\n\n    # get an action for this bandit\n    def act(self):\n        if np.random.rand() < self.epsilon:\n            return np.random.choice(self.indices)\n\n        if self.UCB_param is not None:\n            UCB_estimation = self.q_estimation + \\\n                self.UCB_param * np.sqrt(np.log(self.time + 1) / (self.action_count + 1e-5))\n            q_best = np.max(UCB_estimation)\n            return np.random.choice(np.where(UCB_estimation == q_best)[0])\n\n        if self.gradient:\n            exp_est = np.exp(self.q_estimation)\n            self.action_prob = exp_est / np.sum(exp_est)\n            return np.random.choice(self.indices, p=self.action_prob)\n\n        q_best = np.max(self.q_estimation)\n        return np.random.choice(np.where(self.q_estimation == q_best)[0])\n\n    # take an action, update estimation for this action\n    def step(self, action):\n        # generate the reward under N(real reward, 1)\n        reward = np.random.randn() + self.q_true[action]\n        self.time += 1\n        self.action_count[action] += 1\n        self.average_reward += (reward - self.average_reward) / self.time\n\n        if self.sample_averages:\n            # update estimation using sample averages\n            self.q_estimation[action] += (reward - self.q_estimation[action]) / self.action_count[action]\n        elif self.gradient:\n            one_hot = np.zeros(self.k)\n            one_hot[action] = 1\n            if self.gradient_baseline:\n                baseline = self.average_reward\n            else:\n                baseline = 0\n            self.q_estimation += 
self.step_size * (reward - baseline) * (one_hot - self.action_prob)\n        else:\n            # update estimation with constant step size\n            self.q_estimation[action] += self.step_size * (reward - self.q_estimation[action])\n        return reward\n\n\ndef simulate(runs, time, bandits):\n    rewards = np.zeros((len(bandits), runs, time))\n    best_action_counts = np.zeros(rewards.shape)\n    for i, bandit in enumerate(bandits):\n        for r in trange(runs):\n            bandit.reset()\n            for t in range(time):\n                action = bandit.act()\n                reward = bandit.step(action)\n                rewards[i, r, t] = reward\n                if action == bandit.best_action:\n                    best_action_counts[i, r, t] = 1\n    mean_best_action_counts = best_action_counts.mean(axis=1)\n    mean_rewards = rewards.mean(axis=1)\n    return mean_best_action_counts, mean_rewards\n\n\ndef figure_2_1():\n    plt.violinplot(dataset=np.random.randn(200, 10) + np.random.randn(10))\n    plt.xlabel(\"Action\")\n    plt.ylabel(\"Reward distribution\")\n    plt.savefig('../images/figure_2_1.png')\n    plt.close()\n\n\ndef figure_2_2(runs=2000, time=1000):\n    epsilons = [0, 0.1, 0.01]\n    bandits = [Bandit(epsilon=eps, sample_averages=True) for eps in epsilons]\n    best_action_counts, rewards = simulate(runs, time, bandits)\n\n    plt.figure(figsize=(10, 20))\n\n    plt.subplot(2, 1, 1)\n    for eps, reward in zip(epsilons, rewards):\n        plt.plot(reward, label=r'$\\epsilon = %.02f$' % (eps))\n    plt.xlabel('steps')\n    plt.ylabel('average reward')\n    plt.legend()\n\n    plt.subplot(2, 1, 2)\n    for eps, counts in zip(epsilons, best_action_counts):\n        plt.plot(counts, label=r'$\\epsilon = %.02f$' % (eps))\n    plt.xlabel('steps')\n    plt.ylabel('% optimal action')\n    plt.legend()\n\n    plt.savefig('../images/figure_2_2.png')\n    plt.close()\n\n\ndef figure_2_3(runs=2000, time=1000):\n    bandits = []\n    
bandits.append(Bandit(epsilon=0, initial=5, step_size=0.1))\n    bandits.append(Bandit(epsilon=0.1, initial=0, step_size=0.1))\n    best_action_counts, _ = simulate(runs, time, bandits)\n\n    plt.plot(best_action_counts[0], label=r'$\\epsilon = 0, q = 5$')\n    plt.plot(best_action_counts[1], label=r'$\\epsilon = 0.1, q = 0$')\n    plt.xlabel('Steps')\n    plt.ylabel('% optimal action')\n    plt.legend()\n\n    plt.savefig('../images/figure_2_3.png')\n    plt.close()\n\n\ndef figure_2_4(runs=2000, time=1000):\n    bandits = []\n    bandits.append(Bandit(epsilon=0, UCB_param=2, sample_averages=True))\n    bandits.append(Bandit(epsilon=0.1, sample_averages=True))\n    _, average_rewards = simulate(runs, time, bandits)\n\n    plt.plot(average_rewards[0], label='UCB $c = 2$')\n    plt.plot(average_rewards[1], label=r'epsilon greedy $\\epsilon = 0.1$')\n    plt.xlabel('Steps')\n    plt.ylabel('Average reward')\n    plt.legend()\n\n    plt.savefig('../images/figure_2_4.png')\n    plt.close()\n\n\ndef figure_2_5(runs=2000, time=1000):\n    bandits = []\n    bandits.append(Bandit(gradient=True, step_size=0.1, gradient_baseline=True, true_reward=4))\n    bandits.append(Bandit(gradient=True, step_size=0.1, gradient_baseline=False, true_reward=4))\n    bandits.append(Bandit(gradient=True, step_size=0.4, gradient_baseline=True, true_reward=4))\n    bandits.append(Bandit(gradient=True, step_size=0.4, gradient_baseline=False, true_reward=4))\n    best_action_counts, _ = simulate(runs, time, bandits)\n    labels = [r'$\\alpha = 0.1$, with baseline',\n              r'$\\alpha = 0.1$, without baseline',\n              r'$\\alpha = 0.4$, with baseline',\n              r'$\\alpha = 0.4$, without baseline']\n\n    for i in range(len(bandits)):\n        plt.plot(best_action_counts[i], label=labels[i])\n    plt.xlabel('Steps')\n    plt.ylabel('% Optimal action')\n    plt.legend()\n\n    plt.savefig('../images/figure_2_5.png')\n    plt.close()\n\n\ndef figure_2_6(runs=2000, time=1000):\n   
 labels = ['epsilon-greedy', 'gradient bandit',\n              'UCB', 'optimistic initialization']\n    generators = [lambda epsilon: Bandit(epsilon=epsilon, sample_averages=True),\n                  lambda alpha: Bandit(gradient=True, step_size=alpha, gradient_baseline=True),\n                  lambda coef: Bandit(epsilon=0, UCB_param=coef, sample_averages=True),\n                  lambda initial: Bandit(epsilon=0, initial=initial, step_size=0.1)]\n    parameters = [np.arange(-7, -1, dtype=float),\n                  np.arange(-5, 2, dtype=float),\n                  np.arange(-4, 3, dtype=float),\n                  np.arange(-2, 3, dtype=float)]\n\n    bandits = []\n    for generator, parameter in zip(generators, parameters):\n        for param in parameter:\n            bandits.append(generator(pow(2, param)))\n\n    _, average_rewards = simulate(runs, time, bandits)\n    rewards = np.mean(average_rewards, axis=1)\n\n    i = 0\n    for label, parameter in zip(labels, parameters):\n        l = len(parameter)\n        plt.plot(parameter, rewards[i:i+l], label=label)\n        i += l\n    plt.xlabel('Parameter($2^x$)')\n    plt.ylabel('Average reward')\n    plt.legend()\n\n    plt.savefig('../images/figure_2_6.png')\n    plt.close()\n\n\nif __name__ == '__main__':\n    figure_2_1()\n    figure_2_2()\n    figure_2_3()\n    figure_2_4()\n    figure_2_5()\n    figure_2_6()\n"
  },
  {
    "path": "chapter03/grid_world.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport matplotlib\nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom matplotlib.table import Table\n\nmatplotlib.use('Agg')\n\nWORLD_SIZE = 5\nA_POS = [0, 1]\nA_PRIME_POS = [4, 1]\nB_POS = [0, 3]\nB_PRIME_POS = [2, 3]\nDISCOUNT = 0.9\n\n# left, up, right, down\nACTIONS = [np.array([0, -1]),\n           np.array([-1, 0]),\n           np.array([0, 1]),\n           np.array([1, 0])]\nACTIONS_FIGS=[ '←', '↑', '→', '↓']\n\n\nACTION_PROB = 0.25\n\n\ndef step(state, action):\n    if state == A_POS:\n        return A_PRIME_POS, 10\n    if state == B_POS:\n        return B_PRIME_POS, 5\n\n    next_state = (np.array(state) + action).tolist()\n    x, y = next_state\n    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:\n        reward = -1.0\n        next_state = state\n    else:\n        reward = 0\n    return next_state, reward\n\n\ndef draw_image(image):\n    fig, ax = plt.subplots()\n    ax.set_axis_off()\n    tb = Table(ax, bbox=[0, 0, 1, 1])\n\n    nrows, ncols = image.shape\n    width, height = 1.0 / ncols, 1.0 / nrows\n\n    # Add cells\n    for (i, j), val in np.ndenumerate(image):\n\n        # add state labels\n        if [i, j] == A_POS:\n            val = str(val) + \" (A)\"\n        if [i, j] == A_PRIME_POS:\n            val = str(val) + \" (A')\"\n        if [i, j] == B_POS:\n            val = str(val) + \" (B)\"\n        if [i, j] == B_PRIME_POS:\n            val = str(val) + \" (B')\"\n        \n        tb.add_cell(i, j, 
width, height, text=val,\n                    loc='center', facecolor='white')\n        \n\n    # Row and column labels...\n    for i in range(len(image)):\n        tb.add_cell(i, -1, width, height, text=i+1, loc='right',\n                    edgecolor='none', facecolor='none')\n        tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',\n                    edgecolor='none', facecolor='none')\n\n    ax.add_table(tb)\n\ndef draw_policy(optimal_values):\n    fig, ax = plt.subplots()\n    ax.set_axis_off()\n    tb = Table(ax, bbox=[0, 0, 1, 1])\n\n    nrows, ncols = optimal_values.shape\n    width, height = 1.0 / ncols, 1.0 / nrows\n\n    # Add cells\n    for (i, j), val in np.ndenumerate(optimal_values):\n        next_vals=[]\n        for action in ACTIONS:\n            next_state, _ = step([i, j], action)\n            next_vals.append(optimal_values[next_state[0],next_state[1]])\n\n        best_actions=np.where(next_vals == np.max(next_vals))[0]\n        val=''\n        for ba in best_actions:\n            val+=ACTIONS_FIGS[ba]\n        \n        # add state labels\n        if [i, j] == A_POS:\n            val = str(val) + \" (A)\"\n        if [i, j] == A_PRIME_POS:\n            val = str(val) + \" (A')\"\n        if [i, j] == B_POS:\n            val = str(val) + \" (B)\"\n        if [i, j] == B_PRIME_POS:\n            val = str(val) + \" (B')\"\n        \n        tb.add_cell(i, j, width, height, text=val,\n                loc='center', facecolor='white')\n\n    # Row and column labels...\n    for i in range(len(optimal_values)):\n        tb.add_cell(i, -1, width, height, text=i+1, loc='right',\n                    edgecolor='none', facecolor='none')\n        tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',\n                   edgecolor='none', facecolor='none')\n\n    ax.add_table(tb)\n\n\ndef figure_3_2():\n    value = np.zeros((WORLD_SIZE, WORLD_SIZE))\n    while True:\n        # keep iteration until convergence\n        new_value = 
np.zeros_like(value)\n        for i in range(WORLD_SIZE):\n            for j in range(WORLD_SIZE):\n                for action in ACTIONS:\n                    (next_i, next_j), reward = step([i, j], action)\n                    # bellman equation\n                    new_value[i, j] += ACTION_PROB * (reward + DISCOUNT * value[next_i, next_j])\n        if np.sum(np.abs(value - new_value)) < 1e-4:\n            draw_image(np.round(new_value, decimals=2))\n            plt.savefig('../images/figure_3_2.png')\n            plt.close()\n            break\n        value = new_value\n\ndef figure_3_2_linear_system():\n    '''\n    Here we solve the linear system of equations to find the exact solution.\n    We do this by filling the coefficients for each of the states with their respective right side constant.\n    '''\n    A = -1 * np.eye(WORLD_SIZE * WORLD_SIZE)\n    b = np.zeros(WORLD_SIZE * WORLD_SIZE)\n    for i in range(WORLD_SIZE):\n        for j in range(WORLD_SIZE):\n            s = [i, j]  # current state\n            index_s = np.ravel_multi_index(s, (WORLD_SIZE, WORLD_SIZE))\n            for a in ACTIONS:\n                s_, r = step(s, a)\n                index_s_ = np.ravel_multi_index(s_, (WORLD_SIZE, WORLD_SIZE))\n\n                A[index_s, index_s_] += ACTION_PROB * DISCOUNT\n                b[index_s] -= ACTION_PROB * r\n\n    x = np.linalg.solve(A, b)\n    draw_image(np.round(x.reshape(WORLD_SIZE, WORLD_SIZE), decimals=2))\n    plt.savefig('../images/figure_3_2_linear_system.png')\n    plt.close()\n\ndef figure_3_5():\n    value = np.zeros((WORLD_SIZE, WORLD_SIZE))\n    while True:\n        # keep iteration until convergence\n        new_value = np.zeros_like(value)\n        for i in range(WORLD_SIZE):\n            for j in range(WORLD_SIZE):\n                values = []\n                for action in ACTIONS:\n                    (next_i, next_j), reward = step([i, j], action)\n                    # value iteration\n                    
values.append(reward + DISCOUNT * value[next_i, next_j])\n                new_value[i, j] = np.max(values)\n        if np.sum(np.abs(new_value - value)) < 1e-4:\n            draw_image(np.round(new_value, decimals=2))\n            plt.savefig('../images/figure_3_5.png')\n            plt.close()\n            draw_policy(new_value)\n            plt.savefig('../images/figure_3_5_policy.png')\n            plt.close()\n            break\n        value = new_value\n\n\nif __name__ == '__main__':\n    figure_3_2_linear_system()\n    figure_3_2()\n    figure_3_5()\n"
  },
  {
    "path": "chapter04/car_rental.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016 Shangtong Zhang(zhangshangtong.cpp@gmail.com)                  #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# 2017 Aja Rangaswamy (aja004@gmail.com)                              #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport matplotlib\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport seaborn as sns\nfrom scipy.stats import poisson\n\nmatplotlib.use('Agg')\n\n# maximum # of cars in each location\nMAX_CARS = 20\n\n# maximum # of cars to move during night\nMAX_MOVE_OF_CARS = 5\n\n# expectation for rental requests in first location\nRENTAL_REQUEST_FIRST_LOC = 3\n\n# expectation for rental requests in second location\nRENTAL_REQUEST_SECOND_LOC = 4\n\n# expectation for # of cars returned in first location\nRETURNS_FIRST_LOC = 3\n\n# expectation for # of cars returned in second location\nRETURNS_SECOND_LOC = 2\n\nDISCOUNT = 0.9\n\n# credit earned by a car\nRENTAL_CREDIT = 10\n\n# cost of moving a car\nMOVE_CAR_COST = 2\n\n# all possible actions\nactions = np.arange(-MAX_MOVE_OF_CARS, MAX_MOVE_OF_CARS + 1)\n\n# An up bound for poisson distribution\n# If n is greater than this value, then the probability of getting n is truncated to 0\nPOISSON_UPPER_BOUND = 11\n\n# Probability for poisson distribution\n# @lam: lambda should be less than 10 for this function\npoisson_cache = dict()\n\n\ndef poisson_probability(n, lam):\n    global poisson_cache\n    key = n * 10 + lam\n    if key not in poisson_cache:\n        poisson_cache[key] = poisson.pmf(n, lam)\n    return poisson_cache[key]\n\n\ndef expected_return(state, action, state_value, constant_returned_cars):\n    \"\"\"\n    @state: [# 
of cars in first location, # of cars in second location]\n    @action: positive if moving cars from first location to second location,\n            negative if moving cars from second location to first location\n    @state_value: state value matrix\n    @constant_returned_cars:  if set True, model is simplified such that\n    the # of cars returned in daytime becomes constant\n    rather than a random value from poisson distribution, which will reduce calculation time\n    and leave the optimal policy/value state matrix almost the same\n    \"\"\"\n    # initialize total return\n    returns = 0.0\n\n    # cost for moving cars\n    returns -= MOVE_CAR_COST * abs(action)\n\n    # moving cars\n    NUM_OF_CARS_FIRST_LOC = min(state[0] - action, MAX_CARS)\n    NUM_OF_CARS_SECOND_LOC = min(state[1] + action, MAX_CARS)\n\n    # go through all possible rental requests\n    for rental_request_first_loc in range(POISSON_UPPER_BOUND):\n        for rental_request_second_loc in range(POISSON_UPPER_BOUND):\n            # probability for current combination of rental requests\n            prob = poisson_probability(rental_request_first_loc, RENTAL_REQUEST_FIRST_LOC) * \\\n                poisson_probability(rental_request_second_loc, RENTAL_REQUEST_SECOND_LOC)\n\n            num_of_cars_first_loc = NUM_OF_CARS_FIRST_LOC\n            num_of_cars_second_loc = NUM_OF_CARS_SECOND_LOC\n\n            # valid rental requests should be less than actual # of cars\n            valid_rental_first_loc = min(num_of_cars_first_loc, rental_request_first_loc)\n            valid_rental_second_loc = min(num_of_cars_second_loc, rental_request_second_loc)\n\n            # get credits for renting\n            reward = (valid_rental_first_loc + valid_rental_second_loc) * RENTAL_CREDIT\n            num_of_cars_first_loc -= valid_rental_first_loc\n            num_of_cars_second_loc -= valid_rental_second_loc\n\n            if constant_returned_cars:\n                # get returned cars, those cars can be 
used for renting tomorrow\n                returned_cars_first_loc = RETURNS_FIRST_LOC\n                returned_cars_second_loc = RETURNS_SECOND_LOC\n                num_of_cars_first_loc = min(num_of_cars_first_loc + returned_cars_first_loc, MAX_CARS)\n                num_of_cars_second_loc = min(num_of_cars_second_loc + returned_cars_second_loc, MAX_CARS)\n                returns += prob * (reward + DISCOUNT * state_value[num_of_cars_first_loc, num_of_cars_second_loc])\n            else:\n                for returned_cars_first_loc in range(POISSON_UPPER_BOUND):\n                    for returned_cars_second_loc in range(POISSON_UPPER_BOUND):\n                        prob_return = poisson_probability(\n                            returned_cars_first_loc, RETURNS_FIRST_LOC) * poisson_probability(returned_cars_second_loc, RETURNS_SECOND_LOC)\n                        num_of_cars_first_loc_ = min(num_of_cars_first_loc + returned_cars_first_loc, MAX_CARS)\n                        num_of_cars_second_loc_ = min(num_of_cars_second_loc + returned_cars_second_loc, MAX_CARS)\n                        prob_ = prob_return * prob\n                        returns += prob_ * (reward + DISCOUNT *\n                                            state_value[num_of_cars_first_loc_, num_of_cars_second_loc_])\n    return returns\n\n\ndef figure_4_2(constant_returned_cars=True):\n    value = np.zeros((MAX_CARS + 1, MAX_CARS + 1))\n    policy = np.zeros(value.shape, dtype=int)\n\n    iterations = 0\n    _, axes = plt.subplots(2, 3, figsize=(40, 20))\n    plt.subplots_adjust(wspace=0.1, hspace=0.2)\n    axes = axes.flatten()\n    while True:\n        fig = sns.heatmap(np.flipud(policy), cmap=\"YlGnBu\", ax=axes[iterations])\n        fig.set_ylabel('# cars at first location', fontsize=30)\n        fig.set_yticks(list(reversed(range(MAX_CARS + 1))))\n        fig.set_xlabel('# cars at second location', fontsize=30)\n        fig.set_title('policy {}'.format(iterations), fontsize=30)\n\n       
 # policy evaluation (in-place)\n        while True:\n            old_value = value.copy()\n            for i in range(MAX_CARS + 1):\n                for j in range(MAX_CARS + 1):\n                    new_state_value = expected_return([i, j], policy[i, j], value, constant_returned_cars)\n                    value[i, j] = new_state_value\n            max_value_change = abs(old_value - value).max()\n            print('max value change {}'.format(max_value_change))\n            if max_value_change < 1e-4:\n                break\n\n        # policy improvement\n        policy_stable = True\n        for i in range(MAX_CARS + 1):\n            for j in range(MAX_CARS + 1):\n                old_action = policy[i, j]\n                action_returns = []\n                for action in actions:\n                    if (0 <= action <= i) or (-j <= action <= 0):\n                        action_returns.append(expected_return([i, j], action, value, constant_returned_cars))\n                    else:\n                        action_returns.append(-np.inf)\n                new_action = actions[np.argmax(action_returns)]\n                policy[i, j] = new_action\n                if policy_stable and old_action != new_action:\n                    policy_stable = False\n        print('policy stable {}'.format(policy_stable))\n\n        if policy_stable:\n            fig = sns.heatmap(np.flipud(value), cmap=\"YlGnBu\", ax=axes[-1])\n            fig.set_ylabel('# cars at first location', fontsize=30)\n            fig.set_yticks(list(reversed(range(MAX_CARS + 1))))\n            fig.set_xlabel('# cars at second location', fontsize=30)\n            fig.set_title('optimal value', fontsize=30)\n            break\n\n        iterations += 1\n\n    plt.savefig('../images/figure_4_2.png')\n    plt.close()\n\n\nif __name__ == '__main__':\n    figure_4_2()\n"
  },
  {
    "path": "chapter04/car_rental_synchronous.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016 Shangtong Zhang(zhangshangtong.cpp@gmail.com)                  #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# 2017 Aja Rangaswamy (aja004@gmail.com)                              #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\n# This file is contributed by Tahsincan Köse which implements a synchronous policy evaluation, while the car_rental.py\n# implements an asynchronous policy evaluation. This file also utilizes multi-processing for acceleration and contains\n# an answer to Exercise 4.5\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport math\nimport tqdm\nimport multiprocessing as mp\nfrom functools import partial\nimport time\nimport itertools\n\n############# PROBLEM SPECIFIC CONSTANTS #######################\nMAX_CARS = 20\nMAX_MOVE = 5\nMOVE_COST = -2\nADDITIONAL_PARK_COST = -4\n\nRENT_REWARD = 10\n# expectation for rental requests in first location\nRENTAL_REQUEST_FIRST_LOC = 3\n# expectation for rental requests in second location\nRENTAL_REQUEST_SECOND_LOC = 4\n# expectation for # of cars returned in first location\nRETURNS_FIRST_LOC = 3\n# expectation for # of cars returned in second location\nRETURNS_SECOND_LOC = 2\n################################################################\n\npoisson_cache = dict()\n\n\ndef poisson(n, lam):\n    global poisson_cache\n    key = n * 10 + lam\n    if key not in poisson_cache.keys():\n        poisson_cache[key] = math.exp(-lam) * math.pow(lam, n) / math.factorial(n)\n    return poisson_cache[key]\n\n\nclass PolicyIteration:\n    def __init__(self, truncate, parallel_processes, delta=1e-2, gamma=0.9, solve_4_5=False):\n        
self.TRUNCATE = truncate\n        self.NR_PARALLEL_PROCESSES = parallel_processes\n        self.actions = np.arange(-MAX_MOVE, MAX_MOVE + 1)\n        self.inverse_actions = {el: ind[0] for ind, el in np.ndenumerate(self.actions)}\n        self.values = np.zeros((MAX_CARS + 1, MAX_CARS + 1))\n        self.policy = np.zeros(self.values.shape, dtype=int)\n        self.delta = delta\n        self.gamma = gamma\n        self.solve_extension = solve_4_5\n\n    def solve(self):\n        iterations = 0\n        total_start_time = time.time()\n        while True:\n            start_time = time.time()\n            self.values = self.policy_evaluation(self.values, self.policy)\n            elapsed_time = time.time() - start_time\n            print(f'PE => Elapsed time {elapsed_time} seconds')\n            start_time = time.time()\n\n            policy_change, self.policy = self.policy_improvement(self.actions, self.values, self.policy)\n            elapsed_time = time.time() - start_time\n            print(f'PI => Elapsed time {elapsed_time} seconds')\n            if policy_change == 0:\n                break\n            iterations += 1\n        total_elapsed_time = time.time() - total_start_time\n        print(f'Optimal policy is reached after {iterations} iterations in {total_elapsed_time} seconds')\n\n    # out-of-place\n    def policy_evaluation(self, values, policy):\n\n        global MAX_CARS\n        while True:\n            new_values = np.copy(values)\n            k = np.arange(MAX_CARS + 1)\n            # cartesian product\n            all_states = ((i, j) for i, j in itertools.product(k, k))\n\n            results = []\n            with mp.Pool(processes=self.NR_PARALLEL_PROCESSES) as p:\n                cook = partial(self.expected_return_pe, policy, values)\n                results = p.map(cook, all_states)\n\n            for v, i, j in results:\n                new_values[i, j] = v\n\n            difference = np.abs(new_values - values).sum()\n            
print(f'Difference: {difference}')\n            values = new_values\n            if difference < self.delta:\n                print('Values have converged!')\n                return values\n\n    def policy_improvement(self, actions, values, policy):\n        new_policy = np.copy(policy)\n\n        expected_action_returns = np.zeros((MAX_CARS + 1, MAX_CARS + 1, np.size(actions)))\n        cooks = dict()\n        with mp.Pool(processes=self.NR_PARALLEL_PROCESSES) as p:\n            for action in actions:\n                k = np.arange(MAX_CARS + 1)\n                all_states = ((i, j) for i, j in itertools.product(k, k))\n                cooks[action] = partial(self.expected_return_pi, values, action)\n                results = p.map(cooks[action], all_states)\n                for v, i, j, a in results:\n                    expected_action_returns[i, j, self.inverse_actions[a]] = v\n        for i in range(expected_action_returns.shape[0]):\n            for j in range(expected_action_returns.shape[1]):\n                new_policy[i, j] = actions[np.argmax(expected_action_returns[i, j])]\n\n        policy_change = (new_policy != policy).sum()\n        print(f'Policy changed in {policy_change} states')\n        return policy_change, new_policy\n\n    # O(n^4) computation for all possible requests and returns\n    def bellman(self, values, action, state):\n        expected_return = 0\n        if self.solve_extension:\n            if action > 0:\n                # Free shuttle to the second location\n                expected_return += MOVE_COST * (action - 1)\n            else:\n                expected_return += MOVE_COST * abs(action)\n        else:\n            expected_return += MOVE_COST * abs(action)\n\n        for req1 in range(0, self.TRUNCATE):\n            for req2 in range(0, self.TRUNCATE):\n                # moving cars\n                num_of_cars_first_loc = int(min(state[0] - action, MAX_CARS))\n                num_of_cars_second_loc = int(min(state[1] + action, MAX_CARS))\n\n   
             # valid rental requests should be less than actual # of cars\n                real_rental_first_loc = min(num_of_cars_first_loc, req1)\n                real_rental_second_loc = min(num_of_cars_second_loc, req2)\n\n                # get credits for renting\n                reward = (real_rental_first_loc + real_rental_second_loc) * RENT_REWARD\n\n                if self.solve_extension:\n                    if num_of_cars_first_loc >= 10:\n                        reward += ADDITIONAL_PARK_COST\n                    if num_of_cars_second_loc >= 10:\n                        reward += ADDITIONAL_PARK_COST\n\n                num_of_cars_first_loc -= real_rental_first_loc\n                num_of_cars_second_loc -= real_rental_second_loc\n\n                # probability for current combination of rental requests\n                prob = poisson(req1, RENTAL_REQUEST_FIRST_LOC) * \\\n                       poisson(req2, RENTAL_REQUEST_SECOND_LOC)\n                for ret1 in range(0, self.TRUNCATE):\n                    for ret2 in range(0, self.TRUNCATE):\n                        num_of_cars_first_loc_ = min(num_of_cars_first_loc + ret1, MAX_CARS)\n                        num_of_cars_second_loc_ = min(num_of_cars_second_loc + ret2, MAX_CARS)\n                        prob_ = poisson(ret1, RETURNS_FIRST_LOC) * \\\n                                poisson(ret2, RETURNS_SECOND_LOC) * prob\n                        # Classic Bellman equation for state-value\n                        # prob_ corresponds to p(s'|s,a) for each possible s' -> (num_of_cars_first_loc_,num_of_cars_second_loc_)\n                        expected_return += prob_ * (\n                                reward + self.gamma * values[num_of_cars_first_loc_, num_of_cars_second_loc_])\n        return expected_return\n\n    # Parallelization enforced different helper functions\n    # Expected return calculator for Policy Evaluation\n    def expected_return_pe(self, policy, values, state):\n\n        action 
= policy[state[0], state[1]]\n        expected_return = self.bellman(values, action, state)\n        return expected_return, state[0], state[1]\n\n    # Expected return calculator for Policy Improvement\n    def expected_return_pi(self, values, action, state):\n\n        # an action is infeasible if it moves more cars than the source location holds\n        if not ((action >= 0 and state[0] >= action) or (action < 0 and state[1] >= abs(action))):\n            return -float('inf'), state[0], state[1], action\n        expected_return = self.bellman(values, action, state)\n        return expected_return, state[0], state[1], action\n\n    def plot(self):\n        print(self.policy)\n        plt.figure()\n        plt.xlim(0, MAX_CARS + 1)\n        plt.ylim(0, MAX_CARS + 1)\n        plt.table(cellText=np.flipud(self.policy), loc=(0, 0), cellLoc='center')\n        plt.show()\n\n\nif __name__ == '__main__':\n    TRUNCATE = 9\n    solver = PolicyIteration(TRUNCATE, parallel_processes=4, delta=1e-1, gamma=0.9, solve_4_5=True)\n    solver.solve()\n    solver.plot()\n"
  },
  {
    "path": "chapter04/gamblers_problem.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport matplotlib\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nmatplotlib.use('Agg')\n\n# goal\nGOAL = 100\n\n# all states, including state 0 and state 100\nSTATES = np.arange(GOAL + 1)\n\n# probability of head\nHEAD_PROB = 0.4\n\n\ndef figure_4_3():\n    # state value\n    state_value = np.zeros(GOAL + 1)\n    state_value[GOAL] = 1.0\n\n    sweeps_history = []\n\n    # value iteration\n    while True:\n        old_state_value = state_value.copy()\n        sweeps_history.append(old_state_value)\n\n        for state in STATES[1:GOAL]:\n            # get possilbe actions for current state\n            actions = np.arange(min(state, GOAL - state) + 1)\n            action_returns = []\n            for action in actions:\n                action_returns.append(\n                    HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action])\n            new_value = np.max(action_returns)\n            state_value[state] = new_value\n        delta = abs(state_value - old_state_value).max()\n        if delta < 1e-9:\n            sweeps_history.append(state_value)\n            break\n\n    # compute the optimal policy\n    policy = np.zeros(GOAL + 1)\n    for state in STATES[1:GOAL]:\n        actions = np.arange(min(state, GOAL - state) + 1)\n        action_returns = []\n        for action in actions:\n            action_returns.append(\n                HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * 
state_value[state - action])\n\n        # round to resemble the figure in the book, see\n        # https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/issues/83\n        policy[state] = actions[np.argmax(np.round(action_returns[1:], 5)) + 1]\n\n    plt.figure(figsize=(10, 20))\n\n    plt.subplot(2, 1, 1)\n    for sweep, state_value in enumerate(sweeps_history):\n        plt.plot(state_value, label='sweep {}'.format(sweep))\n    plt.xlabel('Capital')\n    plt.ylabel('Value estimates')\n    plt.legend(loc='best')\n\n    plt.subplot(2, 1, 2)\n    plt.scatter(STATES, policy)\n    plt.xlabel('Capital')\n    plt.ylabel('Final policy (stake)')\n\n    plt.savefig('../images/figure_4_3.png')\n    plt.close()\n\n\nif __name__ == '__main__':\n    figure_4_3()\n"
  },
  {
    "path": "chapter04/grid_world.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport matplotlib\nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom matplotlib.table import Table\n\nmatplotlib.use('Agg')\n\nWORLD_SIZE = 4\n# left, up, right, down\nACTIONS = [np.array([0, -1]),\n           np.array([-1, 0]),\n           np.array([0, 1]),\n           np.array([1, 0])]\nACTION_PROB = 0.25\n\n\ndef is_terminal(state):\n    x, y = state\n    return (x == 0 and y == 0) or (x == WORLD_SIZE - 1 and y == WORLD_SIZE - 1)\n\n\ndef step(state, action):\n    if is_terminal(state):\n        return state, 0\n\n    next_state = (np.array(state) + action).tolist()\n    x, y = next_state\n\n    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:\n        next_state = state\n\n    reward = -1\n    return next_state, reward\n\n\ndef draw_image(image):\n    fig, ax = plt.subplots()\n    ax.set_axis_off()\n    tb = Table(ax, bbox=[0, 0, 1, 1])\n\n    nrows, ncols = image.shape\n    width, height = 1.0 / ncols, 1.0 / nrows\n\n    # Add cells\n    for (i, j), val in np.ndenumerate(image):\n        tb.add_cell(i, j, width, height, text=val,\n                    loc='center', facecolor='white')\n\n        # Row and column labels...\n    for i in range(len(image)):\n        tb.add_cell(i, -1, width, height, text=i+1, loc='right',\n                    edgecolor='none', facecolor='none')\n        tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',\n                    edgecolor='none', facecolor='none')\n    
ax.add_table(tb)\n\n\ndef compute_state_value(in_place=True, discount=1.0):\n    new_state_values = np.zeros((WORLD_SIZE, WORLD_SIZE))\n    iteration = 0\n    while True:\n        if in_place:\n            state_values = new_state_values\n        else:\n            state_values = new_state_values.copy()\n        old_state_values = state_values.copy()\n\n        for i in range(WORLD_SIZE):\n            for j in range(WORLD_SIZE):\n                value = 0\n                for action in ACTIONS:\n                    (next_i, next_j), reward = step([i, j], action)\n                    value += ACTION_PROB * (reward + discount * state_values[next_i, next_j])\n                new_state_values[i, j] = value\n\n        max_delta_value = abs(old_state_values - new_state_values).max()\n        if max_delta_value < 1e-4:\n            break\n\n        iteration += 1\n\n    return new_state_values, iteration\n\n\ndef figure_4_1():\n    # While the author suggests using in-place iterative policy evaluation,\n    # Figure 4.1 actually uses the out-of-place version.\n    _, async_iteration = compute_state_value(in_place=True)\n    values, sync_iteration = compute_state_value(in_place=False)\n    draw_image(np.round(values, decimals=2))\n    print('In-place: {} iterations'.format(async_iteration))\n    print('Synchronous: {} iterations'.format(sync_iteration))\n\n    plt.savefig('../images/figure_4_1.png')\n    plt.close()\n\n\nif __name__ == '__main__':\n    figure_4_1()\n"
  },
  {
    "path": "chapter05/blackjack.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# 2017 Nicky van Foreest(vanforeest@gmail.com)                        #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom tqdm import tqdm\n\n# actions: hit or stand\nACTION_HIT = 0\nACTION_STAND = 1  #  \"strike\" in the book\nACTIONS = [ACTION_HIT, ACTION_STAND]\n\n# policy for player\nPOLICY_PLAYER = np.zeros(22, dtype=np.int)\nfor i in range(12, 20):\n    POLICY_PLAYER[i] = ACTION_HIT\nPOLICY_PLAYER[20] = ACTION_STAND\nPOLICY_PLAYER[21] = ACTION_STAND\n\n# function form of target policy of player\ndef target_policy_player(usable_ace_player, player_sum, dealer_card):\n    return POLICY_PLAYER[player_sum]\n\n# function form of behavior policy of player\ndef behavior_policy_player(usable_ace_player, player_sum, dealer_card):\n    if np.random.binomial(1, 0.5) == 1:\n        return ACTION_STAND\n    return ACTION_HIT\n\n# policy for dealer\nPOLICY_DEALER = np.zeros(22)\nfor i in range(12, 17):\n    POLICY_DEALER[i] = ACTION_HIT\nfor i in range(17, 22):\n    POLICY_DEALER[i] = ACTION_STAND\n\n# get a new card\ndef get_card():\n    card = np.random.randint(1, 14)\n    card = min(card, 10)\n    return card\n\n# get the value of a card (11 for ace).\ndef card_value(card_id):\n    return 11 if card_id == 1 else card_id\n\n# play a game\n# @policy_player: specify policy for player\n# @initial_state: [whether player has a usable Ace, sum of player's cards, one card of dealer]\n# 
@initial_action: the initial action\ndef play(policy_player, initial_state=None, initial_action=None):\n    # player status\n\n    # sum of player\n    player_sum = 0\n\n    # trajectory of player\n    player_trajectory = []\n\n    # whether player uses Ace as 11\n    usable_ace_player = False\n\n    # dealer status\n    dealer_card1 = 0\n    dealer_card2 = 0\n    usable_ace_dealer = False\n\n    if initial_state is None:\n        # generate a random initial state\n\n        while player_sum < 12:\n            # if sum of player is less than 12, always hit\n            card = get_card()\n            player_sum += card_value(card)\n\n            # If the player's sum is larger than 21, he may hold one or two aces.\n            if player_sum > 21:\n                assert player_sum == 22\n                # last card must be ace\n                player_sum -= 10\n            else:\n                usable_ace_player |= (1 == card)\n\n        # initialize cards of dealer, suppose dealer will show the first card he gets\n        dealer_card1 = get_card()\n        dealer_card2 = get_card()\n\n    else:\n        # use specified initial state\n        usable_ace_player, player_sum, dealer_card1 = initial_state\n        dealer_card2 = get_card()\n\n    # initial state of the game\n    state = [usable_ace_player, player_sum, dealer_card1]\n\n    # initialize dealer's sum\n    dealer_sum = card_value(dealer_card1) + card_value(dealer_card2)\n    usable_ace_dealer = 1 in (dealer_card1, dealer_card2)\n    # if the dealer's sum is larger than 21, he must hold two aces.\n    if dealer_sum > 21:\n        assert dealer_sum == 22\n        # use one Ace as 1 rather than 11\n        dealer_sum -= 10\n    assert dealer_sum <= 21\n    assert player_sum <= 21\n\n    # game starts!\n\n    # player's turn\n    while True:\n        if initial_action is not None:\n            action = initial_action\n            initial_action = None\n        else:\n            # get action based on current 
sum\n            action = policy_player(usable_ace_player, player_sum, dealer_card1)\n\n        # track player's trajectory for importance sampling\n        player_trajectory.append([(usable_ace_player, player_sum, dealer_card1), action])\n\n        if action == ACTION_STAND:\n            break\n        # if hit, get new card\n        card = get_card()\n        # Keep track of the ace count. the usable_ace_player flag is insufficient alone as it cannot\n        # distinguish between having one ace or two.\n        ace_count = int(usable_ace_player)\n        if card == 1:\n            ace_count += 1\n        player_sum += card_value(card)\n        # If the player has a usable ace, use it as 1 to avoid busting and continue.\n        while player_sum > 21 and ace_count:\n            player_sum -= 10\n            ace_count -= 1\n        # player busts\n        if player_sum > 21:\n            return state, -1, player_trajectory\n        assert player_sum <= 21\n        usable_ace_player = (ace_count == 1)\n\n    # dealer's turn\n    while True:\n        # get action based on current sum\n        action = POLICY_DEALER[dealer_sum]\n        if action == ACTION_STAND:\n            break\n        # if hit, get a new card\n        new_card = get_card()\n        ace_count = int(usable_ace_dealer)\n        if new_card == 1:\n            ace_count += 1\n        dealer_sum += card_value(new_card)\n        # If the dealer has a usable ace, use it as 1 to avoid busting and continue.\n        while dealer_sum > 21 and ace_count:\n            dealer_sum -= 10\n            ace_count -= 1\n        # dealer busts\n        if dealer_sum > 21:\n            return state, 1, player_trajectory\n        usable_ace_dealer = (ace_count == 1)\n\n    # compare the sum between player and dealer\n    assert player_sum <= 21 and dealer_sum <= 21\n    if player_sum > dealer_sum:\n        return state, 1, player_trajectory\n    elif player_sum == dealer_sum:\n        return state, 0, 
player_trajectory\n    else:\n        return state, -1, player_trajectory\n\n# Monte Carlo Sample with On-Policy\ndef monte_carlo_on_policy(episodes):\n    states_usable_ace = np.zeros((10, 10))\n    # initialze counts to 1 to avoid 0 being divided\n    states_usable_ace_count = np.ones((10, 10))\n    states_no_usable_ace = np.zeros((10, 10))\n    # initialze counts to 1 to avoid 0 being divided\n    states_no_usable_ace_count = np.ones((10, 10))\n    for i in tqdm(range(0, episodes)):\n        _, reward, player_trajectory = play(target_policy_player)\n        for (usable_ace, player_sum, dealer_card), _ in player_trajectory:\n            player_sum -= 12\n            dealer_card -= 1\n            if usable_ace:\n                states_usable_ace_count[player_sum, dealer_card] += 1\n                states_usable_ace[player_sum, dealer_card] += reward\n            else:\n                states_no_usable_ace_count[player_sum, dealer_card] += 1\n                states_no_usable_ace[player_sum, dealer_card] += reward\n    return states_usable_ace / states_usable_ace_count, states_no_usable_ace / states_no_usable_ace_count\n\n# Monte Carlo with Exploring Starts\ndef monte_carlo_es(episodes):\n    # (playerSum, dealerCard, usableAce, action)\n    state_action_values = np.zeros((10, 10, 2, 2))\n    # initialze counts to 1 to avoid division by 0\n    state_action_pair_count = np.ones((10, 10, 2, 2))\n\n    # behavior policy is greedy\n    def behavior_policy(usable_ace, player_sum, dealer_card):\n        usable_ace = int(usable_ace)\n        player_sum -= 12\n        dealer_card -= 1\n        # get argmax of the average returns(s, a)\n        values_ = state_action_values[player_sum, dealer_card, usable_ace, :] / \\\n                  state_action_pair_count[player_sum, dealer_card, usable_ace, :]\n        return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])\n\n    # play for several episodes\n    for episode in 
tqdm(range(episodes)):\n        # for each episode, use a randomly initialized state and action\n        initial_state = [bool(np.random.choice([0, 1])),\n                       np.random.choice(range(12, 22)),\n                       np.random.choice(range(1, 11))]\n        initial_action = np.random.choice(ACTIONS)\n        current_policy = behavior_policy if episode else target_policy_player\n        _, reward, trajectory = play(current_policy, initial_state, initial_action)\n        first_visit_check = set()\n        for (usable_ace, player_sum, dealer_card), action in trajectory:\n            usable_ace = int(usable_ace)\n            player_sum -= 12\n            dealer_card -= 1\n            state_action = (usable_ace, player_sum, dealer_card, action)\n            if state_action in first_visit_check:\n                continue\n            first_visit_check.add(state_action)\n            # update values of state-action pairs\n            state_action_values[player_sum, dealer_card, usable_ace, action] += reward\n            state_action_pair_count[player_sum, dealer_card, usable_ace, action] += 1\n\n    return state_action_values / state_action_pair_count\n\n# Monte Carlo Sample with Off-Policy\ndef monte_carlo_off_policy(episodes):\n    initial_state = [True, 13, 2]\n\n    rhos = []\n    returns = []\n\n    for i in range(0, episodes):\n        _, reward, player_trajectory = play(behavior_policy_player, initial_state=initial_state)\n\n        # get the importance ratio\n        numerator = 1.0\n        denominator = 1.0\n        for (usable_ace, player_sum, dealer_card), action in player_trajectory:\n            if action == target_policy_player(usable_ace, player_sum, dealer_card):\n                denominator *= 0.5\n            else:\n                numerator = 0.0\n                break\n        rho = numerator / denominator\n        rhos.append(rho)\n        returns.append(reward)\n\n    rhos = np.asarray(rhos)\n    returns = np.asarray(returns)\n    
weighted_returns = rhos * returns\n\n    weighted_returns = np.add.accumulate(weighted_returns)\n    rhos = np.add.accumulate(rhos)\n\n    ordinary_sampling = weighted_returns / np.arange(1, episodes + 1)\n\n    with np.errstate(divide='ignore',invalid='ignore'):\n        weighted_sampling = np.where(rhos != 0, weighted_returns / rhos, 0)\n\n    return ordinary_sampling, weighted_sampling\n\ndef figure_5_1():\n    states_usable_ace_1, states_no_usable_ace_1 = monte_carlo_on_policy(10000)\n    states_usable_ace_2, states_no_usable_ace_2 = monte_carlo_on_policy(500000)\n\n    states = [states_usable_ace_1,\n              states_usable_ace_2,\n              states_no_usable_ace_1,\n              states_no_usable_ace_2]\n\n    titles = ['Usable Ace, 10000 Episodes',\n              'Usable Ace, 500000 Episodes',\n              'No Usable Ace, 10000 Episodes',\n              'No Usable Ace, 500000 Episodes']\n\n    _, axes = plt.subplots(2, 2, figsize=(40, 30))\n    plt.subplots_adjust(wspace=0.1, hspace=0.2)\n    axes = axes.flatten()\n\n    for state, title, axis in zip(states, titles, axes):\n        fig = sns.heatmap(np.flipud(state), cmap=\"YlGnBu\", ax=axis, xticklabels=range(1, 11),\n                          yticklabels=list(reversed(range(12, 22))))\n        fig.set_ylabel('player sum', fontsize=30)\n        fig.set_xlabel('dealer showing', fontsize=30)\n        fig.set_title(title, fontsize=30)\n\n    plt.savefig('../images/figure_5_1.png')\n    plt.close()\n\ndef figure_5_2():\n    state_action_values = monte_carlo_es(500000)\n\n    state_value_no_usable_ace = np.max(state_action_values[:, :, 0, :], axis=-1)\n    state_value_usable_ace = np.max(state_action_values[:, :, 1, :], axis=-1)\n\n    # get the optimal policy\n    action_no_usable_ace = np.argmax(state_action_values[:, :, 0, :], axis=-1)\n    action_usable_ace = np.argmax(state_action_values[:, :, 1, :], axis=-1)\n\n    images = [action_usable_ace,\n              state_value_usable_ace,\n              
action_no_usable_ace,\n              state_value_no_usable_ace]\n\n    titles = ['Optimal policy with usable Ace',\n              'Optimal value with usable Ace',\n              'Optimal policy without usable Ace',\n              'Optimal value without usable Ace']\n\n    _, axes = plt.subplots(2, 2, figsize=(40, 30))\n    plt.subplots_adjust(wspace=0.1, hspace=0.2)\n    axes = axes.flatten()\n\n    for image, title, axis in zip(images, titles, axes):\n        fig = sns.heatmap(np.flipud(image), cmap=\"YlGnBu\", ax=axis, xticklabels=range(1, 11),\n                          yticklabels=list(reversed(range(12, 22))))\n        fig.set_ylabel('player sum', fontsize=30)\n        fig.set_xlabel('dealer showing', fontsize=30)\n        fig.set_title(title, fontsize=30)\n\n    plt.savefig('../images/figure_5_2.png')\n    plt.close()\n\ndef figure_5_3():\n    true_value = -0.27726\n    episodes = 10000\n    runs = 100\n    error_ordinary = np.zeros(episodes)\n    error_weighted = np.zeros(episodes)\n    for i in tqdm(range(0, runs)):\n        ordinary_sampling_, weighted_sampling_ = monte_carlo_off_policy(episodes)\n        # get the squared error\n        error_ordinary += np.power(ordinary_sampling_ - true_value, 2)\n        error_weighted += np.power(weighted_sampling_ - true_value, 2)\n    error_ordinary /= runs\n    error_weighted /= runs\n\n    plt.plot(np.arange(1, episodes + 1), error_ordinary, color='green', label='Ordinary Importance Sampling')\n    plt.plot(np.arange(1, episodes + 1), error_weighted, color='red', label='Weighted Importance Sampling')\n    plt.ylim(-0.1, 5)\n    plt.xlabel('Episodes (log scale)')\n    plt.ylabel(f'Mean square error\\n(average over {runs} runs)')\n    plt.xscale('log')\n    plt.legend()\n\n    plt.savefig('../images/figure_5_3.png')\n    plt.close()\n\n\nif __name__ == '__main__':\n    figure_5_1()\n    figure_5_2()\n    figure_5_3()\n"
  },
  {
    "path": "chapter05/infinite_variance.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\n\nACTION_BACK = 0\nACTION_END = 1\n\n# behavior policy\ndef behavior_policy():\n    return np.random.binomial(1, 0.5)\n\n# target policy\ndef target_policy():\n    return ACTION_BACK\n\n# one turn\ndef play():\n    # track the action for importance ratio\n    trajectory = []\n    while True:\n        action = behavior_policy()\n        trajectory.append(action)\n        if action == ACTION_END:\n            return 0, trajectory\n        if np.random.binomial(1, 0.9) == 0:\n            return 1, trajectory\n\ndef figure_5_4():\n    runs = 10\n    episodes = 100000\n    for run in range(runs):\n        rewards = []\n        for episode in range(0, episodes):\n            reward, trajectory = play()\n            if trajectory[-1] == ACTION_END:\n                rho = 0\n            else:\n                rho = 1.0 / pow(0.5, len(trajectory))\n            rewards.append(rho * reward)\n        rewards = np.add.accumulate(rewards)\n        estimations = np.asarray(rewards) / np.arange(1, episodes + 1)\n        plt.plot(estimations)\n    plt.xlabel('Episodes (log scale)')\n    plt.ylabel('Ordinary Importance Sampling')\n    plt.xscale('log')\n\n    plt.savefig('../images/figure_5_4.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_5_4()\n"
  },
  {
    "path": "chapter06/cliff_walking.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\n\n# world height\nWORLD_HEIGHT = 4\n\n# world width\nWORLD_WIDTH = 12\n\n# probability for exploration\nEPSILON = 0.1\n\n# step size\nALPHA = 0.5\n\n# gamma for Q-Learning and Expected Sarsa\nGAMMA = 1\n\n# all possible actions\nACTION_UP = 0\nACTION_DOWN = 1\nACTION_LEFT = 2\nACTION_RIGHT = 3\nACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]\n\n# initial state action pair values\nSTART = [3, 0]\nGOAL = [3, 11]\n\ndef step(state, action):\n    i, j = state\n    if action == ACTION_UP:\n        next_state = [max(i - 1, 0), j]\n    elif action == ACTION_LEFT:\n        next_state = [i, max(j - 1, 0)]\n    elif action == ACTION_RIGHT:\n        next_state = [i, min(j + 1, WORLD_WIDTH - 1)]\n    elif action == ACTION_DOWN:\n        next_state = [min(i + 1, WORLD_HEIGHT - 1), j]\n    else:\n        assert False\n\n    reward = -1\n    if (action == ACTION_DOWN and i == 2 and 1 <= j <= 10) or (\n        action == ACTION_RIGHT and state == START):\n        reward = -100\n        next_state = START\n\n    return next_state, reward\n\n# reward for each action in each state\n# actionRewards = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))\n# actionRewards[:, :, :] = -1.0\n# actionRewards[2, 1:11, ACTION_DOWN] = -100.0\n# actionRewards[3, 0, ACTION_RIGHT] = -100.0\n\n# set up destinations for each action in each state\n# 
actionDestination = []\n# for i in range(0, WORLD_HEIGHT):\n#     actionDestination.append([])\n#     for j in range(0, WORLD_WIDTH):\n#         destinaion = dict()\n#         destinaion[ACTION_UP] = [max(i - 1, 0), j]\n#         destinaion[ACTION_LEFT] = [i, max(j - 1, 0)]\n#         destinaion[ACTION_RIGHT] = [i, min(j + 1, WORLD_WIDTH - 1)]\n#         if i == 2 and 1 <= j <= 10:\n#             destinaion[ACTION_DOWN] = START\n#         else:\n#             destinaion[ACTION_DOWN] = [min(i + 1, WORLD_HEIGHT - 1), j]\n#         actionDestination[-1].append(destinaion)\n# actionDestination[3][0][ACTION_RIGHT] = START\n\n# choose an action based on epsilon greedy algorithm\ndef choose_action(state, q_value):\n    if np.random.binomial(1, EPSILON) == 1:\n        return np.random.choice(ACTIONS)\n    else:\n        values_ = q_value[state[0], state[1], :]\n        return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])\n\n# an episode with Sarsa\n# @q_value: values for state action pair, will be updated\n# @expected: if True, will use expected Sarsa algorithm\n# @step_size: step size for updating\n# @return: total rewards within this episode\ndef sarsa(q_value, expected=False, step_size=ALPHA):\n    state = START\n    action = choose_action(state, q_value)\n    rewards = 0.0\n    while state != GOAL:\n        next_state, reward = step(state, action)\n        next_action = choose_action(next_state, q_value)\n        rewards += reward\n        if not expected:\n            target = q_value[next_state[0], next_state[1], next_action]\n        else:\n            # calculate the expected value of new state\n            target = 0.0\n            q_next = q_value[next_state[0], next_state[1], :]\n            best_actions = np.argwhere(q_next == np.max(q_next))\n            for action_ in ACTIONS:\n                if action_ in best_actions:\n                    target += ((1.0 - EPSILON) / len(best_actions) + EPSILON / 
len(ACTIONS)) * q_value[next_state[0], next_state[1], action_]\n                else:\n                    target += EPSILON / len(ACTIONS) * q_value[next_state[0], next_state[1], action_]\n        target *= GAMMA\n        q_value[state[0], state[1], action] += step_size * (\n                reward + target - q_value[state[0], state[1], action])\n        state = next_state\n        action = next_action\n    return rewards\n\n# an episode with Q-Learning\n# @q_value: values for state action pair, will be updated\n# @step_size: step size for updating\n# @return: total rewards within this episode\ndef q_learning(q_value, step_size=ALPHA):\n    state = START\n    rewards = 0.0\n    while state != GOAL:\n        action = choose_action(state, q_value)\n        next_state, reward = step(state, action)\n        rewards += reward\n        # Q-Learning update\n        q_value[state[0], state[1], action] += step_size * (\n                reward + GAMMA * np.max(q_value[next_state[0], next_state[1], :]) -\n                q_value[state[0], state[1], action])\n        state = next_state\n    return rewards\n\n# print optimal policy\ndef print_optimal_policy(q_value):\n    optimal_policy = []\n    for i in range(0, WORLD_HEIGHT):\n        optimal_policy.append([])\n        for j in range(0, WORLD_WIDTH):\n            if [i, j] == GOAL:\n                optimal_policy[-1].append('G')\n                continue\n            bestAction = np.argmax(q_value[i, j, :])\n            if bestAction == ACTION_UP:\n                optimal_policy[-1].append('U')\n            elif bestAction == ACTION_DOWN:\n                optimal_policy[-1].append('D')\n            elif bestAction == ACTION_LEFT:\n                optimal_policy[-1].append('L')\n            elif bestAction == ACTION_RIGHT:\n                optimal_policy[-1].append('R')\n    for row in optimal_policy:\n        print(row)\n\n# Use multiple runs instead of a single run and a sliding window\n# With a single run I failed to 
present a smooth curve\n# However, the optimal policy converges well with a single run\n# Sarsa converges to the safe path, while Q-Learning converges to the optimal path\ndef figure_6_4():\n    # episodes of each run\n    episodes = 500\n\n    # perform 50 independent runs\n    runs = 50\n\n    rewards_sarsa = np.zeros(episodes)\n    rewards_q_learning = np.zeros(episodes)\n    for r in tqdm(range(runs)):\n        q_sarsa = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))\n        q_q_learning = np.copy(q_sarsa)\n        for i in range(0, episodes):\n            # cut off the value by -100 to draw the figure more elegantly\n            # rewards_sarsa[i] += max(sarsa(q_sarsa), -100)\n            # rewards_q_learning[i] += max(q_learning(q_q_learning), -100)\n            rewards_sarsa[i] += sarsa(q_sarsa)\n            rewards_q_learning[i] += q_learning(q_q_learning)\n\n    # average over independent runs\n    rewards_sarsa /= runs\n    rewards_q_learning /= runs\n\n    # draw reward curves\n    plt.plot(rewards_sarsa, label='Sarsa')\n    plt.plot(rewards_q_learning, label='Q-Learning')\n    plt.xlabel('Episodes')\n    plt.ylabel('Sum of rewards during episode')\n    plt.ylim([-100, 0])\n    plt.legend()\n\n    plt.savefig('../images/figure_6_4.png')\n    plt.close()\n\n    # display optimal policy\n    print('Sarsa Optimal Policy:')\n    print_optimal_policy(q_sarsa)\n    print('Q-Learning Optimal Policy:')\n    print_optimal_policy(q_q_learning)\n\n# Due to the limited computing capacity of my machine, I can't complete this experiment\n# with 100,000 episodes and 50,000 runs to get the fully averaged performance.\n# However, even with only 1,000 episodes and 10 runs, the curves still look good.\ndef figure_6_6():\n    step_sizes = np.arange(0.1, 1.1, 0.1)\n    episodes = 1000\n    runs = 10\n\n    ASY_SARSA = 0\n    ASY_EXPECTED_SARSA = 1\n    ASY_QLEARNING = 2\n    INT_SARSA = 3\n    INT_EXPECTED_SARSA = 4\n    INT_QLEARNING = 5\n    methods = range(0, 6)\n\n    performance = np.zeros((6, len(step_sizes)))\n    for run in range(runs):\n        for ind, step_size in tqdm(list(enumerate(step_sizes))):\n            q_sarsa = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))\n            q_expected_sarsa = np.copy(q_sarsa)\n            q_q_learning = np.copy(q_sarsa)\n            for ep in range(episodes):\n                sarsa_reward = sarsa(q_sarsa, expected=False, step_size=step_size)\n                expected_sarsa_reward = sarsa(q_expected_sarsa, expected=True, step_size=step_size)\n                q_learning_reward = q_learning(q_q_learning, step_size=step_size)\n                performance[ASY_SARSA, ind] += sarsa_reward\n                performance[ASY_EXPECTED_SARSA, ind] += expected_sarsa_reward\n                performance[ASY_QLEARNING, ind] += q_learning_reward\n\n                if ep < 100:\n                    performance[INT_SARSA, ind] += sarsa_reward\n                    performance[INT_EXPECTED_SARSA, ind] += expected_sarsa_reward\n                    performance[INT_QLEARNING, ind] += q_learning_reward\n\n    performance[:3, :] /= episodes * runs\n    performance[3:, :] /= 100 * runs\n    labels = ['Asymptotic Sarsa', 'Asymptotic Expected Sarsa', 'Asymptotic Q-Learning',\n              'Interim Sarsa', 'Interim Expected Sarsa', 'Interim Q-Learning']\n\n    for method, label in zip(methods, labels):\n        plt.plot(step_sizes, performance[method, :], label=label)\n    plt.xlabel('alpha')\n    plt.ylabel('reward per episode')\n    plt.legend()\n\n    plt.savefig('../images/figure_6_6.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_6_4()\n    figure_6_6()\n"
  },
  {
    "path": "chapter06/maximization_bias.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\nimport copy\n\n# state A\nSTATE_A = 0\n\n# state B\nSTATE_B = 1\n\n# use one terminal state\nSTATE_TERMINAL = 2\n\n# starts from state A\nSTATE_START = STATE_A\n\n# possible actions in A\nACTION_A_RIGHT = 0\nACTION_A_LEFT = 1\n\n# probability for exploration\nEPSILON = 0.1\n\n# step size\nALPHA = 0.1\n\n# discount for max value\nGAMMA = 1.0\n\n# possible actions in B, maybe 10 actions\nACTIONS_B = range(0, 10)\n\n# all possible actions\nSTATE_ACTIONS = [[ACTION_A_RIGHT, ACTION_A_LEFT], ACTIONS_B]\n\n# state action pair values, if a state is a terminal state, then the value is always 0\nINITIAL_Q = [np.zeros(2), np.zeros(len(ACTIONS_B)), np.zeros(1)]\n\n# set up destination for each state and each action\nTRANSITION = [[STATE_TERMINAL, STATE_B], [STATE_TERMINAL] * len(ACTIONS_B)]\n\n# choose an action based on epsilon greedy algorithm\ndef choose_action(state, q_value):\n    if np.random.binomial(1, EPSILON) == 1:\n        return np.random.choice(STATE_ACTIONS[state])\n    else:\n        values_ = q_value[state]\n        return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])\n\n# take @action in @state, return the reward\ndef take_action(state, action):\n    if state == STATE_A:\n        return 0\n    return np.random.normal(-0.1, 1)\n\n# if there are two state action pair value 
array, use double Q-Learning\n# otherwise use normal Q-Learning\ndef q_learning(q1, q2=None):\n    state = STATE_START\n    # track the number of left actions taken in state A\n    left_count = 0\n    while state != STATE_TERMINAL:\n        if q2 is None:\n            action = choose_action(state, q1)\n        else:\n            # derive an action from Q1 and Q2\n            action = choose_action(state, [item1 + item2 for item1, item2 in zip(q1, q2)])\n        if state == STATE_A and action == ACTION_A_LEFT:\n            left_count += 1\n        reward = take_action(state, action)\n        next_state = TRANSITION[state][action]\n        if q2 is None:\n            active_q = q1\n            target = np.max(active_q[next_state])\n        else:\n            if np.random.binomial(1, 0.5) == 1:\n                active_q = q1\n                target_q = q2\n            else:\n                active_q = q2\n                target_q = q1\n            best_action = np.random.choice([action_ for action_, value_ in enumerate(active_q[next_state]) if value_ == np.max(active_q[next_state])])\n            target = target_q[next_state][best_action]\n\n        # Q-Learning update\n        active_q[state][action] += ALPHA * (\n            reward + GAMMA * target - active_q[state][action])\n        state = next_state\n    return left_count\n\n# Figure 6.7. 1,000 runs may be enough; the number of actions in state B will also affect the curves\ndef figure_6_7():\n    # each independent run has 300 episodes\n    episodes = 300\n    runs = 1000\n    left_counts_q = np.zeros((runs, episodes))\n    left_counts_double_q = np.zeros((runs, episodes))\n    for run in tqdm(range(runs)):\n        q = copy.deepcopy(INITIAL_Q)\n        q1 = copy.deepcopy(INITIAL_Q)\n        q2 = copy.deepcopy(INITIAL_Q)\n        for ep in range(0, episodes):\n            left_counts_q[run, ep] = q_learning(q)\n            left_counts_double_q[run, ep] = q_learning(q1, q2)\n    left_counts_q = left_counts_q.mean(axis=0)\n    left_counts_double_q = left_counts_double_q.mean(axis=0)\n\n    plt.plot(left_counts_q, label='Q-Learning')\n    plt.plot(left_counts_double_q, label='Double Q-Learning')\n    plt.plot(np.ones(episodes) * 0.05, label='Optimal')\n    plt.xlabel('episodes')\n    plt.ylabel('% left actions from A')\n    plt.legend()\n\n    plt.savefig('../images/figure_6_7.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_6_7()"
  },
  {
    "path": "chapter06/random_walk.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\n\n# 0 is the left terminal state\n# 6 is the right terminal state\n# 1 ... 5 represents A ... E\nVALUES = np.zeros(7)\nVALUES[1:6] = 0.5\n# For convenience, we assume all rewards are 0\n# and the left terminal state has value 0, the right terminal state has value 1\n# This trick has been used in Gambler's Problem\nVALUES[6] = 1\n\n# set up true state values\nTRUE_VALUE = np.zeros(7)\nTRUE_VALUE[1:6] = np.arange(1, 6) / 6.0\nTRUE_VALUE[6] = 1\n\nACTION_LEFT = 0\nACTION_RIGHT = 1\n\n# @values: current states value, will be updated if @batch is False\n# @alpha: step size\n# @batch: whether to update @values\ndef temporal_difference(values, alpha=0.1, batch=False):\n    state = 3\n    trajectory = [state]\n    rewards = [0]\n    while True:\n        old_state = state\n        if np.random.binomial(1, 0.5) == ACTION_LEFT:\n            state -= 1\n        else:\n            state += 1\n        # Assume all rewards are 0\n        reward = 0\n        trajectory.append(state)\n        # TD update\n        if not batch:\n            values[old_state] += alpha * (reward + values[state] - values[old_state])\n        if state == 6 or state == 0:\n            break\n        rewards.append(reward)\n    return trajectory, rewards\n\n# @values: current states value, will be updated if @batch is False\n# @alpha: step size\n# @batch: whether to 
update @values\ndef monte_carlo(values, alpha=0.1, batch=False):\n    state = 3\n    trajectory = [state]\n\n    # if end up with left terminal state, all returns are 0\n    # if end up with right terminal state, all returns are 1\n    while True:\n        if np.random.binomial(1, 0.5) == ACTION_LEFT:\n            state -= 1\n        else:\n            state += 1\n        trajectory.append(state)\n        if state == 6:\n            returns = 1.0\n            break\n        elif state == 0:\n            returns = 0.0\n            break\n\n    if not batch:\n        for state_ in trajectory[:-1]:\n            # MC update\n            values[state_] += alpha * (returns - values[state_])\n    return trajectory, [returns] * (len(trajectory) - 1)\n\n# Example 6.2 left\ndef compute_state_value():\n    episodes = [0, 1, 10, 100]\n    current_values = np.copy(VALUES)\n    plt.figure(1)\n    for i in range(episodes[-1] + 1):\n        if i in episodes:\n            plt.plot((\"A\", \"B\", \"C\", \"D\", \"E\"), current_values[1:6], label=str(i) + ' episodes')\n        temporal_difference(current_values)\n    plt.plot((\"A\", \"B\", \"C\", \"D\", \"E\"), TRUE_VALUE[1:6], label='true values')\n    plt.xlabel('State')\n    plt.ylabel('Estimated Value')\n    plt.legend()\n\n# Example 6.2 right\ndef rms_error():\n    # Same alpha value can appear in both arrays\n    td_alphas = [0.15, 0.1, 0.05]\n    mc_alphas = [0.01, 0.02, 0.03, 0.04]\n    episodes = 100 + 1\n    runs = 100\n    for i, alpha in enumerate(td_alphas + mc_alphas):\n        total_errors = np.zeros(episodes)\n        if i < len(td_alphas):\n            method = 'TD'\n            linestyle = 'solid'\n        else:\n            method = 'MC'\n            linestyle = 'dashdot'\n        for r in tqdm(range(runs)):\n            errors = []\n            current_values = np.copy(VALUES)\n            for i in range(0, episodes):\n                errors.append(np.sqrt(np.sum(np.power(TRUE_VALUE - current_values, 2)) / 5.0))\n 
               if method == 'TD':\n                    temporal_difference(current_values, alpha=alpha)\n                else:\n                    monte_carlo(current_values, alpha=alpha)\n            total_errors += np.asarray(errors)\n        total_errors /= runs\n        plt.plot(total_errors, linestyle=linestyle, label=method + ', $\\\\alpha$ = %.02f' % (alpha))\n    plt.xlabel('Walks/Episodes')\n    plt.ylabel('Empirical RMS error, averaged over states')\n    plt.legend()\n\n# Figure 6.2\n# @method: 'TD' or 'MC'\ndef batch_updating(method, episodes, alpha=0.001):\n    # perform 100 independent runs\n    runs = 100\n    total_errors = np.zeros(episodes)\n    for r in tqdm(range(0, runs)):\n        current_values = np.copy(VALUES)\n        current_values[1:6] = -1\n        errors = []\n        # track shown trajectories and reward/return sequences\n        trajectories = []\n        rewards = []\n        for ep in range(episodes):\n            if method == 'TD':\n                trajectory_, rewards_ = temporal_difference(current_values, batch=True)\n            else:\n                trajectory_, rewards_ = monte_carlo(current_values, batch=True)\n            trajectories.append(trajectory_)\n            rewards.append(rewards_)\n            while True:\n                # keep feeding our algorithm with trajectories seen so far until state value function converges\n                updates = np.zeros(7)\n                for trajectory_, rewards_ in zip(trajectories, rewards):\n                    for i in range(0, len(trajectory_) - 1):\n                        if method == 'TD':\n                            updates[trajectory_[i]] += rewards_[i] + current_values[trajectory_[i + 1]] - current_values[trajectory_[i]]\n                        else:\n                            updates[trajectory_[i]] += rewards_[i] - current_values[trajectory_[i]]\n                updates *= alpha\n                if np.sum(np.abs(updates)) < 1e-3:\n                    break\n     
           # perform batch updating\n                current_values += updates\n            # calculate rms error\n            errors.append(np.sqrt(np.sum(np.power(current_values - TRUE_VALUE, 2)) / 5.0))\n        total_errors += np.asarray(errors)\n    total_errors /= runs\n    return total_errors\n\ndef example_6_2():\n    plt.figure(figsize=(10, 20))\n    plt.subplot(2, 1, 1)\n    compute_state_value()\n\n    plt.subplot(2, 1, 2)\n    rms_error()\n    plt.tight_layout()\n\n    plt.savefig('../images/example_6_2.png')\n    plt.close()\n\ndef figure_6_2():\n    episodes = 100 + 1\n    td_errors = batch_updating('TD', episodes)\n    mc_errors = batch_updating('MC', episodes)\n\n    plt.plot(td_errors, label='TD')\n    plt.plot(mc_errors, label='MC')\n    plt.title(\"Batch Training\")\n    plt.xlabel('Walks/Episodes')\n    plt.ylabel('RMS error, averaged over states')\n    plt.xlim(0, 100)\n    plt.ylim(0, 0.25)\n    plt.legend()\n\n    plt.savefig('../images/figure_6_2.png')\n    plt.close()\n\nif __name__ == '__main__':\n    example_6_2()\n    figure_6_2()\n"
  },
  {
    "path": "chapter06/windy_grid_world.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\n\n# world height\nWORLD_HEIGHT = 7\n\n# world width\nWORLD_WIDTH = 10\n\n# wind strength for each column\nWIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]\n\n# possible actions\nACTION_UP = 0\nACTION_DOWN = 1\nACTION_LEFT = 2\nACTION_RIGHT = 3\n\n# probability for exploration\nEPSILON = 0.1\n\n# Sarsa step size\nALPHA = 0.5\n\n# reward for each step\nREWARD = -1.0\n\nSTART = [3, 0]\nGOAL = [3, 7]\nACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]\n\ndef step(state, action):\n    i, j = state\n    if action == ACTION_UP:\n        return [max(i - 1 - WIND[j], 0), j]\n    elif action == ACTION_DOWN:\n        return [max(min(i + 1 - WIND[j], WORLD_HEIGHT - 1), 0), j]\n    elif action == ACTION_LEFT:\n        return [max(i - WIND[j], 0), max(j - 1, 0)]\n    elif action == ACTION_RIGHT:\n        return [max(i - WIND[j], 0), min(j + 1, WORLD_WIDTH - 1)]\n    else:\n        assert False\n\n# play for an episode\ndef episode(q_value):\n    # track the total time steps in this episode\n    time = 0\n\n    # initialize state\n    state = START\n\n    # choose an action based on epsilon-greedy algorithm\n    if np.random.binomial(1, EPSILON) == 1:\n        action = np.random.choice(ACTIONS)\n    else:\n        values_ = q_value[state[0], state[1], :]\n        action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == 
np.max(values_)])\n\n    # keep going until get to the goal state\n    while state != GOAL:\n        next_state = step(state, action)\n        if np.random.binomial(1, EPSILON) == 1:\n            next_action = np.random.choice(ACTIONS)\n        else:\n            values_ = q_value[next_state[0], next_state[1], :]\n            next_action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])\n\n        # Sarsa update\n        q_value[state[0], state[1], action] += \\\n            ALPHA * (REWARD + q_value[next_state[0], next_state[1], next_action] -\n                     q_value[state[0], state[1], action])\n        state = next_state\n        action = next_action\n        time += 1\n    return time\n\ndef figure_6_3():\n    q_value = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))\n    episode_limit = 500\n\n    steps = []\n    ep = 0\n    while ep < episode_limit:\n        steps.append(episode(q_value))\n        # time = episode(q_value)\n        # episodes.extend([ep] * time)\n        ep += 1\n\n    steps = np.add.accumulate(steps)\n\n    plt.plot(steps, np.arange(1, len(steps) + 1))\n    plt.xlabel('Time steps')\n    plt.ylabel('Episodes')\n\n    plt.savefig('../images/figure_6_3.png')\n    plt.close()\n\n    # display the optimal policy\n    optimal_policy = []\n    for i in range(0, WORLD_HEIGHT):\n        optimal_policy.append([])\n        for j in range(0, WORLD_WIDTH):\n            if [i, j] == GOAL:\n                optimal_policy[-1].append('G')\n                continue\n            bestAction = np.argmax(q_value[i, j, :])\n            if bestAction == ACTION_UP:\n                optimal_policy[-1].append('U')\n            elif bestAction == ACTION_DOWN:\n                optimal_policy[-1].append('D')\n            elif bestAction == ACTION_LEFT:\n                optimal_policy[-1].append('L')\n            elif bestAction == ACTION_RIGHT:\n                optimal_policy[-1].append('R')\n    print('Optimal policy 
is:')\n    for row in optimal_policy:\n        print(row)\n    print('Wind strength for each column:\\n{}'.format([str(w) for w in WIND]))\n\nif __name__ == '__main__':\n    figure_6_3()\n\n"
  },
  {
    "path": "chapter07/random_walk.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\n\n# all states\nN_STATES = 19\n\n# discount\nGAMMA = 1\n\n# all states but terminal states\nSTATES = np.arange(1, N_STATES + 1)\n\n# start from the middle state\nSTART_STATE = 10\n\n# two terminal states\n# an action leading to the left terminal state has reward -1\n# an action leading to the right terminal state has reward 1\nEND_STATES = [0, N_STATES + 1]\n\n# true state value from bellman equation\nTRUE_VALUE = np.arange(-20, 22, 2) / 20.0\nTRUE_VALUE[0] = TRUE_VALUE[-1] = 0\n\n# n-steps TD method\n# @value: values for each state, will be updated\n# @n: # of steps\n# @alpha: # step size\ndef temporal_difference(value, n, alpha):\n    # initial starting state\n    state = START_STATE\n\n    # arrays to store states and rewards for an episode\n    # space isn't a major consideration, so I didn't use the mod trick\n    states = [state]\n    rewards = [0]\n\n    # track the time\n    time = 0\n\n    # the length of this episode\n    T = float('inf')\n    while True:\n        # go to next time step\n        time += 1\n\n        if time < T:\n            # choose an action randomly\n            if np.random.binomial(1, 0.5) == 1:\n                next_state = state + 1\n            else:\n                next_state = state - 1\n\n            if next_state == 0:\n                reward = -1\n            elif next_state == 20:\n        
        reward = 1\n            else:\n                reward = 0\n\n            # store new state and new reward\n            states.append(next_state)\n            rewards.append(reward)\n\n            if next_state in END_STATES:\n                T = time\n\n        # get the time of the state to update\n        update_time = time - n\n        if update_time >= 0:\n            returns = 0.0\n            # calculate corresponding rewards\n            for t in range(update_time + 1, min(T, update_time + n) + 1):\n                returns += pow(GAMMA, t - update_time - 1) * rewards[t]\n            # add state value to the return\n            if update_time + n <= T:\n                returns += pow(GAMMA, n) * value[states[(update_time + n)]]\n            state_to_update = states[update_time]\n            # update the state value\n            if not state_to_update in END_STATES:\n                value[state_to_update] += alpha * (returns - value[state_to_update])\n        if update_time == T - 1:\n            break\n        state = next_state\n\n# Figure 7.2, it will take quite a while\ndef figure7_2():\n    # all possible steps\n    steps = np.power(2, np.arange(0, 10))\n\n    # all possible alphas\n    alphas = np.arange(0, 1.1, 0.1)\n\n    # each run has 10 episodes\n    episodes = 10\n\n    # perform 100 independent runs\n    runs = 100\n\n    # track the errors for each (step, alpha) combination\n    errors = np.zeros((len(steps), len(alphas)))\n    for run in tqdm(range(0, runs)):\n        for step_ind, step in enumerate(steps):\n            for alpha_ind, alpha in enumerate(alphas):\n                # print('run:', run, 'step:', step, 'alpha:', alpha)\n                value = np.zeros(N_STATES + 2)\n                for ep in range(0, episodes):\n                    temporal_difference(value, step, alpha)\n                    # calculate the RMS error\n                    errors[step_ind, alpha_ind] += np.sqrt(np.sum(np.power(value - TRUE_VALUE, 2)) / 
N_STATES)\n    # take average\n    errors /= episodes * runs\n\n    for i in range(0, len(steps)):\n        plt.plot(alphas, errors[i, :], label='n = %d' % (steps[i]))\n    plt.xlabel('alpha')\n    plt.ylabel('RMS error')\n    plt.ylim([0.25, 0.55])\n    plt.legend()\n\n    plt.savefig('../images/figure_7_2.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure7_2()\n\n\n"
  },
  {
    "path": "chapter08/expectation_vs_sample.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)                  #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\n\n# for figure 8.7, run a simulation of 2 * @b steps\ndef b_steps(b):\n    # set the value of the next b states\n    # it is not clear how to set this\n    distribution = np.random.randn(b)\n\n    # true value of the current state\n    true_v = np.mean(distribution)\n\n    samples = []\n    errors = []\n\n    # sample 2b steps\n    for t in range(2 * b):\n        v = np.random.choice(distribution)\n        samples.append(v)\n        errors.append(np.abs(np.mean(samples) - true_v))\n\n    return errors\n\ndef figure_8_7():\n    runs = 100\n    branch = [2, 10, 100, 1000]\n    for b in branch:\n        errors = np.zeros((runs, 2 * b))\n        for r in tqdm(np.arange(runs)):\n            errors[r] = b_steps(b)\n        errors = errors.mean(axis=0)\n        x_axis = (np.arange(len(errors)) + 1) / float(b)\n        plt.plot(x_axis, errors, label='b = %d' % (b))\n\n    plt.xlabel('number of computations')\n    plt.xticks([0, 1.0, 2.0], ['0', 'b', '2b'])\n    plt.ylabel('RMS error')\n    plt.legend()\n\n    plt.savefig('../images/figure_8_7.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_8_7()\n"
  },
  {
    "path": "chapter08/maze.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\nimport heapq\nfrom copy import deepcopy\n\nclass PriorityQueue:\n    def __init__(self):\n        self.pq = []\n        self.entry_finder = {}\n        self.REMOVED = '<removed-task>'\n        self.counter = 0\n\n    def add_item(self, item, priority=0):\n        if item in self.entry_finder:\n            self.remove_item(item)\n        entry = [priority, self.counter, item]\n        self.counter += 1\n        self.entry_finder[item] = entry\n        heapq.heappush(self.pq, entry)\n\n    def remove_item(self, item):\n        entry = self.entry_finder.pop(item)\n        entry[-1] = self.REMOVED\n\n    def pop_item(self):\n        while self.pq:\n            priority, count, item = heapq.heappop(self.pq)\n            if item is not self.REMOVED:\n                del self.entry_finder[item]\n                return item, priority\n        raise KeyError('pop from an empty priority queue')\n\n    def empty(self):\n        return not self.entry_finder\n\n# A wrapper class for a maze, containing all the information about the maze.\n# Basically it's initialized to DynaMaze by default, however it can be easily adapted\n# to other maze\nclass Maze:\n    def __init__(self):\n        # maze width\n        self.WORLD_WIDTH = 9\n\n        # maze height\n        self.WORLD_HEIGHT = 6\n\n        # all possible actions\n        self.ACTION_UP = 
0\n        self.ACTION_DOWN = 1\n        self.ACTION_LEFT = 2\n        self.ACTION_RIGHT = 3\n        self.actions = [self.ACTION_UP, self.ACTION_DOWN, self.ACTION_LEFT, self.ACTION_RIGHT]\n\n        # start state\n        self.START_STATE = [2, 0]\n\n        # goal state\n        self.GOAL_STATES = [[0, 8]]\n\n        # all obstacles\n        self.obstacles = [[1, 2], [2, 2], [3, 2], [0, 7], [1, 7], [2, 7], [4, 5]]\n        self.old_obstacles = None\n        self.new_obstacles = None\n\n        # time to change obstacles\n        self.obstacle_switch_time = None\n\n        # initial state action pair values\n        # self.stateActionValues = np.zeros((self.WORLD_HEIGHT, self.WORLD_WIDTH, len(self.actions)))\n\n        # the size of q value\n        self.q_size = (self.WORLD_HEIGHT, self.WORLD_WIDTH, len(self.actions))\n\n        # max steps\n        self.max_steps = float('inf')\n\n        # track the resolution for this maze\n        self.resolution = 1\n\n    # extend a state to a higher resolution maze\n    # @state: state in lower resolution maze\n    # @factor: extension factor, one state will become factor^2 states after extension\n    def extend_state(self, state, factor):\n        new_state = [state[0] * factor, state[1] * factor]\n        new_states = []\n        for i in range(0, factor):\n            for j in range(0, factor):\n                new_states.append([new_state[0] + i, new_state[1] + j])\n        return new_states\n\n    # extend a state into higher resolution\n    # one state in original maze will become @factor^2 states in @return new maze\n    def extend_maze(self, factor):\n        new_maze = Maze()\n        new_maze.WORLD_WIDTH = self.WORLD_WIDTH * factor\n        new_maze.WORLD_HEIGHT = self.WORLD_HEIGHT * factor\n        new_maze.START_STATE = [self.START_STATE[0] * factor, self.START_STATE[1] * factor]\n        new_maze.GOAL_STATES = self.extend_state(self.GOAL_STATES[0], factor)\n        new_maze.obstacles = []\n        for state in 
self.obstacles:\n            new_maze.obstacles.extend(self.extend_state(state, factor))\n        new_maze.q_size = (new_maze.WORLD_HEIGHT, new_maze.WORLD_WIDTH, len(new_maze.actions))\n        # new_maze.stateActionValues = np.zeros((new_maze.WORLD_HEIGHT, new_maze.WORLD_WIDTH, len(new_maze.actions)))\n        new_maze.resolution = factor\n        return new_maze\n\n    # take @action in @state\n    # @return: [new state, reward]\n    def step(self, state, action):\n        x, y = state\n        if action == self.ACTION_UP:\n            x = max(x - 1, 0)\n        elif action == self.ACTION_DOWN:\n            x = min(x + 1, self.WORLD_HEIGHT - 1)\n        elif action == self.ACTION_LEFT:\n            y = max(y - 1, 0)\n        elif action == self.ACTION_RIGHT:\n            y = min(y + 1, self.WORLD_WIDTH - 1)\n        if [x, y] in self.obstacles:\n            x, y = state\n        if [x, y] in self.GOAL_STATES:\n            reward = 1.0\n        else:\n            reward = 0.0\n        return [x, y], reward\n\n# a wrapper class for parameters of dyna algorithms\nclass DynaParams:\n    def __init__(self):\n        # discount\n        self.gamma = 0.95\n\n        # probability for exploration\n        self.epsilon = 0.1\n\n        # step size\n        self.alpha = 0.1\n\n        # weight for elapsed time\n        self.time_weight = 0\n\n        # n-step planning\n        self.planning_steps = 5\n\n        # average over several independent runs\n        self.runs = 10\n\n        # algorithm names\n        self.methods = ['Dyna-Q', 'Dyna-Q+']\n\n        # threshold for priority queue\n        self.theta = 0\n\n\n# choose an action based on epsilon-greedy algorithm\ndef choose_action(state, q_value, maze, dyna_params):\n    if np.random.binomial(1, dyna_params.epsilon) == 1:\n        return np.random.choice(maze.actions)\n    else:\n        values = q_value[state[0], state[1], :]\n        return np.random.choice([action for action, value in enumerate(values) if value 
== np.max(values)])\n\n# Trivial model for planning in Dyna-Q\nclass TrivialModel:\n    # @rand: an instance of np.random.RandomState for sampling\n    def __init__(self, rand=np.random):\n        self.model = dict()\n        self.rand = rand\n\n    # feed the model with previous experience\n    def feed(self, state, action, next_state, reward):\n        state = deepcopy(state)\n        next_state = deepcopy(next_state)\n        if tuple(state) not in self.model.keys():\n            self.model[tuple(state)] = dict()\n        self.model[tuple(state)][action] = [list(next_state), reward]\n\n    # randomly sample from previous experience\n    def sample(self):\n        state_index = self.rand.choice(range(len(self.model.keys())))\n        state = list(self.model)[state_index]\n        action_index = self.rand.choice(range(len(self.model[state].keys())))\n        action = list(self.model[state])[action_index]\n        next_state, reward = self.model[state][action]\n        state = deepcopy(state)\n        next_state = deepcopy(next_state)\n        return list(state), action, list(next_state), reward\n\n# Time-based model for planning in Dyna-Q+\nclass TimeModel:\n    # @maze: the maze instance. 
Indeed it's not very reasonable to give access to maze to the model.\n    # @time_weight: also called kappa, the weight for elapsed time in sampling reward, it needs to be small\n    # @rand: an instance of np.random.RandomState for sampling\n    def __init__(self, maze, time_weight=1e-4, rand=np.random):\n        self.rand = rand\n        self.model = dict()\n\n        # track the total time\n        self.time = 0\n\n        self.time_weight = time_weight\n        self.maze = maze\n\n    # feed the model with previous experience\n    def feed(self, state, action, next_state, reward):\n        state = deepcopy(state)\n        next_state = deepcopy(next_state)\n        self.time += 1\n        if tuple(state) not in self.model.keys():\n            self.model[tuple(state)] = dict()\n\n            # Actions that had never been tried before from a state were allowed to be considered in the planning step\n            for action_ in self.maze.actions:\n                if action_ != action:\n                    # Such actions would lead back to the same state with a reward of zero\n                    # Notice that the minimum time stamp is 1 instead of 0\n                    self.model[tuple(state)][action_] = [list(state), 0, 1]\n\n        self.model[tuple(state)][action] = [list(next_state), reward, self.time]\n\n    # randomly sample from previous experience\n    def sample(self):\n        state_index = self.rand.choice(range(len(self.model.keys())))\n        state = list(self.model)[state_index]\n        action_index = self.rand.choice(range(len(self.model[state].keys())))\n        action = list(self.model[state])[action_index]\n        next_state, reward, time = self.model[state][action]\n\n        # adjust reward with elapsed time since last visit\n        reward += self.time_weight * np.sqrt(self.time - time)\n\n        state = deepcopy(state)\n        next_state = deepcopy(next_state)\n\n        return list(state), action, list(next_state), reward\n\n# Model 
containing a priority queue for Prioritized Sweeping\nclass PriorityModel(TrivialModel):\n    def __init__(self, rand=np.random):\n        TrivialModel.__init__(self, rand)\n        # maintain a priority queue\n        self.priority_queue = PriorityQueue()\n        # track predecessors for every state\n        self.predecessors = dict()\n\n    # add a @state-@action pair into the priority queue with priority @priority\n    def insert(self, priority, state, action):\n        # note the priority queue is a minimum heap, so we use -priority\n        self.priority_queue.add_item((tuple(state), action), -priority)\n\n    # @return: whether the priority queue is empty\n    def empty(self):\n        return self.priority_queue.empty()\n\n    # get the first item in the priority queue\n    def sample(self):\n        (state, action), priority = self.priority_queue.pop_item()\n        next_state, reward = self.model[state][action]\n        state = deepcopy(state)\n        next_state = deepcopy(next_state)\n        return -priority, list(state), action, list(next_state), reward\n\n    # feed the model with previous experience\n    def feed(self, state, action, next_state, reward):\n        state = deepcopy(state)\n        next_state = deepcopy(next_state)\n        TrivialModel.feed(self, state, action, next_state, reward)\n        if tuple(next_state) not in self.predecessors.keys():\n            self.predecessors[tuple(next_state)] = set()\n        self.predecessors[tuple(next_state)].add((tuple(state), action))\n\n    # get all seen predecessors of a state @state\n    def predecessor(self, state):\n        if tuple(state) not in self.predecessors.keys():\n            return []\n        predecessors = []\n        for state_pre, action_pre in list(self.predecessors[tuple(state)]):\n            predecessors.append([list(state_pre), action_pre, self.model[state_pre][action_pre][1]])\n        return predecessors\n\n\n# play for an episode for Dyna-Q algorithm\n# @q_value: state 
action pair values, will be updated\n# @model: model instance for planning\n# @maze: a maze instance containing all information about the environment\n# @dyna_params: several params for the algorithm\ndef dyna_q(q_value, model, maze, dyna_params):\n    state = maze.START_STATE\n    steps = 0\n    while state not in maze.GOAL_STATES:\n        # track the steps\n        steps += 1\n\n        # get action\n        action = choose_action(state, q_value, maze, dyna_params)\n\n        # take action\n        next_state, reward = maze.step(state, action)\n\n        # Q-Learning update\n        q_value[state[0], state[1], action] += \\\n            dyna_params.alpha * (reward + dyna_params.gamma * np.max(q_value[next_state[0], next_state[1], :]) -\n                                 q_value[state[0], state[1], action])\n\n        # feed the model with experience\n        model.feed(state, action, next_state, reward)\n\n        # sample experience from the model\n        for t in range(0, dyna_params.planning_steps):\n            state_, action_, next_state_, reward_ = model.sample()\n            q_value[state_[0], state_[1], action_] += \\\n                dyna_params.alpha * (reward_ + dyna_params.gamma * np.max(q_value[next_state_[0], next_state_[1], :]) -\n                                     q_value[state_[0], state_[1], action_])\n\n        state = next_state\n\n        # check whether it has exceeded the step limit\n        if steps > maze.max_steps:\n            break\n\n    return steps\n\n# play for an episode for prioritized sweeping algorithm\n# @q_value: state action pair values, will be updated\n# @model: model instance for planning\n# @maze: a maze instance containing all information about the environment\n# @dyna_params: several params for the algorithm\n# @return: # of backups during this episode\ndef prioritized_sweeping(q_value, model, maze, dyna_params):\n    state = maze.START_STATE\n\n    # track the steps in this episode\n    steps = 0\n\n    # track the 
backups in planning phase\n    backups = 0\n\n    while state not in maze.GOAL_STATES:\n        steps += 1\n\n        # get action\n        action = choose_action(state, q_value, maze, dyna_params)\n\n        # take action\n        next_state, reward = maze.step(state, action)\n\n        # feed the model with experience\n        model.feed(state, action, next_state, reward)\n\n        # get the priority for current state action pair\n        priority = np.abs(reward + dyna_params.gamma * np.max(q_value[next_state[0], next_state[1], :]) -\n                          q_value[state[0], state[1], action])\n\n        if priority > dyna_params.theta:\n            model.insert(priority, state, action)\n\n        # start planning\n        planning_step = 0\n\n        # planning for several steps,\n        # although keep planning until the priority queue becomes empty will converge much faster\n        while planning_step < dyna_params.planning_steps and not model.empty():\n            # get a sample with highest priority from the model\n            priority, state_, action_, next_state_, reward_ = model.sample()\n\n            # update the state action value for the sample\n            delta = reward_ + dyna_params.gamma * np.max(q_value[next_state_[0], next_state_[1], :]) - \\\n                    q_value[state_[0], state_[1], action_]\n            q_value[state_[0], state_[1], action_] += dyna_params.alpha * delta\n\n            # deal with all the predecessors of the sample state\n            for state_pre, action_pre, reward_pre in model.predecessor(state_):\n                priority = np.abs(reward_pre + dyna_params.gamma * np.max(q_value[state_[0], state_[1], :]) -\n                                  q_value[state_pre[0], state_pre[1], action_pre])\n                if priority > dyna_params.theta:\n                    model.insert(priority, state_pre, action_pre)\n            planning_step += 1\n\n        state = next_state\n\n        # update the # of backups\n       
 backups += planning_step + 1\n\n    return backups\n\n# Figure 8.2, DynaMaze, use 10 runs instead of 30 runs\ndef figure_8_2():\n    # set up an instance for DynaMaze\n    dyna_maze = Maze()\n    dyna_params = DynaParams()\n\n    runs = 10\n    episodes = 50\n    planning_steps = [0, 5, 50]\n    steps = np.zeros((len(planning_steps), episodes))\n\n    for run in tqdm(range(runs)):\n        for i, planning_step in enumerate(planning_steps):\n            dyna_params.planning_steps = planning_step\n            q_value = np.zeros(dyna_maze.q_size)\n\n            # generate an instance of Dyna-Q model\n            model = TrivialModel()\n            for ep in range(episodes):\n                # print('run:', run, 'planning step:', planning_step, 'episode:', ep)\n                steps[i, ep] += dyna_q(q_value, model, dyna_maze, dyna_params)\n\n    # averaging over runs\n    steps /= runs\n\n    for i in range(len(planning_steps)):\n        plt.plot(steps[i, :], label='%d planning steps' % (planning_steps[i]))\n    plt.xlabel('episodes')\n    plt.ylabel('steps per episode')\n    plt.legend()\n\n    plt.savefig('../images/figure_8_2.png')\n    plt.close()\n\n# wrapper function for changing maze\n# @maze: a maze instance\n# @dynaParams: several parameters for dyna algorithms\ndef changing_maze(maze, dyna_params):\n\n    # set up max steps\n    max_steps = maze.max_steps\n\n    # track the cumulative rewards\n    rewards = np.zeros((dyna_params.runs, 2, max_steps))\n\n    for run in tqdm(range(dyna_params.runs)):\n        # set up models\n        models = [TrivialModel(), TimeModel(maze, time_weight=dyna_params.time_weight)]\n\n        # initialize state action values\n        q_values = [np.zeros(maze.q_size), np.zeros(maze.q_size)]\n\n        for i in range(len(dyna_params.methods)):\n            # print('run:', run, dyna_params.methods[i])\n\n            # set old obstacles for the maze\n            maze.obstacles = maze.old_obstacles\n\n            steps = 0\n           
 last_steps = steps\n            while steps < max_steps:\n                # play for an episode\n                steps += dyna_q(q_values[i], models[i], maze, dyna_params)\n\n                # update cumulative rewards\n                rewards[run, i, last_steps: steps] = rewards[run, i, last_steps]\n                rewards[run, i, min(steps, max_steps - 1)] = rewards[run, i, last_steps] + 1\n                last_steps = steps\n\n                if steps > maze.obstacle_switch_time:\n                    # change the obstacles\n                    maze.obstacles = maze.new_obstacles\n\n    # averaging over runs\n    rewards = rewards.mean(axis=0)\n\n    return rewards\n\n# Figure 8.4, BlockingMaze\ndef figure_8_4():\n    # set up a blocking maze instance\n    blocking_maze = Maze()\n    blocking_maze.START_STATE = [5, 3]\n    blocking_maze.GOAL_STATES = [[0, 8]]\n    blocking_maze.old_obstacles = [[3, i] for i in range(0, 8)]\n\n    # new obstacles will block the optimal path\n    blocking_maze.new_obstacles = [[3, i] for i in range(1, 9)]\n\n    # step limit\n    blocking_maze.max_steps = 3000\n\n    # obstacles will change after 1000 steps\n    # the switch is only checked between episodes, so the exact step of the change varies\n    # However, given that 1000 steps is long enough for both algorithms to converge,\n    # the difference should be very small\n    blocking_maze.obstacle_switch_time = 1000\n\n    # set up parameters\n    dyna_params = DynaParams()\n    dyna_params.alpha = 1.0\n    dyna_params.planning_steps = 10\n    dyna_params.runs = 20\n\n    # kappa must be small, as the reward for reaching the goal is only 1\n    dyna_params.time_weight = 1e-4\n\n    # play\n    rewards = changing_maze(blocking_maze, dyna_params)\n\n    for i in range(len(dyna_params.methods)):\n        plt.plot(rewards[i, :], label=dyna_params.methods[i])\n    plt.xlabel('time steps')\n    plt.ylabel('cumulative reward')\n    plt.legend()\n\n    plt.savefig('../images/figure_8_4.png')\n    plt.close()\n\n# 
Figure 8.5, ShortcutMaze\ndef figure_8_5():\n    # set up a shortcut maze instance\n    shortcut_maze = Maze()\n    shortcut_maze.START_STATE = [5, 3]\n    shortcut_maze.GOAL_STATES = [[0, 8]]\n    shortcut_maze.old_obstacles = [[3, i] for i in range(1, 9)]\n\n    # new obstacles will open up a shorter path\n    shortcut_maze.new_obstacles = [[3, i] for i in range(1, 8)]\n\n    # step limit\n    shortcut_maze.max_steps = 6000\n\n    # obstacles will change after 3000 steps\n    # the switch is only checked between episodes, so the exact step of the change varies\n    # However, given that 3000 steps is long enough for both algorithms to converge,\n    # the difference should be very small\n    shortcut_maze.obstacle_switch_time = 3000\n\n    # set up parameters\n    dyna_params = DynaParams()\n\n    # 50-step planning\n    dyna_params.planning_steps = 50\n    dyna_params.runs = 5\n    dyna_params.time_weight = 1e-3\n    dyna_params.alpha = 1.0\n\n    # play\n    rewards = changing_maze(shortcut_maze, dyna_params)\n\n    for i in range(len(dyna_params.methods)):\n        plt.plot(rewards[i, :], label=dyna_params.methods[i])\n    plt.xlabel('time steps')\n    plt.ylabel('cumulative reward')\n    plt.legend()\n\n    plt.savefig('../images/figure_8_5.png')\n    plt.close()\n\n# Check whether state-action values are already optimal\ndef check_path(q_values, maze):\n    # get the length of optimal path\n    # 14 is the length of optimal path of the original maze\n    # 1.2 means it's a relaxed optimal path\n    max_steps = 14 * maze.resolution * 1.2\n    state = maze.START_STATE\n    steps = 0\n    while state not in maze.GOAL_STATES:\n        action = np.argmax(q_values[state[0], state[1], :])\n        state, _ = maze.step(state, action)\n        steps += 1\n        if steps > max_steps:\n            return False\n    return True\n\n# Example 8.4, mazes with different resolution\ndef example_8_4():\n    # get the original 6 * 9 maze\n    original_maze = Maze()\n\n    # set up the parameters for each 
algorithm\n    params_dyna = DynaParams()\n    params_dyna.planning_steps = 5\n    params_dyna.alpha = 0.5\n    params_dyna.gamma = 0.95\n\n    params_prioritized = DynaParams()\n    params_prioritized.theta = 0.0001\n    params_prioritized.planning_steps = 5\n    params_prioritized.alpha = 0.5\n    params_prioritized.gamma = 0.95\n\n    params = [params_prioritized, params_dyna]\n\n    # set up models for planning\n    models = [PriorityModel, TrivialModel]\n    method_names = ['Prioritized Sweeping', 'Dyna-Q']\n\n    # due to limitation of my machine, I can only perform experiments for 5 mazes\n    # assuming the 1st maze has w * h states, then k-th maze has w * h * k * k states\n    num_of_mazes = 5\n\n    # build all the mazes\n    mazes = [original_maze.extend_maze(i) for i in range(1, num_of_mazes + 1)]\n    methods = [prioritized_sweeping, dyna_q]\n\n    # My machine cannot afford too many runs...\n    runs = 5\n\n    # track the # of backups\n    backups = np.zeros((runs, 2, num_of_mazes))\n\n    for run in range(0, runs):\n        for i in range(0, len(method_names)):\n            for mazeIndex, maze in zip(range(0, len(mazes)), mazes):\n                print('run %d, %s, maze size %d' % (run, method_names[i], maze.WORLD_HEIGHT * maze.WORLD_WIDTH))\n\n                # initialize the state action values\n                q_value = np.zeros(maze.q_size)\n\n                # track steps / backups for each episode\n                steps = []\n\n                # generate the model\n                model = models[i]()\n\n                # play for an episode\n                while True:\n                    steps.append(methods[i](q_value, model, maze, params[i]))\n\n                    # print best actions w.r.t. 
current state-action values\n                    # printActions(currentStateActionValues, maze)\n\n                    # check whether the (relaxed) optimal path is found\n                    if check_path(q_value, maze):\n                        break\n\n                # update the total steps / backups for this maze\n                backups[run, i, mazeIndex] = np.sum(steps)\n\n    backups = backups.mean(axis=0)\n\n    # Dyna-Q performs several backups per step\n    backups[1, :] *= params_dyna.planning_steps + 1\n\n    for i in range(0, len(method_names)):\n        plt.plot(np.arange(1, num_of_mazes + 1), backups[i, :], label=method_names[i])\n    plt.xlabel('maze resolution factor')\n    plt.ylabel('backups until optimal solution')\n    plt.yscale('log')\n    plt.legend()\n\n    plt.savefig('../images/example_8_4.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_8_2()\n    figure_8_4()\n    figure_8_5()\n    example_8_4()\n\n"
  },
  {
    "path": "chapter08/trajectory_sampling.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)                  #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\n\nmatplotlib.use('Agg')\n\n# 2 actions\nACTIONS = [0, 1]\n\n# each transition has a probability to terminate with 0\nTERMINATION_PROB = 0.1\n\n# maximum expected updates\nMAX_STEPS = 20000\n\n# epsilon greedy for behavior policy\nEPSILON = 0.1\n\n\n# break tie randomly\ndef argmax(value):\n    max_q = np.max(value)\n    return np.random.choice([a for a, q in enumerate(value) if q == max_q])\n\n\nclass Task:\n    # @n_states: number of non-terminal states\n    # @b: branch\n    # Each episode starts with state 0, and state n_states is a terminal state\n    def __init__(self, n_states, b):\n        self.n_states = n_states\n        self.b = b\n\n        # transition matrix, each state-action pair leads to b possible states\n        self.transition = np.random.randint(n_states, size=(n_states, len(ACTIONS), b))\n\n        # it is not clear how to set the reward, I use a unit normal distribution here\n        # reward is determined by (s, a, s')\n        self.reward = np.random.randn(n_states, len(ACTIONS), b)\n\n    def step(self, state, action):\n        if np.random.rand() < TERMINATION_PROB:\n            return self.n_states, 0\n        next_ = np.random.randint(self.b)\n        return self.transition[state, action, next_], self.reward[state, action, next_]\n\n\n# Evaluate the value of the start state for the greedy policy\n# derived from @q under the MDP @task\ndef evaluate_pi(q, task):\n    # use Monte Carlo method to 
estimate the state value\n    runs = 1000\n    returns = []\n    for r in range(runs):\n        rewards = 0\n        state = 0\n        while state < task.n_states:\n            action = argmax(q[state])\n            state, r = task.step(state, action)\n            rewards += r\n        returns.append(rewards)\n    return np.mean(returns)\n\n\n# perform expected update from a uniform state-action distribution of the MDP @task\n# evaluate the learned q value every @eval_interval steps\ndef uniform(task, eval_interval):\n    performance = []\n    q = np.zeros((task.n_states, 2))\n    for step in tqdm(range(MAX_STEPS)):\n        state = step // len(ACTIONS) % task.n_states\n        action = step % len(ACTIONS)\n\n        next_states = task.transition[state, action]\n        q[state, action] = (1 - TERMINATION_PROB) * np.mean(\n            task.reward[state, action] + np.max(q[next_states, :], axis=1))\n\n        if step % eval_interval == 0:\n            v_pi = evaluate_pi(q, task)\n            performance.append([step, v_pi])\n\n    return zip(*performance)\n\n\n# perform expected update from an on-policy distribution of the MDP @task\n# evaluate the learned q value every @eval_interval steps\ndef on_policy(task, eval_interval):\n    performance = []\n    q = np.zeros((task.n_states, 2))\n    state = 0\n    for step in tqdm(range(MAX_STEPS)):\n        if np.random.rand() < EPSILON:\n            action = np.random.choice(ACTIONS)\n        else:\n            action = argmax(q[state])\n\n        next_state, _ = task.step(state, action)\n\n        next_states = task.transition[state, action]\n        q[state, action] = (1 - TERMINATION_PROB) * np.mean(\n            task.reward[state, action] + np.max(q[next_states, :], axis=1))\n\n        if next_state == task.n_states:\n            next_state = 0\n        state = next_state\n\n        if step % eval_interval == 0:\n            v_pi = evaluate_pi(q, task)\n            performance.append([step, v_pi])\n\n    return 
zip(*performance)\n\n\ndef figure_8_8():\n    num_states = [1000, 10000]\n    branch = [1, 3, 10]\n    methods = [on_policy, uniform]\n\n    # average across 30 tasks\n    n_tasks = 30\n\n    # number of evaluation points\n    x_ticks = 100\n\n    plt.figure(figsize=(10, 20))\n    for i, n in enumerate(num_states):\n        plt.subplot(2, 1, i+1)\n        for b in branch:\n            tasks = [Task(n, b) for _ in range(n_tasks)]\n            for method in methods:\n                steps = None\n                value = []\n                for task in tasks:\n                    steps, v = method(task, MAX_STEPS / x_ticks)\n                    value.append(v)\n                value = np.mean(np.asarray(value), axis=0)\n                plt.plot(steps, value, label=f'b = {b}, {method.__name__}')\n        plt.title(f'{n} states')\n\n        plt.ylabel('value of start state')\n        plt.legend()\n\n    plt.subplot(2, 1, 2)\n    plt.xlabel('computation time, in expected updates')\n\n    plt.savefig('../images/figure_8_8.png')\n    plt.close()\n\n\nif __name__ == '__main__':\n    figure_8_8()\n"
  },
  {
    "path": "chapter09/random_walk.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\n\n# # of states except for terminal states\nN_STATES = 1000\n\n# all states\nSTATES = np.arange(1, N_STATES + 1)\n\n# start from a central state\nSTART_STATE = 500\n\n# terminal states\nEND_STATES = [0, N_STATES + 1]\n\n# possible actions\nACTION_LEFT = -1\nACTION_RIGHT = 1\nACTIONS = [ACTION_LEFT, ACTION_RIGHT]\n\n# maximum stride for an action\nSTEP_RANGE = 100\n\ndef compute_true_value():\n    # true state value, just a promising guess\n    true_value = np.arange(-1001, 1003, 2) / 1001.0\n\n    # Dynamic programming to find the true state values, based on the promising guess above\n    # Assume all rewards are 0, given that we have already given value -1 and 1 to terminal states\n    while True:\n        old_value = np.copy(true_value)\n        for state in STATES:\n            true_value[state] = 0\n            for action in ACTIONS:\n                for step in range(1, STEP_RANGE + 1):\n                    step *= action\n                    next_state = state + step\n                    next_state = max(min(next_state, N_STATES + 1), 0)\n                    # asynchronous update for faster convergence\n                    true_value[state] += 1.0 / (2 * STEP_RANGE) * true_value[next_state]\n        error = np.sum(np.abs(old_value - true_value))\n        if error < 1e-2:\n            break\n    # correct the state value for 
terminal states to 0\n    true_value[0] = true_value[-1] = 0\n\n    return true_value\n\n# take an @action at @state, return new state and reward for this transition\ndef step(state, action):\n    step = np.random.randint(1, STEP_RANGE + 1)\n    step *= action\n    state += step\n    state = max(min(state, N_STATES + 1), 0)\n    if state == 0:\n        reward = -1\n    elif state == N_STATES + 1:\n        reward = 1\n    else:\n        reward = 0\n    return state, reward\n\n# get an action, following random policy\ndef get_action():\n    if np.random.binomial(1, 0.5) == 1:\n        return 1\n    return -1\n\n# a wrapper class for aggregation value function\nclass ValueFunction:\n    # @num_of_groups: # of aggregations\n    def __init__(self, num_of_groups):\n        self.num_of_groups = num_of_groups\n        self.group_size = N_STATES // num_of_groups\n\n        # thetas\n        self.params = np.zeros(num_of_groups)\n\n    # get the value of @state\n    def value(self, state):\n        if state in END_STATES:\n            return 0\n        group_index = (state - 1) // self.group_size\n        return self.params[group_index]\n\n    # update parameters\n    # @delta: step size * (target - old estimation)\n    # @state: state of current sample\n    def update(self, delta, state):\n        group_index = (state - 1) // self.group_size\n        self.params[group_index] += delta\n\n# a wrapper class for tile coding value function\nclass TilingsValueFunction:\n    # @num_of_tilings: # of tilings\n    # @tileWidth: each tiling has several tiles, this parameter specifies the width of each tile\n    # @tilingOffset: specifies how tilings are put together\n    def __init__(self, numOfTilings, tileWidth, tilingOffset):\n        self.numOfTilings = numOfTilings\n        self.tileWidth = tileWidth\n        self.tilingOffset = tilingOffset\n\n        # To make sure that each sate is covered by same number of tiles,\n        # we need one more tile for each tiling\n        
self.tilingSize = N_STATES // tileWidth + 1\n\n        # weight for each tile\n        self.params = np.zeros((self.numOfTilings, self.tilingSize))\n\n        # For performance, only track the starting position for each tiling\n        # As we have one more tile for each tiling, the starting position will be negative\n        self.tilings = np.arange(-tileWidth + 1, 0, tilingOffset)\n\n    # get the value of @state\n    def value(self, state):\n        stateValue = 0.0\n        # go through all the tilings\n        for tilingIndex in range(0, len(self.tilings)):\n            # find the active tile in current tiling\n            tileIndex = (state - self.tilings[tilingIndex]) // self.tileWidth\n            stateValue += self.params[tilingIndex, tileIndex]\n        return stateValue\n\n    # update parameters\n    # @delta: step size * (target - old estimation)\n    # @state: state of current sample\n    def update(self, delta, state):\n\n        # each state is covered by same number of tilings\n        # so the delta should be divided equally into each tiling (tile)\n        delta /= self.numOfTilings\n\n        # go through all the tilings\n        for tilingIndex in range(0, len(self.tilings)):\n            # find the active tile in current tiling\n            tileIndex = (state - self.tilings[tilingIndex]) // self.tileWidth\n            self.params[tilingIndex, tileIndex] += delta\n\n# a wrapper class for polynomial / Fourier -based value function\nPOLYNOMIAL_BASES = 0\nFOURIER_BASES = 1\nclass BasesValueFunction:\n    # @order: # of bases, each function also has one more constant parameter (called bias in machine learning)\n    # @type: polynomial bases or Fourier bases\n    def __init__(self, order, type):\n        self.order = order\n        self.weights = np.zeros(order + 1)\n\n        # set up bases function\n        self.bases = []\n        if type == POLYNOMIAL_BASES:\n            for i in range(0, order + 1):\n                self.bases.append(lambda s, 
i=i: pow(s, i))\n        elif type == FOURIER_BASES:\n            for i in range(0, order + 1):\n                self.bases.append(lambda s, i=i: np.cos(i * np.pi * s))\n\n    # get the value of @state\n    def value(self, state):\n        # map the state space into [0, 1]\n        state /= float(N_STATES)\n        # get the feature vector\n        feature = np.asarray([func(state) for func in self.bases])\n        return np.dot(self.weights, feature)\n\n    def update(self, delta, state):\n        # map the state space into [0, 1]\n        state /= float(N_STATES)\n        # get derivative value\n        derivative_value = np.asarray([func(state) for func in self.bases])\n        self.weights += delta * derivative_value\n\n# gradient Monte Carlo algorithm\n# @value_function: an instance of class ValueFunction\n# @alpha: step size\n# @distribution: array to store the distribution statistics\ndef gradient_monte_carlo(value_function, alpha, distribution=None):\n    state = START_STATE\n    trajectory = [state]\n\n    # We assume gamma = 1, so return is just the same as the latest reward\n    reward = 0.0\n    while state not in END_STATES:\n        action = get_action()\n        next_state, reward = step(state, action)\n        trajectory.append(next_state)\n        state = next_state\n\n    # Gradient update for each state in this trajectory\n    for state in trajectory[:-1]:\n        delta = alpha * (reward - value_function.value(state))\n        value_function.update(delta, state)\n        if distribution is not None:\n            distribution[state] += 1\n\n# semi-gradient n-step TD algorithm\n# @valueFunction: an instance of class ValueFunction\n# @n: # of steps\n# @alpha: step size\ndef semi_gradient_temporal_difference(value_function, n, alpha):\n    # initial starting state\n    state = START_STATE\n\n    # arrays to store states and rewards for an episode\n    # space isn't a major consideration, so I didn't use the mod trick\n    states = [state]\n    
rewards = [0]\n\n    # track the time\n    time = 0\n\n    # the length of this episode\n    T = float('inf')\n    while True:\n        # go to next time step\n        time += 1\n\n        if time < T:\n            # choose an action randomly\n            action = get_action()\n            next_state, reward = step(state, action)\n\n            # store new state and new reward\n            states.append(next_state)\n            rewards.append(reward)\n\n            if next_state in END_STATES:\n                T = time\n\n        # get the time of the state to update\n        update_time = time - n\n        if update_time >= 0:\n            returns = 0.0\n            # calculate corresponding rewards\n            for t in range(update_time + 1, min(T, update_time + n) + 1):\n                returns += rewards[t]\n            # add state value to the return\n            if update_time + n <= T:\n                returns += value_function.value(states[update_time + n])\n            state_to_update = states[update_time]\n            # update the value function\n            if not state_to_update in END_STATES:\n                delta = alpha * (returns - value_function.value(state_to_update))\n                value_function.update(delta, state_to_update)\n        if update_time == T - 1:\n            break\n        state = next_state\n\n# Figure 9.1, gradient Monte Carlo algorithm\ndef figure_9_1(true_value):\n    episodes = int(1e5)\n    alpha = 2e-5\n\n    # we have 10 aggregations in this example, each has 100 states\n    value_function = ValueFunction(10)\n    distribution = np.zeros(N_STATES + 2)\n    for ep in tqdm(range(episodes)):\n        gradient_monte_carlo(value_function, alpha, distribution)\n\n    distribution /= np.sum(distribution)\n    state_values = [value_function.value(i) for i in STATES]\n\n    plt.figure(figsize=(10, 20))\n\n    plt.subplot(2, 1, 1)\n    plt.plot(STATES, state_values, label='Approximate MC value')\n    plt.plot(STATES, 
true_value[1: -1], label='True value')\n    plt.xlabel('State')\n    plt.ylabel('Value')\n    plt.legend()\n\n    plt.subplot(2, 1, 2)\n    plt.plot(STATES, distribution[1: -1], label='State distribution')\n    plt.xlabel('State')\n    plt.ylabel('Distribution')\n    plt.legend()\n\n    plt.savefig('../images/figure_9_1.png')\n    plt.close()\n\n# semi-gradient TD on 1000-state random walk\ndef figure_9_2_left(true_value):\n    episodes = int(1e5)\n    alpha = 2e-4\n    value_function = ValueFunction(10)\n    for ep in tqdm(range(episodes)):\n        semi_gradient_temporal_difference(value_function, 1, alpha)\n\n    stateValues = [value_function.value(i) for i in STATES]\n    plt.plot(STATES, stateValues, label='Approximate TD value')\n    plt.plot(STATES, true_value[1: -1], label='True value')\n    plt.xlabel('State')\n    plt.ylabel('Value')\n    plt.legend()\n\n# different alphas and steps for semi-gradient TD\ndef figure_9_2_right(true_value):\n    # all possible steps\n    steps = np.power(2, np.arange(0, 10))\n\n    # all possible alphas\n    alphas = np.arange(0, 1.1, 0.1)\n\n    # each run has 10 episodes\n    episodes = 10\n\n    # perform 100 independent runs\n    runs = 100\n\n    # track the errors for each (step, alpha) combination\n    errors = np.zeros((len(steps), len(alphas)))\n    for run in tqdm(range(runs)):\n        for step_ind, step in zip(range(len(steps)), steps):\n            for alpha_ind, alpha in zip(range(len(alphas)), alphas):\n                # we have 20 aggregations in this example\n                value_function = ValueFunction(20)\n                for ep in range(0, episodes):\n                    semi_gradient_temporal_difference(value_function, step, alpha)\n                    # calculate the RMS error\n                    state_value = np.asarray([value_function.value(i) for i in STATES])\n                    errors[step_ind, alpha_ind] += np.sqrt(np.sum(np.power(state_value - true_value[1: -1], 2)) / N_STATES)\n    # take 
average\n    errors /= episodes * runs\n    # truncate the error\n    for i in range(len(steps)):\n        plt.plot(alphas, errors[i, :], label='n = ' + str(steps[i]))\n    plt.xlabel('alpha')\n    plt.ylabel('RMS error')\n    plt.ylim([0.25, 0.55])\n    plt.legend()\n\ndef figure_9_2(true_value):\n    plt.figure(figsize=(10, 20))\n    plt.subplot(2, 1, 1)\n    figure_9_2_left(true_value)\n    plt.subplot(2, 1, 2)\n    figure_9_2_right(true_value)\n\n    plt.savefig('../images/figure_9_2.png')\n    plt.close()\n\n# Figure 9.5, Fourier basis and polynomials\ndef figure_9_5(true_value):\n    # my machine can only afford 1 run\n    runs = 1\n\n    episodes = 5000\n\n    # # of bases\n    orders = [5, 10, 20]\n\n    alphas = [1e-4, 5e-5]\n    labels = [['polynomial basis'] * 3, ['fourier basis'] * 3]\n\n    # track errors for each episode\n    errors = np.zeros((len(alphas), len(orders), episodes))\n    for run in range(runs):\n        for i in range(len(orders)):\n            value_functions = [BasesValueFunction(orders[i], POLYNOMIAL_BASES), BasesValueFunction(orders[i], FOURIER_BASES)]\n            for j in range(len(value_functions)):\n                for episode in tqdm(range(episodes)):\n\n                    # gradient Monte Carlo algorithm\n                    gradient_monte_carlo(value_functions[j], alphas[j])\n\n                    # get state values under current value function\n                    state_values = [value_functions[j].value(state) for state in STATES]\n\n                    # get the root-mean-squared error\n                    errors[j, i, episode] += np.sqrt(np.mean(np.power(true_value[1: -1] - state_values, 2)))\n\n    # average over independent runs\n    errors /= runs\n\n    for i in range(len(alphas)):\n        for j in range(len(orders)):\n            plt.plot(errors[i, j, :], label='%s order = %d' % (labels[i][j], orders[j]))\n    plt.xlabel('Episodes')\n    # The book plots RMSVE, which is RMSE weighted by a state distribution\n    
plt.ylabel('RMSE')\n    plt.legend()\n\n    plt.savefig('../images/figure_9_5.png')\n    plt.close()\n\n# Figure 9.10, it will take quite a while\ndef figure_9_10(true_value):\n\n    # My machine can only afford one run, thus the curve isn't so smooth\n    runs = 1\n\n    # number of episodes\n    episodes = 5000\n\n    num_of_tilings = 50\n\n    # each tile will cover 200 states\n    tile_width = 200\n\n    # how to put so many tilings\n    tiling_offset = 4\n\n    labels = ['tile coding (50 tilings)', 'state aggregation (one tiling)']\n\n    # track errors for each episode\n    errors = np.zeros((len(labels), episodes))\n    for run in range(runs):\n        # initialize value functions for multiple tilings and single tiling\n        value_functions = [TilingsValueFunction(num_of_tilings, tile_width, tiling_offset),\n                         ValueFunction(N_STATES // tile_width)]\n        for i in range(len(value_functions)):\n            for episode in tqdm(range(episodes)):\n                # I use a changing alpha according to the episode instead of a small fixed alpha\n                # With a small fixed alpha, I don't think 5000 episodes is enough for so many\n                # parameters in multiple tilings.\n                # The asymptotic performance for single tiling stays unchanged under a changing alpha,\n                # however the asymptotic performance for multiple tilings improves significantly\n                alpha = 1.0 / (episode + 1)\n\n                # gradient Monte Carlo algorithm\n                gradient_monte_carlo(value_functions[i], alpha)\n\n                # get state values under current value function\n                state_values = [value_functions[i].value(state) for state in STATES]\n\n                # get the root-mean-squared error\n                errors[i][episode] += np.sqrt(np.mean(np.power(true_value[1: -1] - state_values, 2)))\n\n    # average over independent runs\n    errors /= runs\n\n    for i in range(0, 
len(labels)):\n        plt.plot(errors[i], label=labels[i])\n    plt.xlabel('Episodes')\n    # The book plots RMSVE, which is RMSE weighted by a state distribution\n    plt.ylabel('RMSE')\n    plt.legend()\n\n    plt.savefig('../images/figure_9_10.png')\n    plt.close()\n\nif __name__ == '__main__':\n    true_value = compute_true_value()\n\n    figure_9_1(true_value)\n    figure_9_2(true_value)\n    figure_9_5(true_value)\n    figure_9_10(true_value)\n"
  },
  {
    "path": "chapter09/square_wave.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\n\n# wrapper class for an interval\n# readability is more important than efficiency, so I won't use many tricks\nclass Interval:\n    # [@left, @right)\n    def __init__(self, left, right):\n        self.left = left\n        self.right = right\n\n    # whether a point is in this interval\n    def contain(self, x):\n        return self.left <= x < self.right\n\n    # length of this interval\n    def size(self):\n        return self.right - self.left\n\n# domain of the square wave, [0, 2)\nDOMAIN = Interval(0.0, 2.0)\n\n# square wave function\ndef square_wave(x):\n    if 0.5 < x < 1.5:\n        return 1\n    return 0\n\n# get @n samples randomly from the square wave\ndef sample(n):\n    samples = []\n    for i in range(0, n):\n        x = np.random.uniform(DOMAIN.left, DOMAIN.right)\n        y = square_wave(x)\n        samples.append([x, y])\n    return samples\n\n# wrapper class for value function\nclass ValueFunction:\n    # @domain: domain of this function, an instance of Interval\n    # @alpha: basic step size for one update\n    def __init__(self, feature_width, domain=DOMAIN, alpha=0.2, num_of_features=50):\n        self.feature_width = feature_width\n        self.num_of_features = num_of_features\n        self.features = []\n        self.alpha = alpha\n        self.domain = domain\n\n        # there are many ways to place 
those feature windows,\n        # following is just one possible way\n        step = (domain.size() - feature_width) / (num_of_features - 1)\n        left = domain.left\n        for i in range(0, num_of_features - 1):\n            self.features.append(Interval(left, left + feature_width))\n            left += step\n        self.features.append(Interval(left, domain.right))\n\n        # initialize weight for each feature\n        self.weights = np.zeros(num_of_features)\n\n    # for point @x, return the indices of corresponding feature windows\n    def get_active_features(self, x):\n        active_features = []\n        for i in range(0, len(self.features)):\n            if self.features[i].contain(x):\n                active_features.append(i)\n        return active_features\n\n    # estimate the value for point @x\n    def value(self, x):\n        active_features = self.get_active_features(x)\n        return np.sum(self.weights[active_features])\n\n    # update weights given sample of point @x\n    # @delta: y - x\n    def update(self, delta, x):\n        active_features = self.get_active_features(x)\n        delta *= self.alpha / len(active_features)\n        for index in active_features:\n            self.weights[index] += delta\n\n# train @value_function with a set of samples @samples\ndef approximate(samples, value_function):\n    for x, y in samples:\n        delta = y - value_function.value(x)\n        value_function.update(delta, x)\n\n# Figure 9.8\ndef figure_9_8():\n    num_of_samples = [10, 40, 160, 640, 2560, 10240]\n    feature_widths = [0.2, 0.4, 1.0]\n    plt.figure(figsize=(30, 20))\n    axis_x = np.arange(DOMAIN.left, DOMAIN.right, 0.02)\n    for index, num_of_sample in enumerate(num_of_samples):\n        print(num_of_sample, 'samples')\n        samples = sample(num_of_sample)\n        value_functions = [ValueFunction(feature_width) for feature_width in feature_widths]\n        plt.subplot(2, 3, index + 1)\n        plt.title('%d samples' % 
(num_of_sample))\n        for value_function in value_functions:\n            approximate(samples, value_function)\n            values = [value_function.value(x) for x in axis_x]\n            plt.plot(axis_x, values, label='feature width %.01f' % (value_function.feature_width))\n        plt.legend()\n\n    plt.savefig('../images/figure_9_8.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_9_8()"
  },
  {
    "path": "chapter10/access_control.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\nfrom mpl_toolkits.mplot3d.axes3d import Axes3D\nfrom math import floor\nimport seaborn as sns\n\n#######################################################################\n# Following are some utilities for tile coding from Rich.\n# To make each file self-contained, I copied them from\n# http://incompleteideas.net/tiles/tiles3.py-remove\n# with some naming convention changes\n#\n# Tile coding starts\nclass IHT:\n    \"Structure to handle collisions\"\n    def __init__(self, size_val):\n        self.size = size_val\n        self.overfull_count = 0\n        self.dictionary = {}\n\n    def count(self):\n        return len(self.dictionary)\n\n    def full(self):\n        return len(self.dictionary) >= self.size\n\n    def get_index(self, obj, read_only=False):\n        d = self.dictionary\n        if obj in d:\n            return d[obj]\n        elif read_only:\n            return None\n        size = self.size\n        count = self.count()\n        if count >= size:\n            if self.overfull_count == 0: print('IHT full, starting to allow collisions')\n            self.overfull_count += 1\n            return hash(obj) % self.size\n        else:\n            d[obj] = count\n            return count\n\ndef hash_coords(coordinates, m, read_only=False):\n    if isinstance(m, IHT): return m.get_index(tuple(coordinates), read_only)\n    
if isinstance(m, int): return hash(tuple(coordinates)) % m\n    if m is None: return coordinates\n\ndef tiles(iht_or_size, num_tilings, floats, ints=None, read_only=False):\n    \"\"\"returns num-tilings tile indices corresponding to the floats and ints\"\"\"\n    if ints is None:\n        ints = []\n    qfloats = [floor(f * num_tilings) for f in floats]\n    tiles = []\n    for tiling in range(num_tilings):\n        tilingX2 = tiling * 2\n        coords = [tiling]\n        b = tiling\n        for q in qfloats:\n            coords.append((q + b) // num_tilings)\n            b += tilingX2\n        coords.extend(ints)\n        tiles.append(hash_coords(coords, iht_or_size, read_only))\n    return tiles\n# Tile coding ends\n#######################################################################\n\n# possible priorities\nPRIORITIES = np.arange(0, 4)\n# reward for each priority\nREWARDS = np.power(2, np.arange(0, 4))\n\n# possible actions\nREJECT = 0\nACCEPT = 1\nACTIONS = [REJECT, ACCEPT]\n\n# total number of servers\nNUM_OF_SERVERS = 10\n\n# at each time step, a busy server will be free w.p. 
0.06\nPROBABILITY_FREE = 0.06\n\n# step size for learning state-action value\nALPHA = 0.01\n\n# step size for learning average reward\nBETA = 0.01\n\n# probability for exploration\nEPSILON = 0.1\n\n# a wrapper class for differential semi-gradient Sarsa state-action function\nclass ValueFunction:\n    # In this example I use the tiling software instead of implementing standard tiling by myself\n    # One important thing is that tiling is only a map from (state, action) to a series of indices\n    # It doesn't matter whether the indices have meaning, only if this map satisfy some property\n    # View the following webpage for more information\n    # http://incompleteideas.net/sutton/tiles/tiles3.html\n    # @alpha: step size for learning state-action value\n    # @beta: step size for learning average reward\n    def __init__(self, num_of_tilings, alpha=ALPHA, beta=BETA):\n        self.num_of_tilings = num_of_tilings\n        self.max_size = 2048\n        self.hash_table = IHT(self.max_size)\n        self.weights = np.zeros(self.max_size)\n\n        # state features needs scaling to satisfy the tile software\n        self.server_scale = self.num_of_tilings / float(NUM_OF_SERVERS)\n        self.priority_scale = self.num_of_tilings / float(len(PRIORITIES) - 1)\n\n        self.average_reward = 0.0\n\n        # divide step size equally to each tiling\n        self.alpha = alpha / self.num_of_tilings\n\n        self.beta = beta\n\n    # get indices of active tiles for given state and action\n    def get_active_tiles(self, free_servers, priority, action):\n        active_tiles = tiles(self.hash_table, self.num_of_tilings,\n                            [self.server_scale * free_servers, self.priority_scale * priority],\n                            [action])\n        return active_tiles\n\n    # estimate the value of given state and action without subtracting average\n    def value(self, free_servers, priority, action):\n        active_tiles = 
self.get_active_tiles(free_servers, priority, action)\n        return np.sum(self.weights[active_tiles])\n\n    # estimate the value of given state without subtracting average\n    def state_value(self, free_servers, priority):\n        values = [self.value(free_servers, priority, action) for action in ACTIONS]\n        # if no free server, can't accept\n        if free_servers == 0:\n            return values[REJECT]\n        return np.max(values)\n\n    # learn with given sequence\n    def learn(self, free_servers, priority, action, new_free_servers, new_priority, new_action, reward):\n        active_tiles = self.get_active_tiles(free_servers, priority, action)\n        estimation = np.sum(self.weights[active_tiles])\n        delta = reward - self.average_reward + self.value(new_free_servers, new_priority, new_action) - estimation\n        # update average reward\n        self.average_reward += self.beta * delta\n        delta *= self.alpha\n        for active_tile in active_tiles:\n            self.weights[active_tile] += delta\n\n# get action based on epsilon greedy policy and @valueFunction\ndef get_action(free_servers, priority, value_function):\n    # if no free server, can't accept\n    if free_servers == 0:\n        return REJECT\n    if np.random.binomial(1, EPSILON) == 1:\n        return np.random.choice(ACTIONS)\n    values = [value_function.value(free_servers, priority, action) for action in ACTIONS]\n    return np.random.choice([action_ for action_, value_ in enumerate(values) if value_ == np.max(values)])\n\n# take an action\ndef take_action(free_servers, priority, action):\n    if free_servers > 0 and action == ACCEPT:\n        free_servers -= 1\n    reward = REWARDS[priority] * action\n    # some busy servers may become free\n    busy_servers = NUM_OF_SERVERS - free_servers\n    free_servers += np.random.binomial(busy_servers, PROBABILITY_FREE)\n    return free_servers, np.random.choice(PRIORITIES), reward\n\n# differential semi-gradient Sarsa\n# 
@value_function: state-action value function to learn\n# @max_steps: step limit in the continuing task\ndef differential_semi_gradient_sarsa(value_function, max_steps):\n    current_free_servers = NUM_OF_SERVERS\n    current_priority = np.random.choice(PRIORITIES)\n    current_action = get_action(current_free_servers, current_priority, value_function)\n    # track how often each number of free servers is visited\n    freq = np.zeros(NUM_OF_SERVERS + 1)\n\n    for _ in tqdm(range(max_steps)):\n        freq[current_free_servers] += 1\n        new_free_servers, new_priority, reward = take_action(current_free_servers, current_priority, current_action)\n        new_action = get_action(new_free_servers, new_priority, value_function)\n        value_function.learn(current_free_servers, current_priority, current_action,\n                             new_free_servers, new_priority, new_action, reward)\n        current_free_servers = new_free_servers\n        current_priority = new_priority\n        current_action = new_action\n    print('Frequency of number of free servers:')\n    print(freq / max_steps)\n\n# Figure 10.5, Differential semi-gradient Sarsa on the access-control queuing task\ndef figure_10_5():\n    max_steps = int(1e6)\n    # use tile coding with 8 tilings\n    num_of_tilings = 8\n    value_function = ValueFunction(num_of_tilings)\n    differential_semi_gradient_sarsa(value_function, max_steps)\n    values = np.zeros((len(PRIORITIES), NUM_OF_SERVERS + 1))\n    for priority in PRIORITIES:\n        for free_servers in range(NUM_OF_SERVERS + 1):\n            values[priority, free_servers] = value_function.state_value(free_servers, priority)\n\n    fig = plt.figure(figsize=(10, 20))\n    plt.subplot(2, 1, 1)\n    for priority in PRIORITIES:\n        plt.plot(range(NUM_OF_SERVERS + 1), values[priority, :], label='priority %d' % (REWARDS[priority]))\n    plt.xlabel('Number of free servers')\n    plt.ylabel('Differential value of best action')\n    plt.legend()\n\n    ax = 
fig.add_subplot(2, 1, 2)\n    policy = np.zeros((len(PRIORITIES), NUM_OF_SERVERS + 1))\n    for priority in PRIORITIES:\n        for free_servers in range(NUM_OF_SERVERS + 1):\n            values = [value_function.value(free_servers, priority, action) for action in ACTIONS]\n            if free_servers == 0:\n                policy[priority, free_servers] = REJECT\n            else:\n                policy[priority, free_servers] = np.argmax(values)\n\n    fig = sns.heatmap(policy, cmap=\"YlGnBu\", ax=ax, xticklabels=range(NUM_OF_SERVERS + 1), yticklabels=PRIORITIES)\n    fig.set_title('Policy (0 Reject, 1 Accept)')\n    fig.set_xlabel('Number of free servers')\n    fig.set_ylabel('Priority')\n\n    plt.savefig('../images/figure_10_5.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_10_5()\n"
  },
  {
    "path": "chapter10/mountain_car.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\nfrom mpl_toolkits.mplot3d.axes3d import Axes3D\nfrom math import floor\n\n#######################################################################\n# Following are some utilities for tile coding from Rich.\n# To make each file self-contained, I copied them from\n# http://incompleteideas.net/tiles/tiles3.py-remove\n# with some naming convention changes\n#\n# Tile coding starts\nclass IHT:\n    \"Structure to handle collisions\"\n    def __init__(self, size_val):\n        self.size = size_val\n        self.overfull_count = 0\n        self.dictionary = {}\n\n    def count(self):\n        return len(self.dictionary)\n\n    def full(self):\n        return len(self.dictionary) >= self.size\n\n    def get_index(self, obj, read_only=False):\n        d = self.dictionary\n        if obj in d:\n            return d[obj]\n        elif read_only:\n            return None\n        size = self.size\n        count = self.count()\n        if count >= size:\n            if self.overfull_count == 0: print('IHT full, starting to allow collisions')\n            self.overfull_count += 1\n            return hash(obj) % self.size\n        else:\n            d[obj] = count\n            return count\n\ndef hash_coords(coordinates, m, read_only=False):\n    if isinstance(m, IHT): return m.get_index(tuple(coordinates), read_only)\n    if isinstance(m, int): 
return hash(tuple(coordinates)) % m\n    if m is None: return coordinates\n\ndef tiles(iht_or_size, num_tilings, floats, ints=None, read_only=False):\n    \"\"\"returns num-tilings tile indices corresponding to the floats and ints\"\"\"\n    if ints is None:\n        ints = []\n    qfloats = [floor(f * num_tilings) for f in floats]\n    tiles = []\n    for tiling in range(num_tilings):\n        tilingX2 = tiling * 2\n        coords = [tiling]\n        b = tiling\n        for q in qfloats:\n            coords.append((q + b) // num_tilings)\n            b += tilingX2\n        coords.extend(ints)\n        tiles.append(hash_coords(coords, iht_or_size, read_only))\n    return tiles\n# Tile coding ends\n#######################################################################\n\n# all possible actions\nACTION_REVERSE = -1\nACTION_ZERO = 0\nACTION_FORWARD = 1\n# order is important\nACTIONS = [ACTION_REVERSE, ACTION_ZERO, ACTION_FORWARD]\n\n# bound for position and velocity\nPOSITION_MIN = -1.2\nPOSITION_MAX = 0.5\nVELOCITY_MIN = -0.07\nVELOCITY_MAX = 0.07\n\n# use optimistic initial value, so it's ok to set epsilon to 0\nEPSILON = 0\n\n# take an @action at @position and @velocity\n# @return: new position, new velocity, reward (always -1)\ndef step(position, velocity, action):\n    new_velocity = velocity + 0.001 * action - 0.0025 * np.cos(3 * position)\n    new_velocity = min(max(VELOCITY_MIN, new_velocity), VELOCITY_MAX)\n    new_position = position + new_velocity\n    new_position = min(max(POSITION_MIN, new_position), POSITION_MAX)\n    reward = -1.0\n    if new_position == POSITION_MIN:\n        new_velocity = 0.0\n    return new_position, new_velocity, reward\n\n# wrapper class for state action value function\nclass ValueFunction:\n    # In this example I use the tiling software instead of implementing standard tiling by myself\n    # One important thing is that tiling is only a map from (state, action) to a series of indices\n    # It doesn't matter whether the 
indices have meaning, only if this map satisfy some property\n    # View the following webpage for more information\n    # http://incompleteideas.net/sutton/tiles/tiles3.html\n    # @max_size: the maximum # of indices\n    def __init__(self, step_size, num_of_tilings=8, max_size=2048):\n        self.max_size = max_size\n        self.num_of_tilings = num_of_tilings\n\n        # divide step size equally to each tiling\n        self.step_size = step_size / num_of_tilings\n\n        self.hash_table = IHT(max_size)\n\n        # weight for each tile\n        self.weights = np.zeros(max_size)\n\n        # position and velocity needs scaling to satisfy the tile software\n        self.position_scale = self.num_of_tilings / (POSITION_MAX - POSITION_MIN)\n        self.velocity_scale = self.num_of_tilings / (VELOCITY_MAX - VELOCITY_MIN)\n\n    # get indices of active tiles for given state and action\n    def get_active_tiles(self, position, velocity, action):\n        # I think positionScale * (position - position_min) would be a good normalization.\n        # However positionScale * position_min is a constant, so it's ok to ignore it.\n        active_tiles = tiles(self.hash_table, self.num_of_tilings,\n                            [self.position_scale * position, self.velocity_scale * velocity],\n                            [action])\n        return active_tiles\n\n    # estimate the value of given state and action\n    def value(self, position, velocity, action):\n        if position == POSITION_MAX:\n            return 0.0\n        active_tiles = self.get_active_tiles(position, velocity, action)\n        return np.sum(self.weights[active_tiles])\n\n    # learn with given state, action and target\n    def learn(self, position, velocity, action, target):\n        active_tiles = self.get_active_tiles(position, velocity, action)\n        estimation = np.sum(self.weights[active_tiles])\n        delta = self.step_size * (target - estimation)\n        for active_tile in 
active_tiles:\n            self.weights[active_tile] += delta\n\n    # get the estimated # of steps to reach the goal under the current state-action value function\n    def cost_to_go(self, position, velocity):\n        costs = []\n        for action in ACTIONS:\n            costs.append(self.value(position, velocity, action))\n        return -np.max(costs)\n\n# get action at @position and @velocity based on epsilon greedy policy and @value_function\ndef get_action(position, velocity, value_function):\n    if np.random.binomial(1, EPSILON) == 1:\n        return np.random.choice(ACTIONS)\n    values = []\n    for action in ACTIONS:\n        values.append(value_function.value(position, velocity, action))\n    # break ties randomly; subtract 1 to map the list index (0, 1, 2) to the action (-1, 0, 1)\n    return np.random.choice([action_ for action_, value_ in enumerate(values) if value_ == np.max(values)]) - 1\n\n# semi-gradient n-step Sarsa\n# @value_function: state-action value function to learn\n# @n: # of steps\ndef semi_gradient_n_step_sarsa(value_function, n=1):\n    # start at a random position around the bottom of the valley\n    current_position = np.random.uniform(-0.6, -0.4)\n    # initial velocity is 0\n    current_velocity = 0.0\n    # get initial action\n    current_action = get_action(current_position, current_velocity, value_function)\n\n    # track previous position, velocity, action and reward\n    positions = [current_position]\n    velocities = [current_velocity]\n    actions = [current_action]\n    rewards = [0.0]\n\n    # track the time\n    time = 0\n\n    # the length of this episode\n    T = float('inf')\n    while True:\n        # go to next time step\n        time += 1\n\n        if time < T:\n            # take current action and go to the new state\n            new_position, new_velocity, reward = step(current_position, current_velocity, current_action)\n            # choose new action\n            new_action = get_action(new_position, new_velocity, value_function)\n\n            # track new state and action\n            positions.append(new_position)\n            
velocities.append(new_velocity)\n            actions.append(new_action)\n            rewards.append(reward)\n\n            if new_position == POSITION_MAX:\n                T = time\n\n        # get the time of the state to update\n        update_time = time - n\n        if update_time >= 0:\n            returns = 0.0\n            # calculate corresponding rewards\n            for t in range(update_time + 1, min(T, update_time + n) + 1):\n                returns += rewards[t]\n            # add estimated state action value to the return\n            if update_time + n <= T:\n                returns += value_function.value(positions[update_time + n],\n                                                velocities[update_time + n],\n                                                actions[update_time + n])\n            # update the state-action value function\n            if positions[update_time] != POSITION_MAX:\n                value_function.learn(positions[update_time], velocities[update_time], actions[update_time], returns)\n        if update_time == T - 1:\n            break\n        current_position = new_position\n        current_velocity = new_velocity\n        current_action = new_action\n\n    # the loop runs n - 1 extra iterations after termination to flush pending updates,\n    # so return the episode length T rather than the loop counter (time == T + n - 1 here)\n    return T\n\n# print learned cost to go\ndef print_cost(value_function, episode, ax):\n    grid_size = 40\n    positions = np.linspace(POSITION_MIN, POSITION_MAX, grid_size)\n    # positionStep = (POSITION_MAX - POSITION_MIN) / grid_size\n    # positions = np.arange(POSITION_MIN, POSITION_MAX + positionStep, positionStep)\n    # velocityStep = (VELOCITY_MAX - VELOCITY_MIN) / grid_size\n    # velocities = np.arange(VELOCITY_MIN, VELOCITY_MAX + velocityStep, velocityStep)\n    velocities = np.linspace(VELOCITY_MIN, VELOCITY_MAX, grid_size)\n    axis_x = []\n    axis_y = []\n    axis_z = []\n    for position in positions:\n        for velocity in velocities:\n            axis_x.append(position)\n            axis_y.append(velocity)\n            
axis_z.append(value_function.cost_to_go(position, velocity))\n\n    ax.scatter(axis_x, axis_y, axis_z)\n    ax.set_xlabel('Position')\n    ax.set_ylabel('Velocity')\n    ax.set_zlabel('Cost to go')\n    ax.set_title('Episode %d' % (episode + 1))\n\n# Figure 10.1, cost to go in a single run\ndef figure_10_1():\n    episodes = 9000\n    plot_episodes = [0, 99, episodes - 1]\n    fig = plt.figure(figsize=(40, 10))\n    axes = [fig.add_subplot(1, len(plot_episodes), i+1, projection='3d') for i in range(len(plot_episodes))]\n    num_of_tilings = 8\n    alpha = 0.3\n    value_function = ValueFunction(alpha, num_of_tilings)\n    for ep in tqdm(range(episodes)):\n        semi_gradient_n_step_sarsa(value_function)\n        if ep in plot_episodes:\n            print_cost(value_function, ep, axes[plot_episodes.index(ep)])\n\n    plt.savefig('../images/figure_10_1.png')\n    plt.close()\n\n# Figure 10.2, semi-gradient Sarsa with different alphas\ndef figure_10_2():\n    runs = 10\n    episodes = 500\n    num_of_tilings = 8\n    alphas = [0.1, 0.2, 0.5]\n\n    steps = np.zeros((len(alphas), episodes))\n    for run in range(runs):\n        value_functions = [ValueFunction(alpha, num_of_tilings) for alpha in alphas]\n        for index in range(len(value_functions)):\n            for episode in tqdm(range(episodes)):\n                step = semi_gradient_n_step_sarsa(value_functions[index])\n                steps[index, episode] += step\n\n    steps /= runs\n\n    for i in range(0, len(alphas)):\n        plt.plot(steps[i], label='alpha = '+str(alphas[i])+'/'+str(num_of_tilings))\n    plt.xlabel('Episode')\n    plt.ylabel('Steps per episode')\n    plt.yscale('log')\n    plt.legend()\n\n    plt.savefig('../images/figure_10_2.png')\n    plt.close()\n\n# Figure 10.3, one-step semi-gradient Sarsa vs multi-step semi-gradient Sarsa\ndef figure_10_3():\n    runs = 10\n    episodes = 500\n    num_of_tilings = 8\n    alphas = [0.5, 0.3]\n    n_steps = [1, 8]\n\n    steps = 
np.zeros((len(alphas), episodes))\n    for run in range(runs):\n        value_functions = [ValueFunction(alpha, num_of_tilings) for alpha in alphas]\n        for index in range(len(value_functions)):\n            for episode in tqdm(range(episodes)):\n                step = semi_gradient_n_step_sarsa(value_functions[index], n_steps[index])\n                steps[index, episode] += step\n\n    steps /= runs\n\n    for i in range(0, len(alphas)):\n        plt.plot(steps[i], label='n = %d' % (n_steps[i]))\n    plt.xlabel('Episode')\n    plt.ylabel('Steps per episode')\n    plt.yscale('log')\n    plt.legend()\n\n    plt.savefig('../images/figure_10_3.png')\n    plt.close()\n\n# Figure 10.4, effect of alpha and n on multi-step semi-gradient Sarsa\ndef figure_10_4():\n    alphas = np.arange(0.25, 1.75, 0.25)\n    n_steps = np.power(2, np.arange(0, 5))\n    episodes = 50\n    runs = 5\n\n    max_steps = 300\n    steps = np.zeros((len(n_steps), len(alphas)))\n    for run in range(runs):\n        for n_step_index, n_step in enumerate(n_steps):\n            for alpha_index, alpha in enumerate(alphas):\n                if (n_step == 8 and alpha > 1) or \\\n                        (n_step == 16 and alpha > 0.75):\n                    # These settings do not converge, so skip training and charge max_steps per episode\n                    steps[n_step_index, alpha_index] += max_steps * episodes\n                    continue\n                value_function = ValueFunction(alpha)\n                for episode in tqdm(range(episodes)):\n                    step = semi_gradient_n_step_sarsa(value_function, n_step)\n                    steps[n_step_index, alpha_index] += step\n\n    # average over independent runs and episodes\n    steps /= runs * episodes\n\n    for i in range(0, len(n_steps)):\n        plt.plot(alphas, steps[i, :], label='n = '+str(n_steps[i]))\n    plt.xlabel('alpha * number of tilings(8)')\n    plt.ylabel('Steps per episode')\n    plt.ylim([220, max_steps])\n    plt.legend()\n\n    
plt.savefig('../images/figure_10_4.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_10_1()\n    figure_10_2()\n    figure_10_3()\n    figure_10_4()\n"
  },
  {
    "path": "chapter11/counterexample.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016 - 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)           #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\nfrom mpl_toolkits.mplot3d.axes3d import Axes3D\n\n# all states: state 0-5 are upper states\nSTATES = np.arange(0, 7)\n# state 6 is lower state\nLOWER_STATE = 6\n# discount factor\nDISCOUNT = 0.99\n\n# each state is represented by a vector of length 8\nFEATURE_SIZE = 8\nFEATURES = np.zeros((len(STATES), FEATURE_SIZE))\nfor i in range(LOWER_STATE):\n    FEATURES[i, i] = 2\n    FEATURES[i, 7] = 1\nFEATURES[LOWER_STATE, 6] = 1\nFEATURES[LOWER_STATE, 7] = 2\n\n# all possible actions\nDASHED = 0\nSOLID = 1\nACTIONS = [DASHED, SOLID]\n\n# reward is always zero\nREWARD = 0\n\n# take @action at @state, return the new state\ndef step(state, action):\n    if action == SOLID:\n        return LOWER_STATE\n    return np.random.choice(STATES[: LOWER_STATE])\n\n# target policy\ndef target_policy(state):\n    return SOLID\n\n# state distribution for the behavior policy\nSTATE_DISTRIBUTION = np.ones(len(STATES)) / 7\nSTATE_DISTRIBUTION_MAT = np.matrix(np.diag(STATE_DISTRIBUTION))\n# projection matrix for minimize MSVE\nPROJECTION_MAT = np.matrix(FEATURES) * \\\n                 np.linalg.pinv(np.matrix(FEATURES.T) * STATE_DISTRIBUTION_MAT * np.matrix(FEATURES)) * \\\n                 np.matrix(FEATURES.T) * \\\n                 STATE_DISTRIBUTION_MAT\n\n# behavior policy\nBEHAVIOR_SOLID_PROBABILITY = 1.0 / 7\ndef behavior_policy(state):\n    if 
np.random.binomial(1, BEHAVIOR_SOLID_PROBABILITY) == 1:\n        return SOLID\n    return DASHED\n\n# Semi-gradient off-policy temporal difference\n# @state: current state\n# @theta: weight for each component of the feature vector\n# @alpha: step size\n# @return: next state\ndef semi_gradient_off_policy_TD(state, theta, alpha):\n    action = behavior_policy(state)\n    next_state = step(state, action)\n    # get the importance ratio\n    if action == DASHED:\n        rho = 0.0\n    else:\n        rho = 1.0 / BEHAVIOR_SOLID_PROBABILITY\n    delta = REWARD + DISCOUNT * np.dot(FEATURES[next_state, :], theta) - \\\n            np.dot(FEATURES[state, :], theta)\n    delta *= rho * alpha\n    # derivatives happen to be the same matrix due to the linearity\n    theta += FEATURES[state, :] * delta\n    return next_state\n\n# Semi-gradient DP\n# @theta: weight for each component of the feature vector\n# @alpha: step size\ndef semi_gradient_DP(theta, alpha):\n    delta = 0.0\n    # go through all the states\n    for state in STATES:\n        expected_return = 0.0\n        # compute bellman error for each state\n        for next_state in STATES:\n            if next_state == LOWER_STATE:\n                expected_return += REWARD + DISCOUNT * np.dot(theta, FEATURES[next_state, :])\n        bellmanError = expected_return - np.dot(theta, FEATURES[state, :])\n        # accumulate gradients\n        delta += bellmanError * FEATURES[state, :]\n    # derivatives happen to be the same matrix due to the linearity\n    theta += alpha / len(STATES) * delta\n\n# temporal difference with gradient correction\n# @state: current state\n# @theta: weight of each component of the feature vector\n# @weight: auxiliary trace for gradient correction\n# @alpha: step size of @theta\n# @beta: step size of @weight\ndef TDC(state, theta, weight, alpha, beta):\n    action = behavior_policy(state)\n    next_state = step(state, action)\n    # get the importance ratio\n    if action == DASHED:\n        rho 
= 0.0\n    else:\n        rho = 1.0 / BEHAVIOR_SOLID_PROBABILITY\n    delta = REWARD + DISCOUNT * np.dot(FEATURES[next_state, :], theta) - \\\n            np.dot(FEATURES[state, :], theta)\n    theta += alpha * rho * (delta * FEATURES[state, :] - DISCOUNT * FEATURES[next_state, :] * np.dot(FEATURES[state, :], weight))\n    weight += beta * rho * (delta - np.dot(FEATURES[state, :], weight)) * FEATURES[state, :]\n    return next_state\n\n# expected temporal difference with gradient correction\n# @theta: weight of each component of the feature vector\n# @weight: auxiliary trace for gradient correction\n# @alpha: step size of @theta\n# @beta: step size of @weight\ndef expected_TDC(theta, weight, alpha, beta):\n    for state in STATES:\n        # When computing expected update target, if next state is not lower state, importance ratio will be 0,\n        # so we can safely ignore this case and assume next state is always lower state\n        delta = REWARD + DISCOUNT * np.dot(FEATURES[LOWER_STATE, :], theta) - np.dot(FEATURES[state, :], theta)\n        rho = 1 / BEHAVIOR_SOLID_PROBABILITY\n        # Under behavior policy, state distribution is uniform, so the probability for each state is 1.0 / len(STATES)\n        expected_update_theta = 1.0 / len(STATES) * BEHAVIOR_SOLID_PROBABILITY * rho * (\n            delta * FEATURES[state, :] - DISCOUNT * FEATURES[LOWER_STATE, :] * np.dot(weight, FEATURES[state, :]))\n        theta += alpha * expected_update_theta\n        expected_update_weight = 1.0 / len(STATES) * BEHAVIOR_SOLID_PROBABILITY * rho * (\n            delta - np.dot(weight, FEATURES[state, :])) * FEATURES[state, :]\n        weight += beta * expected_update_weight\n\n    # if *accumulate* expected update and actually apply update here, then it's synchronous\n    # theta += alpha * expectedUpdateTheta\n    # weight += beta * expectedUpdateWeight\n\n# interest is 1 for every state\nINTEREST = 1\n\n# expected update of ETD\n# @theta: weight of each component of the 
feature vector\n# @emphasis: current emphasis\n# @alpha: step size of @theta\n# @return: expected next emphasis\ndef expected_emphatic_TD(theta, emphasis, alpha):\n    # we perform synchronous update for both theta and emphasis\n    expected_update = 0\n    expected_next_emphasis = 0.0\n    # go through all the states\n    for state in STATES:\n        # compute rho(t-1)\n        if state == LOWER_STATE:\n            rho = 1.0 / BEHAVIOR_SOLID_PROBABILITY\n        else:\n            rho = 0\n        # update emphasis\n        next_emphasis = DISCOUNT * rho * emphasis + INTEREST\n        expected_next_emphasis += next_emphasis\n        # When computing expected update target, if next state is not lower state, importance ratio will be 0,\n        # so we can safely ignore this case and assume next state is always lower state\n        next_state = LOWER_STATE\n        delta = REWARD + DISCOUNT * np.dot(FEATURES[next_state, :], theta) - np.dot(FEATURES[state, :], theta)\n        expected_update += 1.0 / len(STATES) * BEHAVIOR_SOLID_PROBABILITY * next_emphasis * 1 / BEHAVIOR_SOLID_PROBABILITY * delta * FEATURES[state, :]\n    theta += alpha * expected_update\n    return expected_next_emphasis / len(STATES)\n\n# compute RMSVE for a value function parameterized by @theta\n# true value function is always 0 in this example\ndef compute_RMSVE(theta):\n    return np.sqrt(np.dot(np.power(np.dot(FEATURES, theta), 2), STATE_DISTRIBUTION))\n\n# compute RMSPBE for a value function parameterized by @theta\n# true value function is always 0 in this example\ndef compute_RMSPBE(theta):\n    bellman_error = np.zeros(len(STATES))\n    for state in STATES:\n        for next_state in STATES:\n            if next_state == LOWER_STATE:\n                bellman_error[state] += REWARD + DISCOUNT * np.dot(theta, FEATURES[next_state, :]) - np.dot(theta, FEATURES[state, :])\n    bellman_error = np.dot(np.asarray(PROJECTION_MAT), bellman_error)\n    return np.sqrt(np.dot(np.power(bellman_error, 
2), STATE_DISTRIBUTION))\n\nfigureIndex = 0\n\n# Figure 11.2(left), semi-gradient off-policy TD\ndef figure_11_2_left():\n    # Initialize the theta\n    theta = np.ones(FEATURE_SIZE)\n    theta[6] = 10\n\n    alpha = 0.01\n\n    steps = 1000\n    thetas = np.zeros((FEATURE_SIZE, steps))\n    state = np.random.choice(STATES)\n    for step in tqdm(range(steps)):\n        state = semi_gradient_off_policy_TD(state, theta, alpha)\n        thetas[:, step] = theta\n\n    for i in range(FEATURE_SIZE):\n        plt.plot(thetas[i, :], label='theta' + str(i + 1))\n    plt.xlabel('Steps')\n    plt.ylabel('Theta value')\n    plt.title('semi-gradient off-policy TD')\n    plt.legend()\n\n# Figure 11.2(right), semi-gradient DP\ndef figure_11_2_right():\n    # Initialize the theta\n    theta = np.ones(FEATURE_SIZE)\n    theta[6] = 10\n\n    alpha = 0.01\n\n    sweeps = 1000\n    thetas = np.zeros((FEATURE_SIZE, sweeps))\n    for sweep in tqdm(range(sweeps)):\n        semi_gradient_DP(theta, alpha)\n        thetas[:, sweep] = theta\n\n    for i in range(FEATURE_SIZE):\n        plt.plot(thetas[i, :], label='theta' + str(i + 1))\n    plt.xlabel('Sweeps')\n    plt.ylabel('Theta value')\n    plt.title('semi-gradient DP')\n    plt.legend()\n\ndef figure_11_2():\n    plt.figure(figsize=(10, 20))\n    plt.subplot(2, 1, 1)\n    figure_11_2_left()\n    plt.subplot(2, 1, 2)\n    figure_11_2_right()\n\n    plt.savefig('../images/figure_11_2.png')\n    plt.close()\n\n# Figure 11.6(left), temporal difference with gradient correction\ndef figure_11_6_left():\n    # Initialize the theta\n    theta = np.ones(FEATURE_SIZE)\n    theta[6] = 10\n    weight = np.zeros(FEATURE_SIZE)\n\n    alpha = 0.005\n    beta = 0.05\n\n    steps = 1000\n    thetas = np.zeros((FEATURE_SIZE, steps))\n    RMSVE = np.zeros(steps)\n    RMSPBE = np.zeros(steps)\n    state = np.random.choice(STATES)\n    for step in tqdm(range(steps)):\n        state = TDC(state, theta, weight, alpha, beta)\n        thetas[:, step] = 
theta\n        RMSVE[step] = compute_RMSVE(theta)\n        RMSPBE[step] = compute_RMSPBE(theta)\n\n    for i in range(FEATURE_SIZE):\n        plt.plot(thetas[i, :], label='theta' + str(i + 1))\n    plt.plot(RMSVE, label='RMSVE')\n    plt.plot(RMSPBE, label='RMSPBE')\n    plt.xlabel('Steps')\n    plt.title('TDC')\n    plt.legend()\n\n# Figure 11.6(right), expected temporal difference with gradient correction\ndef figure_11_6_right():\n    # Initialize the theta\n    theta = np.ones(FEATURE_SIZE)\n    theta[6] = 10\n    weight = np.zeros(FEATURE_SIZE)\n\n    alpha = 0.005\n    beta = 0.05\n\n    sweeps = 1000\n    thetas = np.zeros((FEATURE_SIZE, sweeps))\n    RMSVE = np.zeros(sweeps)\n    RMSPBE = np.zeros(sweeps)\n    for sweep in tqdm(range(sweeps)):\n        expected_TDC(theta, weight, alpha, beta)\n        thetas[:, sweep] = theta\n        RMSVE[sweep] = compute_RMSVE(theta)\n        RMSPBE[sweep] = compute_RMSPBE(theta)\n\n    for i in range(FEATURE_SIZE):\n        plt.plot(thetas[i, :], label='theta' + str(i + 1))\n    plt.plot(RMSVE, label='RMSVE')\n    plt.plot(RMSPBE, label='RMSPBE')\n    plt.xlabel('Sweeps')\n    plt.title('Expected TDC')\n    plt.legend()\n\ndef figure_11_6():\n    plt.figure(figsize=(10, 20))\n    plt.subplot(2, 1, 1)\n    figure_11_6_left()\n    plt.subplot(2, 1, 2)\n    figure_11_6_right()\n\n    plt.savefig('../images/figure_11_6.png')\n    plt.close()\n\n# Figure 11.7, expected ETD\ndef figure_11_7():\n    # Initialize the theta\n    theta = np.ones(FEATURE_SIZE)\n    theta[6] = 10\n\n    alpha = 0.03\n\n    sweeps = 1000\n    thetas = np.zeros((FEATURE_SIZE, sweeps))\n    RMSVE = np.zeros(sweeps)\n    emphasis = 0.0\n    for sweep in tqdm(range(sweeps)):\n        emphasis = expected_emphatic_TD(theta, emphasis, alpha)\n        thetas[:, sweep] = theta\n        RMSVE[sweep] = compute_RMSVE(theta)\n\n    for i in range(FEATURE_SIZE):\n        plt.plot(thetas[i, :], label='theta' + str(i + 1))\n    plt.plot(RMSVE, label='RMSVE')\n    
plt.xlabel('Sweeps')\n    plt.title('emphatic TD')\n    plt.legend()\n\n    plt.savefig('../images/figure_11_7.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_11_2()\n    figure_11_6()\n    figure_11_7()\n"
  },
  {
    "path": "chapter12/lambda_effect.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2021 Johann Huber (huber.joh@hotmail.fr)                            #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\n\"\"\"\n\nDescription:\n    This script is meant to reproduce Figure 12.14 of Sutton and Barto's book. This example shows\n    the effect of λ on 4 reinforcement learning tasks.\n\nCredits:\n    The \"Cart and Pole\" environment's code has been taken from the OpenAI Gym source code.\n        Link : https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py#L7\n    The tile coding software has been taken from Sutton's website.\n        Link : http://www.incompleteideas.net/tiles/tiles3.html\n\nRemark:\n    - The search for the optimal step-size parameters has been omitted to keep the code shorter; such\n    searches have already appeared several times in the chapter.\n\n\nStructure:\n    1. Utils\n        1.1. Tiling utils\n        1.2. Eligibility traces utils\n    2. Random walk\n    3. Mountain Car\n    4. Cart and Pole\n    5. Results\n        5.1. Getting plot data\n        5.2. Reproducing figure 12.14\n        5.3. Main\n\n\"\"\"\n\n\nimport math\nimport numpy as np\nimport pandas as pd\nfrom tqdm import tqdm\nimport matplotlib.pyplot as plt\nimport seaborn as sns; sns.set_theme()\n\n\n#############################################################################################\n#                                          1. Utils                                         #\n#############################################################################################\n\n#-------------------#\n# 1.1. 
Tiling utils #\n#-------------------#\n\n# Credit : http://www.incompleteideas.net/tiles/tiles3.html\n\nbasehash = hash\n\nclass IHT:\n    \"\"\"Structure to handle collisions.\"\"\"\n\n    def __init__(self, sizeval):\n        self.size = sizeval\n        self.overfullCount = 0\n        self.dictionary = {}\n\n    def __str__(self):\n        \"\"\"Prepares a string for printing whenever this object is printed.\"\"\"\n        return \"Collision table:\" + \\\n               \" size:\" + str(self.size) + \\\n               \" overfullCount:\" + str(self.overfullCount) + \\\n               \" dictionary:\" + str(len(self.dictionary)) + \" items\"\n\n    def count(self):\n        return len(self.dictionary)\n\n    def fullp(self):\n        return len(self.dictionary) >= self.size\n\n    def getindex(self, obj, readonly=False):\n        d = self.dictionary\n        if obj in d:\n            return d[obj]\n        elif readonly:\n            return None\n        size = self.size\n        count = self.count()\n        if count >= size:\n            if self.overfullCount == 0: print('IHT full, starting to allow collisions')\n            assert self.overfullCount != 0\n            self.overfullCount += 1\n            return basehash(obj) % self.size\n        else:\n            d[obj] = count\n            return count\n\ndef hashcoords(coordinates, m, readonly=False):\n    if type(m) == IHT: return m.getindex(tuple(coordinates), readonly)\n    if type(m) == int: return basehash(tuple(coordinates)) % m\n    if m == None: return coordinates\n\nfrom math import floor, log\nfrom itertools import zip_longest\n\ndef tiles(ihtORsize, numtilings, floats, ints=[], readonly=False):\n    \"\"\"Returns num-tilings tile indices corresponding to the floats and ints\"\"\"\n    qfloats = [floor(f * numtilings) for f in floats]\n    Tiles = []\n    for tiling in range(numtilings):\n        tilingX2 = tiling * 2\n        coords = [tiling]\n        b = tiling\n        for q in qfloats:\n      
      coords.append((q + b) // numtilings)\n            b += tilingX2\n        coords.extend(ints)\n        Tiles.append(hashcoords(coords, ihtORsize, readonly))\n    return Tiles\n\n\ndef tileswrap(ihtORsize, numtilings, floats, wrapwidths, ints=[], readonly=False):\n    \"\"\"Returns num-tilings tile indices corresponding to the floats and ints, wrapping some floats\"\"\"\n    qfloats = [floor(f * numtilings) for f in floats]\n    Tiles = []\n    for tiling in range(numtilings):\n        tilingX2 = tiling * 2\n        coords = [tiling]\n        b = tiling\n        for q, width in zip_longest(qfloats, wrapwidths):\n            c = (q + b % numtilings) // numtilings\n            coords.append(c % width if width else c)\n            b += tilingX2\n        coords.extend(ints)\n        Tiles.append(hashcoords(coords, ihtORsize, readonly))\n    return Tiles\n\n\nclass IndexHashTable:\n\n    def __init__(self, iht_size, num_tilings, tiling_size, obs_bounds):\n        # Index Hash Table size\n        self._iht = IHT(iht_size)\n        # Number of tilings\n        self._num_tilings = num_tilings\n        # Tiling size\n        self._tiling_size = tiling_size\n        # Observation boundaries\n        # (format : [[min_1, max_1], ..., [min_i, max_i], ... ] for i in state's components)\n        self._obs_bounds = obs_bounds\n\n\n    def get_tiles(self, state, action):\n        \"\"\"Get the encoded state_action using Sutton's grid tiling software.\"\"\"\n        # List of floats numbers to be tiled\n        floats = [s * self._tiling_size/(obs_max - obs_min)\n                  for (s, (obs_min, obs_max)) in zip(state, self._obs_bounds)]\n\n        return tiles(self._iht, self._num_tilings, floats, [action])\n\n\n#-------------------------------#\n# 1.2. 
Eligibility traces utils #\n#-------------------------------#\n\n\ndef update_trace_vector(agent, method, state, action=None):\n    \"\"\"Updates the agent's trace vector (z) with the current state (or state-action pair) according to the given method.\n    Returns the updated vector.\"\"\"\n\n    assert method in ['replace', 'replace_reset', 'accumulating'], 'Invalid trace update method.'\n\n    # Trace step\n    z = agent._γ * agent._λ * agent._z\n\n    # Update last observations components\n    if action is not None:\n        x_ids = agent.get_active_features(state, action)  # x(s,a)\n    else:\n        x_ids = agent.get_active_features(state)  # x(s)\n\n    if method == 'replace_reset':\n        for a in agent._all_actions:\n            if a != action:\n                x_ids2clear = agent.get_active_features(state, a)  # always x(s,a)\n                for id_w in x_ids2clear:\n                    z[id_w] = 0\n\n    for id_w in x_ids:\n        if (method == 'replace') or (method == 'replace_reset'):\n            z[id_w] = 1\n        elif method == 'accumulating':\n            z[id_w] += 1\n\n    return z\n\n\n#############################################################################################\n#                                     2. Random walk                                        #\n#############################################################################################\n\nclass RandomWalkEnvironment:\n\n    def __init__(self):\n        # Number of states\n        n_states = 21  # [term=0] [1, ... 
, 19] [term=20]\n        # Transition rewards\n        self._rewards = {key:0 for key in range(n_states)}\n        self._rewards[0] = -1\n        self._rewards[n_states-1] = 1\n        # Id terminal states\n        self._terminal_states = [0, n_states - 1]\n\n    def step(self, state, action):\n        next_state = state + action\n        reward = self._rewards[next_state]\n        return next_state, reward\n\n\nclass RandomWalkAgent:\n    def __init__(self, lmbda, alpha):\n        # Number of states\n        self._n_states = 21  # [term=0] [1, ... , 19] [term=20]\n        # Weight vector\n        self._w = np.zeros(self._n_states)\n        # Eligibility trace\n        self._z = np.zeros(self._n_states)\n        # Id initial state (center state 10, whose true value is 0)\n        self._init_state = self._n_states // 2\n        # Id terminal states\n        self._terminal_states = [0, self._n_states - 1]\n        # Action space\n        self._all_actions = [-1, 1]\n        # Learning step-size\n        self._α = alpha\n        # Discount factor\n        self._γ = 1.\n        # Exponential weighting decrease\n        self._λ = lmbda\n        # True values (to compute RMS error)\n        self._target_values = np.array([i/10 for i in range(-9,10)])\n        # RMS error computed for each episode\n        self._error_hist = []\n\n    @property\n    def error_hist(self):\n        return self._error_hist\n\n    def get_all_v_hat(self):\n        all_v_hats = np.array([self.v_hat(s) for s in range(self._n_states)])\n        return all_v_hats[1:-1] # discard terminal states\n\n    def policy(self, state):\n        \"\"\"Action selection : uniform distribution. State argument is given for consistency.\"\"\"\n        return np.random.choice(self._all_actions)\n\n    def v_hat(self, state):\n        \"\"\"Returns the approximated value for state, w.r.t. the weight vector.\"\"\"\n        if state in self._terminal_states:\n            return 0. 
# by convention : R(S(T)) = 0\n        value = self._w[state]\n        return value\n\n    def grad_v_hat(self, state):\n        \"\"\"Compute the gradient of the state value w.r.t. the weight vector.\"\"\"\n        grad_v_hat = np.zeros_like(self._z)\n        grad_v_hat[state] = 1\n        return grad_v_hat\n\n    def get_active_features(self, state):\n        \"\"\"Get an array containing the id of the current active feature.\"\"\"\n        return [np.where(self.grad_v_hat(state) == 1)[0][0]]\n\n    def run_td_lambda(self, env, n_episodes, method):\n        \"\"\"Method described p293 of the book.\n\n        :param env: environment to interact with.\n        :param n_episodes: number of episodes to train on.\n        :param method: specify the TD(λ) method :\n                * 'accumulating' : With accumulating traces ;\n                * 'replace' : With replacing traces ;\n        :return: None\n        \"\"\"\n\n        assert method in ['replace', 'accumulating'], 'Invalid method'\n\n        for n_ep in range(n_episodes):\n\n            curr_state = self._init_state\n            self._z = np.zeros(self._n_states)\n\n            running = True\n            while running:\n                state = curr_state\n                action = self.policy(state)\n\n                next_state, reward = env.step(state, action)\n\n                self._z = update_trace_vector(agent=self, method=method, state=state)\n\n                # Moment-by-moment TD error\n                δ = reward + self._γ * self.v_hat(next_state) - self.v_hat(state)\n\n                # Weight vector update\n                self._w += self._α * δ * self._z\n\n                if next_state in self._terminal_states:\n                    running = False\n                else:\n                    curr_state = next_state\n\n            rms_err = np.sqrt(np.array((self._target_values - self.get_all_v_hat()) ** 2).mean())\n            self._error_hist.append(rms_err)\n\n\nclass RandomWalk:\n    def 
__init__(self, lmbda, alpha):\n        self._env = RandomWalkEnvironment()\n        self._agent = RandomWalkAgent(lmbda=lmbda, alpha=alpha)\n\n    @property\n    def error_hist(self):\n        return self._agent.error_hist\n\n    def train(self, n_episodes, method):\n        assert method in ['replace', 'accumulating'], 'Invalid method'\n        self._agent.run_td_lambda(self._env, n_episodes=n_episodes, method=method)\n\n\n#############################################################################################\n#                                     3. Mountain Car                                       #\n#############################################################################################\n\nclass MountainCarEnvironment:\n\n    def __init__(self):\n        # Action space\n        self._all_actions = [-1, 0, 1]\n        # Position bounds\n        self._pos_lims = [-1.2, 0.5]\n        # Speed bounds\n        self._vel_lims = [-0.07, 0.07]\n        # Terminal state position\n        self._pos_terminal = self._pos_lims[1]\n        # Terminal state reward\n        self._terminal_reward = 0\n        # Non-terminal state reward\n        self.step_reward = -1\n\n    def step(self, state, action):\n        x, x_dot = state\n\n        x_dot_next = x_dot + 0.001 * action - 0.0025 * np.cos(3 * x)\n        x_dot_next = np.clip(x_dot_next, a_min=self._vel_lims[0], a_max=self._vel_lims[1])\n\n        x_next = x + x_dot_next\n        x_next = np.clip(x_next, a_min=self._pos_lims[0], a_max=self._pos_lims[1])\n\n        if x_next == self._pos_lims[0]:\n            x_dot_next = 0. 
# left border : reset speed\n\n        next_state = (x_next, x_dot_next)\n        reward = self._terminal_reward if (x_next == self._pos_terminal) else self.step_reward\n\n        return next_state, reward\n\n\nclass MountainCarAgent:\n    def __init__(self, alpha, lmbda, iht_args):\n        # Index Hash Table for position encoding\n        self._iht = IndexHashTable(**iht_args)\n        # Number of tilings\n        self._num_tilings = iht_args['num_tilings']\n        # Weight vector\n        init_w_val = -20. # optimistic initial values to make the agent explore\n        self._w = np.full(iht_args['iht_size'], init_w_val)\n        # Eligibility trace\n        self._z = np.zeros(iht_args['iht_size'])\n        # Maximum number of steps within an episode (avoids infinite episodes)\n        self.max_n_step = 4000\n        # Minimum cumulated reward (means that q values have diverged).\n        self.default_min_reward = -4000\n        # Action space\n        self._all_actions = [-1, 0, 1]\n        # Position bounds\n        self._pos_lims = [-1.2, 0.5]\n        # Speed bounds\n        self._vel_lims = [-0.07, 0.07]\n        # Terminal state position\n        self._pos_terminal = self._pos_lims[1]\n        # Learning step-size\n        self._α = alpha\n        # Exponential weighting decrease\n        self._λ = lmbda\n        # Discount factor\n        self._γ = 1.\n        # Number of steps before termination, computed for each episode\n        self._n_step_hist = []\n\n    @property\n    def n_step_hist(self):\n        return self._n_step_hist\n\n    def policy(self, state):\n        \"\"\"Apply a greedy policy (ties broken at random) to choose an action from state.\"\"\"\n\n        # Always greedy : exploration is assured by optimistic initial values\n        q_sa_next = np.array([self.q_hat(state, a) for a in self._all_actions])\n        greedy_action_inds = np.where(q_sa_next == q_sa_next.max())[0]\n        ind_action = np.random.choice(greedy_action_inds) # randomly choose between maximum 
q values\n        action = self._all_actions[ind_action]\n\n        return action\n\n    def get_init_state(self):\n        \"\"\"Get a random starting position in the interval [-0.6, -0.4).\"\"\"\n        x = np.random.uniform(low=-0.6, high=-0.4)\n        x_dot = 0.\n        return x, x_dot\n\n    def is_terminal_state(self, state):\n        return state[0] == self._pos_terminal\n\n    def q_hat(self, state, action):\n        \"\"\"Compute the q value for the current state-action pair.\"\"\"\n        x, x_dot = state\n        if x == self._pos_terminal: return 0\n\n        x_s_a = self._iht.get_tiles(state, action)\n        q = np.array([self._w[id_w] for id_w in x_s_a]).sum()\n        return q\n\n    def get_active_features(self, state, action):\n        \"\"\"Get an array containing the ids of the current active features.\"\"\"\n        return self._iht.get_tiles(state, action)\n\n    def run_sarsa_lambda(self, env, n_episodes, method):\n        \"\"\"Apply Sarsa(λ) algorithm. (p.305)\n\n        :param env: environment to interact with.\n        :param n_episodes: number of episodes to train on.\n        :param method: specify the Sarsa(λ) method :\n                * 'accumulating' : With accumulating traces ;\n                * 'replace' : With replacing traces ;\n        :return: None\n        \"\"\"\n\n        assert method in ['accumulating', 'replace'], 'Invalid method arg.'\n\n        overflow_flag = False\n\n        for i_ep in range(n_episodes):\n\n            if overflow_flag:\n                # Training diverged : set default worse value for all the remaining epochs\n                self._n_step_hist.append(self.max_n_step)\n                continue\n\n            n_it = 0\n\n            state = self.get_init_state()\n            action = self.policy(state)\n\n            self._z = np.zeros(self._w.shape)\n\n            running = True\n            while(running):\n\n                try:\n                    next_state, reward = env.step(state, 
action)\n                    n_it += 1\n\n                    δ = reward\n                    δ -= self.q_hat(state, action) # q_hat(s) : implicit sum over F(s,a) (see book)\n\n                    self._z = update_trace_vector(agent=self, method=method, state=state, action=action)\n\n                    if self.is_terminal_state(next_state) or (n_it == self.max_n_step):\n                        self._w += (self._α/self._num_tilings) * δ * self._z\n                        running = False\n                        continue # go to next episode\n\n                    next_action = self.policy(next_state)\n\n                    δ += self._γ * self.q_hat(next_state, next_action) # q_hat(s') : implicit sum over F(s',a') (see book)\n                    self._w += (self._α/self._num_tilings) * δ * self._z\n\n                    state = next_state\n                    action = next_action\n\n\n                except ValueError:\n                    overflow_msg = 'λ>0.9 : expected behavior !' if (self._λ > .9) else 'Training diverged, try a lower α.'\n                    print(f'Warning : Value overflow.| λ={self._λ} , α*num_tile={self._α} | ' + overflow_msg)\n\n                    # Training data lists will be fed with default worse values for all the remaining epochs.\n                    overflow_flag = True\n                    running = False\n                    continue\n\n            if overflow_flag:\n                n_it = self.max_n_step\n\n            self._n_step_hist.append(n_it)\n\n\nclass MountainCar:\n    def __init__(self, lmbda, alpha):\n        # Environment initialization\n        self._env = MountainCarEnvironment()\n        # Observation boundaries\n        # (format : [[min_1, max_1], ..., [min_i, max_i], ... 
] for i in state's components\n        #  state = (x, x_dot))\n        obs_bounds = [[-1.2, 0.5],\n                      [-0.07, 0.07]]\n        # Tiling parameters\n        self._iht_args = {'iht_size': 2 ** 12,\n                          'num_tilings': 10,\n                          'tiling_size': 9,\n                          'obs_bounds': obs_bounds}\n        # Agent parameters\n        mc_agent_args = {'iht_args': self._iht_args,\n                         'alpha': alpha,\n                         'lmbda': lmbda}\n        # Agent initialization\n        self._agent = MountainCarAgent(**mc_agent_args)\n\n    @property\n    def n_step_hist(self):\n        return self._agent.n_step_hist\n\n    def train(self, n_episodes, method):\n        assert method in ['accumulating', 'replace'], 'Invalid method arg.'\n        self._agent.run_sarsa_lambda(self._env, n_episodes=n_episodes, method=method)\n\n\n#############################################################################################\n#                                     4. 
Cart and Pole                                      #\n#############################################################################################\n\nclass CartPoleEnvironment:\n    \"\"\"Credit : https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py#L7\"\"\"\n\n    def __init__(self):\n\n        self.gravity = 9.8\n        self.masscart = 1.0\n        self.masspole = 0.1\n        self.total_mass = (self.masspole + self.masscart)\n        self.length = 0.5  # actually half the pole's length\n        self.polemass_length = (self.masspole * self.length)\n        self.force_mag = 10.0\n        self.tau = 0.02  # seconds between state updates\n        self.kinematics_integrator = 'euler'\n\n        # Angle at which to fail the episode\n        self.theta_threshold_radians = 12 * 2 * math.pi / 360\n        # Position at which to fail the episode\n        self.x_threshold = 2.4\n        # Action space\n        self._all_actions = [0, 1] # left, right\n\n    def is_state_valid(self, state):\n        x, _, theta, _ = state\n        # Velocities aren't bounded, therefore cannot be checked.\n        is_state_invalid = bool(\n            x < -4.8\n            or x > 4.8\n            or theta < -0.418\n            or theta > 0.418\n        )\n        return not is_state_invalid\n\n    def step(self, state, action):\n        x, x_dot, theta, theta_dot = state\n        force = self.force_mag if action == 1 else -self.force_mag\n        costheta = math.cos(theta)\n        sintheta = math.sin(theta)\n\n        # For the interested reader:\n        # https://coneural.org/florian/papers/05_cart_pole.pdf\n        temp = (force + self.polemass_length * theta_dot ** 2 * sintheta) / self.total_mass\n        thetaacc = (self.gravity * sintheta - costheta * temp) / (self.length * (4.0 / 3.0 - self.masspole * costheta ** 2 / self.total_mass))\n        xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass\n\n        if 
self.kinematics_integrator == 'euler':\n            x = x + self.tau * x_dot\n            x_dot = x_dot + self.tau * xacc\n            theta = theta + self.tau * theta_dot\n            theta_dot = theta_dot + self.tau * thetaacc\n        else:  # semi-implicit euler\n            x_dot = x_dot + self.tau * xacc\n            x = x + self.tau * x_dot\n            theta_dot = theta_dot + self.tau * thetaacc\n            theta = theta + self.tau * theta_dot\n\n        next_state = (x, x_dot, theta, theta_dot)\n        reward = 1.0\n\n        return next_state, reward\n\n\nclass CartPoleAgent:\n    def __init__(self, iht_args, alpha, lmbda):\n        # Index Hash Table for position encoding\n        self._iht = IndexHashTable(**iht_args)\n        # Weight vector\n        self._w = np.zeros(iht_args['iht_size'])\n        # Number of tilings\n        self._num_tilings = iht_args['num_tilings']\n        # Eligibility trace\n        self._z = self._z = np.zeros(self._w.shape)\n        # Exponential weighting decrease\n        self._λ = lmbda\n        # Max number of failures (default worse n_failures)\n        self.max_n_failures = 100000\n        # Action space\n        self._all_actions = [0, 1]\n        # Learning step-size\n        self._α = alpha\n        # Discount factor\n        self._γ = 0.99\n        # Exploration ratio\n        self._ε = 0.1\n        # Angle at which to fail the episode (12°)\n        self.theta_threshold_radians = 12 * 2 * math.pi / 360\n        # Position at which to fail the episode\n        self.x_threshold = 2.4\n        # Number of failures (updated while running Sarsa(λ))\n        self._n_failures = 0\n\n    @property\n    def n_failures(self):\n        return self._n_failures\n\n    def policy(self, state):\n        \"\"\"Apply a ε-greedy policy to choose an action from state.\"\"\"\n        if np.random.random_sample() < self._ε:\n            action = self._all_actions[np.random.choice(range(len(self._all_actions)))]\n            return 
action\n\n        q_sa_next = np.array([self.q_hat(state, a) for a in self._all_actions])\n        greedy_action_inds = np.where(q_sa_next == q_sa_next.max())[0]\n        ind_action = np.random.choice(greedy_action_inds)\n        action = self._all_actions[ind_action]\n\n        return action\n\n    def is_state_valid(self, state):\n        x, _, theta, _ = state\n        is_state_invalid = bool(\n            x < -4.8 or x > 4.8\n            or theta < -0.418 or theta > 0.418\n        )\n        return not is_state_invalid\n\n    def get_init_state(self):\n        \"\"\"Get a random starting position.\"\"\"\n        state = np.random.uniform(low=-0.05, high=0.05, size=(4,))\n        return state\n\n    def is_state_over_bounds(self, state):\n        \"\"\"Returns True if the current state is out of bounds, i.e. the current run is over. Returns\n        False otherwise.\"\"\"\n\n        x, x_dot, theta, theta_dot = state\n        return bool(\n            x < -self.x_threshold\n            or x > self.x_threshold\n            or theta < -self.theta_threshold_radians\n            or theta > self.theta_threshold_radians\n        )\n\n    def q_hat(self, state, action):\n        \"\"\"Compute the q value for the current state-action pair.\"\"\"\n        if self.is_state_over_bounds(state): return 0.\n\n        x_s_a = self._iht.get_tiles(state, action)\n        q = np.array([self._w[id_w] for id_w in x_s_a]).sum()\n        return q\n\n    def get_active_features(self, state, action):\n        \"\"\"Get an array containing the ids of the current active features.\"\"\"\n        return self._iht.get_tiles(state, action)\n\n    def run_sarsa_lambda(self, env, n_step_max, method):\n        \"\"\"Apply Sarsa(λ) algorithm. 
(p.305)\n\n        :param env: environment to interact with.\n        :param n_step_max: number of steps to train on.\n        :param method: specify the Sarsa(λ) method :\n                * 'accumulating' : With accumulating traces ;\n        :return: None\n        \"\"\"\n        assert method in ['accumulating'], 'Invalid method arg.'\n\n        n_step = 0 # number of steps across episodes\n        n_ep = 0 # number of episode\n\n        while n_step < n_step_max:\n            n_ep += 1\n            n_step_try = 0 # number of steps in the current episode\n\n            state = self.get_init_state()\n            action = self.policy(state)\n\n            self._z = np.zeros_like(self._w)\n\n            running = True\n            while running:\n\n                try:\n                    next_state, reward = env.step(state, action)\n                    n_step_try += 1\n                    n_step += 1\n\n                    δ = reward\n                    δ -= self.q_hat(state, action) # q_hat(s) : implicit sum over F(s,a) (see book)\n\n                    self._z = update_trace_vector(agent=self, method=method, state=state, action=action)\n\n                    # End of run\n                    if n_step == n_step_max:\n                        running = False\n\n                    # Failed trial\n                    if self.is_state_over_bounds(next_state) :\n                        self._w += (self._α/self._num_tilings) * δ * self._z\n                        self._n_failures += 1\n                        running = False\n                        continue\n\n                    next_action = self.policy(next_state)\n\n                    δ += self._γ * self.q_hat(next_state, next_action) # q_hat(s') : implicit sum over F(s',a') (see book)\n                    self._w += (self._α/self._num_tilings) * δ * self._z\n\n                    state = next_state\n                    action = next_action\n\n                except ValueError:\n                    
overflow_msg = 'λ>0.9 : expected behavior !' if (self._λ > .9) else 'Training diverged, try a lower α.'\n                    print(f'Warning : Value overflow.| λ={self._λ} , α*num_tile={self._α} | ' + overflow_msg)\n\n                    # Training metric is set with the default worse value.\n                    self._n_failures = self.max_n_failures\n                    running = False\n                    n_step = n_step_max\n                    continue\n\n        #print('Running over. n_ep =', n_ep)\n\n\nclass CartPole:\n    def __init__(self, lmbda, alpha):\n        # Environment initialization\n        self._env = CartPoleEnvironment()\n        # Observation boundaries\n        # (format : [[min_1, max_1], ..., [min_i, max_i], ... ] for i in state's components.\n        #  state = (x, x_dot, theta, theta_dot)\n        #  \"Fake\" bounds have been set for velocity components to ease tiling.)\n        obs_bounds = [[-4.8, 4.8],\n                      [-3., 3.],\n                      [-0.25, 0.25],\n                      [-3., 3.]]\n        # Tiling parameters\n        self._iht_args = {'iht_size': 2 ** 11,\n                          'num_tilings': 2,\n                          'tiling_size': 4,\n                          'obs_bounds': obs_bounds}\n        # Agent parameters\n        pw_agent_args = {'iht_args': self._iht_args,\n                         'alpha': alpha,\n                         'lmbda': lmbda}\n        # Agent initialization\n        self._agent = CartPoleAgent(**pw_agent_args)\n\n    @property\n    def n_failures(self):\n        return self._agent.n_failures\n\n    def train(self, n_step_max, method):\n        assert method in ['accumulating'], 'Invalid method'\n        self._agent.run_sarsa_lambda(self._env, n_step_max=n_step_max, method=method)\n\n\n\n#############################################################################################\n#                                     5. 
Puddle World                                       #\n#############################################################################################\n\nclass PuddleWorldGrid:\n    def __init__(self):\n        # Grid dimensions\n        self._h, self._w = (1, 1)\n        # Distance to the top left corner that define the goal area\n        self._goal_len = 0.01\n        # Position of puddle centers\n        # format : ((i_center_a, j_center_a), (i_center_b, j_center_b))\n        self._pos_centers_puddles = [((.25, .1), (.25, .45)),\n                                     ((.2, .45), (.6, .45))]\n        # Puddle radius\n        self._puddle_radius =  0.1\n        # Figure dimension for plotting\n        self._fig_size = (10, 10)\n\n    @property\n    def height(self):\n        return self._h\n\n    @property\n    def width(self):\n        return self._w\n\n    def is_state_goal(self, state):\n        i,j = state\n        g_i, g_j = (0., 1.)\n        dist2goal = np.sqrt((i - g_i) ** 2 + (j - g_j) ** 2)\n        return dist2goal <= self._goal_len\n\n    def get_dist2puddle(self, state):\n        \"\"\"Get state's distance (float) to the nearest puddle's border.\n        Returns a float corresponding to the state's distance to the nearest puddle border. 
Return -1 if\n        the state to evaluate is far enough from puddles to be not affected by the cost penalty.\n        \"\"\"\n        i, j = state\n        max_dist = -1 # puddle cost is defined by the maximal distance to border\n\n        # Unpack puddle pos\n        (p_horiz_ij_1, p_horiz_ij_2), (p_verti_ij_1, p_verti_ij_2) = self._pos_centers_puddles\n        p_horiz_i_1, p_horiz_j_1 = p_horiz_ij_1\n        p_horiz_i_2, p_horiz_j_2 = p_horiz_ij_2\n        p_verti_i_1, p_verti_j_1 = p_verti_ij_1\n        p_verti_i_2, p_verti_j_2 = p_verti_ij_2\n\n        dist2centers = [np.sqrt((i - p_horiz_i_1) ** 2 + (j - p_horiz_j_1) ** 2),\n                        np.sqrt((i - p_horiz_i_2) ** 2 + (j - p_horiz_j_2) ** 2),\n                        np.sqrt((i - p_verti_i_1) ** 2 + (j - p_verti_j_1) ** 2),\n                        np.sqrt((i - p_verti_i_2) ** 2 + (j - p_verti_j_2) ** 2)]\n        min_dist2centers = np.array(dist2centers).min()\n\n        if min_dist2centers <= self._puddle_radius:\n            dist2border = self._puddle_radius - min_dist2centers\n            if max_dist < dist2border:\n                max_dist = dist2border\n\n        # Horizontal puddle axis\n        if (j >= p_horiz_j_1) and (j <= p_horiz_j_2):\n            dist2horiz_axis_p = np.abs(i - p_horiz_i_1)\n\n            if (dist2horiz_axis_p <= self._puddle_radius):\n                dist2border = self._puddle_radius - dist2horiz_axis_p\n                if max_dist < dist2border:\n                    max_dist = dist2border\n\n        # Vertical puddle axis\n        if (i >= p_verti_i_1) and (i <= p_verti_i_2):\n            dist2verti_axis_p = np.abs(j - p_verti_j_1)\n\n            if dist2verti_axis_p <= self._puddle_radius:\n                dist2border = self._puddle_radius - dist2verti_axis_p\n                if max_dist < dist2border:\n                    max_dist = dist2border\n\n        dist2puddle = max_dist\n        return dist2puddle\n\n    def cvt_ij2xy(self, pos_ij):\n        return 
pos_ij[1], self._h - pos_ij[0]\n\n    def draw(self):\n        fig, ax = plt.subplots(1, 1, figsize=self._fig_size)\n\n        # Goal corner\n        goal_area_ij = [[0, self._w],\n                        [0, self._w - self._goal_len],\n                        [self._goal_len, self._w]]\n        goal_area_xy = [self.cvt_ij2xy(pos_ij) for pos_ij in goal_area_ij]\n        goal_triangle = plt.Polygon(goal_area_xy, color='tab:green')\n        ax.add_patch(goal_triangle)\n\n        for i in tqdm(np.arange(0., 1., 0.005), desc='Creating map'):\n            for j in np.arange(0., 1., 0.005):\n                dist = self.get_dist2puddle(state=(i,j))\n\n                if dist == -1:\n                    continue # far from puddles\n\n                # Grayscale : min=0.25 , max=0.75\n                color_intensity = (dist / self._puddle_radius) * (0.75 - 0.25) + 0.25\n\n                x, y = self.cvt_ij2xy((i, j))\n                dot = plt.Circle((x, y), 0.002, color=str(color_intensity))\n                ax.add_patch(dot)\n\n        ax.set_xlim(0, self._w)\n        ax.set_ylim(0, self._h)\n        ax.set_title('PUDDLE WORLD', fontsize=18)\n        # plt.waitforbuttonpress()\n\n        # Export\n        export_name = 'puddleworld_map'\n        plt.savefig(export_name)\n        print(f'Puddle world map exported as : {export_name}.png')\n\n\nclass PuddleWorldEnvironment:\n    def __init__(self, grid):\n        # Grid object\n        self._grid = grid\n        # Action space\n        self._all_actions = [(i, j) for i in range(-1, 2) for j in range(-1, 2) if abs(i) != abs(j)]\n        # Step size when taking an action in a certain direction\n        self._step_range = 0.05\n        # Transition cost\n        self._step_cost = -1\n        # Transition cost when the agent walks on puddles\n        self._puddle_cost = -400\n\n    def step(self, state, action):\n        # Random gaussian noise (std=0.01) on each move\n        move_noise = np.random.normal(0, 0.01, 
len(state))\n\n        # Move\n        next_state = np.array(state) + self._step_range*np.array(action) + move_noise\n        next_state = (np.clip(next_state[0], a_min=0, a_max=self._grid.height),\n                      np.clip(next_state[1], a_min=0, a_max=self._grid.width))\n\n        # Cost\n        dist2puddle = self._grid.get_dist2puddle(next_state)\n        is_far_from_puddles = dist2puddle == -1\n        reward = self._step_cost if is_far_from_puddles else self._puddle_cost * dist2puddle\n\n        return next_state, reward\n\n\nclass PuddleWorldAgent:\n    def __init__(self, grid, alpha, lmbda, iht_args):\n        # Index Hash Table for position encoding\n        self._iht = IndexHashTable(**iht_args)\n        # Weight vector\n        self._w = np.zeros(iht_args['iht_size'])\n        # Number of tilings\n        self._num_tilings = iht_args['num_tilings']\n        # Action space\n        self._all_actions = [(i, j) for i in range(-1, 2) for j in range(-1, 2) if abs(i) != abs(j)]\n        # Grid object (uses metadata to check state validity)\n        self._grid = grid\n        # Learning step-size\n        self._α = alpha\n        # Exponential weighting decrease\n        self._λ = lmbda\n        # Discount factor\n        self._γ = 1.\n        # Exploration ratio\n        self._ε = 0.1\n        # Cost computed for each episode\n        self._cost_per_ep_hist = []\n\n    @property\n    def cost_per_ep_hist(self):\n        return self._cost_per_ep_hist\n\n    def policy(self, state):\n        \"\"\"Apply a ε-greedy policy to choose an action from state.\"\"\"\n\n        if np.random.random_sample() < self._ε:\n            action = self._all_actions[np.random.choice(range(len(self._all_actions)))]\n            return action\n\n        q_hat = np.array([self.q_hat(state, a) for a in self._all_actions])\n        greedy_action_inds = np.where(q_hat == q_hat.max())[0]\n        ind_action = np.random.choice(greedy_action_inds)\n\n        action = 
self._all_actions[ind_action]\n        return action\n\n    def get_start_pos(self):\n        \"\"\"Randomly pick a non-goal state as starting position.\"\"\"\n        i_pos, j_pos = -1, -1\n\n        is_init_pos_found = False\n        while not is_init_pos_found:\n            i_pos = np.random.randint(low=0, high=self._grid.height)\n            j_pos = np.random.randint(low=0, high=self._grid.width)\n\n            if not self._grid.is_state_goal((i_pos, j_pos)):\n                is_init_pos_found = True\n\n        assert (i_pos != -1) and (j_pos != -1), 'Error while looking for an init position.'\n        return i_pos, j_pos\n\n    def is_terminal_state(self, state):\n        return self._grid.is_state_goal(state)\n\n    def q_hat(self, state, action):\n        \"\"\"Compute the q value for the current state-action pair.\"\"\"\n        if self.is_terminal_state(state):\n            return 0.\n\n        x_s_a = self._iht.get_tiles(state, action)\n        q = np.array([self._w[id_w] for id_w in x_s_a]).sum()\n        return q\n\n    def get_active_features(self, state, action):\n        \"\"\"Get an array containing the ids of the current active features.\"\"\"\n        return self._iht.get_tiles(state, action)\n\n    def run_sarsa_lambda(self, env, n_episodes, method):\n        \"\"\"Apply Sarsa(λ) algorithm. 
(p.305)\n\n        :param env: environment to interact with.\n        :param n_episodes: number of episodes to train on.\n        :param method: specify the Sarsa(λ) method :\n                * 'accumulating' : With accumulating traces ;\n                * 'replace' : With replacing traces ;\n                * 'replace_reset' : With replacing traces, and clearing the traces of other actions.\n        :return: None\n        \"\"\"\n        assert method in ['replace_reset'], 'Invalid method arg.'\n\n        for i_ep in range(n_episodes):\n            cumu_reward = 0 # cumulated reward\n\n            state = self.get_start_pos()\n            action = self.policy(state)\n\n            self._z = np.zeros(self._w.shape)\n\n            running = True\n            while(running):\n\n                next_state, reward = env.step(state, action)\n                cumu_reward += reward\n\n                δ = reward\n                δ -=  self.q_hat(state, action) # q_hat(s) : implicit sum over F(s,a) (see book)\n\n                self._z = update_trace_vector(agent=self, method=method, state=state, action=action)\n\n                if self.is_terminal_state(state) :\n                    self._w += (self._α / self._num_tilings) * δ * self._z\n                    running = False\n                    continue # go to next episode\n\n                next_action = self.policy(next_state)\n\n                δ += self._γ * self.q_hat(next_state, next_action) # q_hat(s') : implicit sum over F(s',a') (see book)\n\n                self._w += (self._α / self._num_tilings) * δ * self._z\n\n                state = next_state\n                action = next_action\n\n            self._cost_per_ep_hist.append(-cumu_reward)\n\n\nclass PuddleWorld:\n    def __init__(self, lmbda, alpha):\n        # Grid initialization\n        self._grid = PuddleWorldGrid()\n        # Environment initialization\n        self._env = PuddleWorldEnvironment(grid=self._grid)\n        # Observation boundaries\n       
 # (format : [[min_1, max_1], ..., [min_i, max_i], ... ] for i in state's components\n        #  state = (i, j))\n        obs_bounds = [[0., 1.],\n                      [0., 1.]]\n        # Tiling parameters\n        iht_args = {'iht_size': 2**10,\n                    'num_tilings': 5,\n                    'tiling_size': 5,\n                    'obs_bounds': obs_bounds}\n        # Agent parameters\n        pw_agent_args = {'grid': self._grid,\n                         'alpha': alpha,\n                         'lmbda': lmbda,\n                         'iht_args': iht_args}\n        # Agent initialization\n        self._agent = PuddleWorldAgent(**pw_agent_args)\n\n    @property\n    def cost_per_ep_hist(self):\n        return self._agent.cost_per_ep_hist\n\n    def draw(self):\n        self._grid.draw()\n\n    def train(self, n_episodes, method):\n        assert method in ['replace_reset'], 'Invalid method arg.'\n        self._agent.run_sarsa_lambda(self._env, n_episodes=n_episodes, method=method)\n\n\ndef get_puddle_world_map():\n    \"\"\"Creates the puddle world map and save the figure in the local folder as a .png file.\"\"\"\n    pw = PuddleWorld(0., 0.) # dummy args\n    pw.draw()\n\n\n#############################################################################################\n#                                        5. Results                                         #\n#############################################################################################\n\n#---------------------------#\n# 5.1. 
Getting plot data    #\n#---------------------------#\n\n\ndef get_random_walk_plot_data():\n    n_episodes = 10\n    n_runs = 1000\n\n    lambda_range = {'replace': [0., 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.975, 0.99, 1.],\n                    'accumulating': [0., 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.975, 0.99, 1.]}\n\n    # Optimal alpha coefficient for each lambda\n    alpha_range = {'replace': [0.8, 0.8, 0.8, 0.6, 0.6, 0.4, 0.4, 0.4, 0.3, 0.3],\n                   'accumulating': [0.8, 0.8, 0.8, 0.6, 0.3, 0.2, 0.1, 0.05, 0.03, 0.01]}\n\n    all_df_vis = {}\n\n\n    for method in ['replace', 'accumulating']:  # 'accumulating' # replace\n\n        rms_per_ep = []\n        for n_run in tqdm(range(n_runs), desc=f'RANDOM WALK | method={method}'):\n\n            for λ, α in zip(lambda_range[method], alpha_range[method]):\n                np.random.seed(n_run) # Seed per run so that both methods (and every λ) experience the same trials\n\n                random_walk = RandomWalk(lmbda=λ, alpha=α)\n                random_walk.train(n_episodes=n_episodes, method=method)\n\n                rms_per_ep.append([λ, np.array(random_walk.error_hist).mean()])\n\n        df_str_key = f'{method}'\n        all_df_vis[df_str_key] = pd.DataFrame(np.array(rms_per_ep), columns=['lambda', 'rms'])\n        all_df_vis[df_str_key]['method'] = np.full(len(rms_per_ep), method)\n\n    # No error bar on the book's RandomWalk figure\n    df_vis = pd.concat(all_df_vis.values(), ignore_index=True)\n    df_vis = df_vis.groupby(['lambda', 'method'])['rms'].mean().reset_index()\n    return df_vis\n\ndef get_mountain_car_plot_data():\n    n_episodes = 20\n    n_runs = 30\n\n    lambda_range = {'replace': [0., 0.4, 0.7, 0.8, 0.9, 0.95, 0.99, 1.],\n                    'accumulating': [0., 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.]}\n\n    # Optimal alpha coefficient for each lambda\n    alpha_range = {'replace': [1.4, 1.7, 1.5, 1.7, 1.7, 1.2, 0.6, 0.4],\n                   'accumulating': [1.4, 1.3, 1.0, 0.8,
0.6, 0.5, 0.3, 0.15, 0.03, 0.01]}\n\n    all_df_vis = {}\n\n    for method in ['replace', 'accumulating']:\n        n_step_per_ep = []\n\n        for n_run in tqdm(range(n_runs), desc=f'MOUNTAIN CAR | method={method}'):\n            for λ, α in zip(lambda_range[method], alpha_range[method]):\n                np.random.seed(n_run)\n                mountain_car = MountainCar(lmbda=λ, alpha=α)\n                mountain_car.train(n_episodes=n_episodes, method=method)\n                n_step_per_ep.append([λ, np.array(mountain_car.n_step_hist).mean()])\n\n        df_str_key = f'{method}'\n        all_df_vis[df_str_key] = pd.DataFrame(np.array(n_step_per_ep), columns=['lambda', 'steps'])\n        all_df_vis[df_str_key]['method'] = np.full(len(n_step_per_ep), method)\n\n    df_vis = pd.concat(all_df_vis.values(), ignore_index=True)\n    return df_vis\n\n\ndef get_cart_pole_plot_data():\n    n_step_max = 100000\n    n_runs = 30\n\n    method = 'accumulating'\n\n    lambda_range = [0., 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.99]\n\n    # Optimal alpha coefficient for each lambda\n    alpha_range = [.3, .3, .3, .3, .1, .1, .1, .1]\n\n    all_df_vis = {}\n\n    n_fail_lambdas = []\n    for n_run in tqdm(range(n_runs), desc=f'CART POLE | method={method}'):\n        for λ, α in zip(lambda_range, alpha_range):\n            np.random.seed(n_run)  # Seed per run so that every λ setting experiences the same trials\n            cart_pole = CartPole(lmbda=λ, alpha=α)\n            cart_pole.train(n_step_max=n_step_max, method=method)\n\n            n_fail_lambdas.append([λ, cart_pole.n_failures])\n\n    df_str_key = f'{n_runs}'\n    all_df_vis[df_str_key] = pd.DataFrame(np.array(n_fail_lambdas), columns=['lambda', 'n_fails'])\n    all_df_vis[df_str_key]['method'] = np.full(len(n_fail_lambdas), method)\n\n    df_vis = pd.concat(all_df_vis.values())\n    return df_vis\n\n\ndef get_puddle_world_plot_data():\n    n_episodes = 40\n    n_runs = 30\n\n    method = 'replace_reset'\n\n
lambda_range = [0., 0.5, 0.8, 0.9, 0.95, 0.98, 0.99, 1.]\n\n    # Optimal alpha coefficient for each lambda\n    alpha_range = [.7, .7, .5, .5, .5, .5, .5, .3]\n\n    all_df_vis = {}\n\n    rand_seed = 0\n    np.random.seed(rand_seed)\n\n    cost_per_ep = []\n    for n_run in tqdm(range(n_runs), desc=f'PUDDLE WORLD | method={method}'):\n        for λ, α in zip(lambda_range, alpha_range):\n\n            puddle_world = PuddleWorld(lmbda=λ, alpha=α)\n            puddle_world.train(n_episodes=n_episodes, method=method)\n\n            cost_per_ep.append([λ, np.array(puddle_world.cost_per_ep_hist).mean()])\n\n    df_str_key = f'{rand_seed}'\n    all_df_vis[df_str_key] = pd.DataFrame(np.array(cost_per_ep), columns=['lambda', 'cost'])\n    all_df_vis[df_str_key]['method'] = np.full(len(cost_per_ep), method)\n\n    df_vis = pd.concat(all_df_vis.values())\n    return df_vis\n\n\n#----------------------------------#\n# 5.2. Reproducing figure 12.14    #\n#----------------------------------#\n\ndef figure_12_14():\n    # Get plot data for each task\n    df_rw = get_random_walk_plot_data()\n    df_mc = get_mountain_car_plot_data()\n    df_cp = get_cart_pole_plot_data()\n    df_pw = get_puddle_world_plot_data()\n\n    fig, axes = plt.subplots(2, 2, figsize=(12, 12))\n\n    # Mountain Car\n    sns.lineplot(data=df_mc, x='lambda', y='steps', hue='method',\n                 style='method', ax=axes[0, 0], marker='o', ci=68, err_style=\"bars\")\n    axes[0, 0].set_xlabel('λ')\n    axes[0, 0].set_ylabel('Steps per episode')\n    axes[0, 0].set_title(f\"MOUNTAIN CAR\", fontsize=15)\n    axes[0, 0].set_ylim(0, 500)\n\n    # Random walk\n    sns.lineplot(data=df_rw, x='lambda', y='rms', hue='method', style='method', ax=axes[0, 1], marker='o')\n    axes[0, 1].set_xlabel('λ')\n    axes[0, 1].set_ylabel('RMS error')\n    axes[0, 1].set_title(f\"RANDOM WALK\", fontsize=15)\n    axes[0, 1].set_ylim(0.2, 0.6)\n\n    # Puddle world\n    sns.lineplot(data=df_pw, x='lambda', y='cost', 
hue='method', ax=axes[1, 0], marker='o', ci=68, err_style=\"bars\")\n    axes[1, 0].set_xlabel('λ')\n    axes[1, 0].set_ylabel('Cost per episode')\n    axes[1, 0].set_title(f\"PUDDLE WORLD\", fontsize=15)\n\n    # Cart and pole\n    sns.lineplot(data=df_cp, x='lambda', y='n_fails', hue='method', ax=axes[1, 1], marker='o', ci=68,\n                 err_style=\"bars\")\n    axes[1, 1].set_xlabel('λ')\n    axes[1, 1].set_ylabel('Failures per 100 000 steps')\n    axes[1, 1].set_title(f\"CART AND POLE\", fontsize=15)\n\n    plt.savefig('combined_fig_test')\n    #plt.waitforbuttonpress()\n\n#--------------#\n# 5.3. Main    #\n#--------------#\n\nif __name__ == '__main__':\n\n    figure_12_14() # ~2h on colab\n\n    #get_puddle_world_map()\n\n\n"
  },
  {
    "path": "chapter12/mountain_car.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2017-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom math import floor\nfrom tqdm import tqdm\n\n#######################################################################\n# Following are some utilities for tile coding from Rich.\n# To make each file self-contained, I copied them from\n# http://incompleteideas.net/tiles/tiles3.py-remove\n# with some naming convention changes\n#\n# Tile coding starts\nclass IHT:\n    \"Structure to handle collisions\"\n    def __init__(self, size_val):\n        self.size = size_val\n        self.overfull_count = 0\n        self.dictionary = {}\n\n    def count(self):\n        return len(self.dictionary)\n\n    def full(self):\n        return len(self.dictionary) >= self.size\n\n    def get_index(self, obj, read_only=False):\n        d = self.dictionary\n        if obj in d:\n            return d[obj]\n        elif read_only:\n            return None\n        size = self.size\n        count = self.count()\n        if count >= size:\n            if self.overfull_count == 0: print('IHT full, starting to allow collisions')\n            self.overfull_count += 1\n            return hash(obj) % self.size\n        else:\n            d[obj] = count\n            return count\n\ndef hash_coords(coordinates, m, read_only=False):\n    if isinstance(m, IHT): return m.get_index(tuple(coordinates), read_only)\n    if isinstance(m, int): return hash(tuple(coordinates)) % m\n    if m is None: return coordinates\n\ndef tiles(iht_or_size, num_tilings, floats, 
ints=None, read_only=False):\n    \"\"\"returns num-tilings tile indices corresponding to the floats and ints\"\"\"\n    if ints is None:\n        ints = []\n    qfloats = [floor(f * num_tilings) for f in floats]\n    tiles = []\n    for tiling in range(num_tilings):\n        tilingX2 = tiling * 2\n        coords = [tiling]\n        b = tiling\n        for q in qfloats:\n            coords.append((q + b) // num_tilings)\n            b += tilingX2\n        coords.extend(ints)\n        tiles.append(hash_coords(coords, iht_or_size, read_only))\n    return tiles\n# Tile coding ends\n#######################################################################\n\n# all possible actions\nACTION_REVERSE = -1\nACTION_ZERO = 0\nACTION_FORWARD = 1\n# order is important\nACTIONS = [ACTION_REVERSE, ACTION_ZERO, ACTION_FORWARD]\n\n# bound for position and velocity\nPOSITION_MIN = -1.2\nPOSITION_MAX = 0.5\nVELOCITY_MIN = -0.07\nVELOCITY_MAX = 0.07\n\n# discount is always 1.0 in these experiments\nDISCOUNT = 1.0\n\n# use optimistic initial value, so it's ok to set epsilon to 0\nEPSILON = 0\n\n# maximum steps per episode\nSTEP_LIMIT = 5000\n\n# take an @action at @position and @velocity\n# @return: new position, new velocity, reward (always -1)\ndef step(position, velocity, action):\n    new_velocity = velocity + 0.001 * action - 0.0025 * np.cos(3 * position)\n    new_velocity = min(max(VELOCITY_MIN, new_velocity), VELOCITY_MAX)\n    new_position = position + new_velocity\n    new_position = min(max(POSITION_MIN, new_position), POSITION_MAX)\n    reward = -1.0\n    if new_position == POSITION_MIN:\n        new_velocity = 0.0\n    return new_position, new_velocity, reward\n\n# accumulating trace update rule\n# @trace: old trace (will be modified)\n# @activeTiles: current active tile indices\n# @lam: lambda\n# @return: new trace for convenience\ndef accumulating_trace(trace, active_tiles, lam):\n    trace *= lam * DISCOUNT\n    trace[active_tiles] += 1\n    return trace\n\n# replacing 
trace update rule\n# @trace: old trace (will be modified)\n# @activeTiles: current active tile indices\n# @lam: lambda\n# @return: new trace for convenience\ndef replacing_trace(trace, activeTiles, lam):\n    active = np.in1d(np.arange(len(trace)), activeTiles)\n    trace[active] = 1\n    trace[~active] *= lam * DISCOUNT\n    return trace\n\n# replacing trace update rule, 'clearing' means set all tiles corresponding to non-selected actions to 0\n# @trace: old trace (will be modified)\n# @active_tiles: current active tile indices\n# @lam: lambda\n# @clearing_tiles: tiles to be cleared\n# @return: new trace for convenience\ndef replacing_trace_with_clearing(trace, active_tiles, lam, clearing_tiles):\n    active = np.in1d(np.arange(len(trace)), active_tiles)\n    trace[~active] *= lam * DISCOUNT\n    trace[clearing_tiles] = 0\n    trace[active] = 1\n    return trace\n\n# dutch trace update rule\n# @trace: old trace (will be modified)\n# @active_tiles: current active tile indices\n# @lam: lambda\n# @alpha: step size for all tiles\n# @return: new trace for convenience\ndef dutch_trace(trace, active_tiles, lam, alpha):\n    coef = 1 - alpha * DISCOUNT * lam * np.sum(trace[active_tiles])\n    trace *= DISCOUNT * lam\n    trace[active_tiles] += coef\n    return trace\n\n# wrapper class for Sarsa(lambda)\nclass Sarsa:\n    # In this example I use the tiling software instead of implementing standard tiling by myself\n    # One important thing is that tiling is only a map from (state, action) to a series of indices\n    # It doesn't matter whether the indices have meaning, as long as this map satisfies certain properties\n    # View the following webpage for more information\n    # http://incompleteideas.net/sutton/tiles/tiles3.html\n    # @max_size: the maximum # of indices\n    def __init__(self, step_size, lam, trace_update=accumulating_trace, num_of_tilings=8, max_size=2048):\n        self.max_size = max_size\n        self.num_of_tilings = num_of_tilings\n        self.trace_update =
trace_update\n        self.lam = lam\n\n        # divide step size equally to each tiling\n        self.step_size = step_size / num_of_tilings\n\n        self.hash_table = IHT(max_size)\n\n        # weight for each tile\n        self.weights = np.zeros(max_size)\n\n        # trace for each tile\n        self.trace = np.zeros(max_size)\n\n        # position and velocity needs scaling to satisfy the tile software\n        self.position_scale = self.num_of_tilings / (POSITION_MAX - POSITION_MIN)\n        self.velocity_scale = self.num_of_tilings / (VELOCITY_MAX - VELOCITY_MIN)\n\n    # get indices of active tiles for given state and action\n    def get_active_tiles(self, position, velocity, action):\n        # I think positionScale * (position - position_min) would be a good normalization.\n        # However positionScale * position_min is a constant, so it's ok to ignore it.\n        active_tiles = tiles(self.hash_table, self.num_of_tilings,\n                            [self.position_scale * position, self.velocity_scale * velocity],\n                            [action])\n        return active_tiles\n\n    # estimate the value of given state and action\n    def value(self, position, velocity, action):\n        if position == POSITION_MAX:\n            return 0.0\n        active_tiles = self.get_active_tiles(position, velocity, action)\n        return np.sum(self.weights[active_tiles])\n\n    # learn with given state, action and target\n    def learn(self, position, velocity, action, target):\n        active_tiles = self.get_active_tiles(position, velocity, action)\n        estimation = np.sum(self.weights[active_tiles])\n        delta = target - estimation\n        if self.trace_update == accumulating_trace or self.trace_update == replacing_trace:\n            self.trace_update(self.trace, active_tiles, self.lam)\n        elif self.trace_update == dutch_trace:\n            self.trace_update(self.trace, active_tiles, self.lam, self.step_size)\n        elif 
self.trace_update == replacing_trace_with_clearing:\n            clearing_tiles = []\n            for act in ACTIONS:\n                if act != action:\n                    clearing_tiles.extend(self.get_active_tiles(position, velocity, act))\n            self.trace_update(self.trace, active_tiles, self.lam, clearing_tiles)\n        else:\n            raise Exception('Unexpected Trace Type')\n        self.weights += self.step_size * delta * self.trace\n\n    # get # of steps to reach the goal under current state value function\n    def cost_to_go(self, position, velocity):\n        costs = []\n        for action in ACTIONS:\n            costs.append(self.value(position, velocity, action))\n        return -np.max(costs)\n\n# get action at @position and @velocity based on epsilon greedy policy and @valueFunction\ndef get_action(position, velocity, valueFunction):\n    if np.random.binomial(1, EPSILON) == 1:\n        return np.random.choice(ACTIONS)\n    values = []\n    for action in ACTIONS:\n        values.append(valueFunction.value(position, velocity, action))\n    return np.argmax(values) - 1\n\n# play Mountain Car for one episode based on given method @evaluator\n# @return: total steps in this episode\ndef play(evaluator):\n    position = np.random.uniform(-0.6, -0.4)\n    velocity = 0.0\n    action = get_action(position, velocity, evaluator)\n    steps = 0\n    while True:\n        next_position, next_velocity, reward = step(position, velocity, action)\n        next_action = get_action(next_position, next_velocity, evaluator)\n        steps += 1\n        target = reward + DISCOUNT * evaluator.value(next_position, next_velocity, next_action)\n        evaluator.learn(position, velocity, action, target)\n        position = next_position\n        velocity = next_velocity\n        action = next_action\n        if next_position == POSITION_MAX:\n            break\n        if steps >= STEP_LIMIT:\n            print('Step Limit Exceeded!')\n            break\n    
return steps\n\n# figure 12.10, effect of lambda and alpha on early performance of Sarsa(lambda)\ndef figure_12_10():\n    runs = 30\n    episodes = 50\n    alphas = np.arange(1, 8) / 4.0\n    lams = [0.99, 0.95, 0.5, 0]\n\n    steps = np.zeros((len(lams), len(alphas), runs, episodes))\n    for lamInd, lam in enumerate(lams):\n        for alphaInd, alpha in enumerate(alphas):\n            for run in tqdm(range(runs)):\n                evaluator = Sarsa(alpha, lam, replacing_trace)\n                for ep in range(episodes):\n                    step = play(evaluator)\n                    steps[lamInd, alphaInd, run, ep] = step\n\n    # average over episodes\n    steps = np.mean(steps, axis=3)\n\n    # average over runs\n    steps = np.mean(steps, axis=2)\n\n    for lamInd, lam in enumerate(lams):\n        plt.plot(alphas, steps[lamInd, :], label='lambda = %s' % (str(lam)))\n    plt.xlabel('alpha * # of tilings (8)')\n    plt.ylabel('averaged steps per episode')\n    plt.ylim([180, 300])\n    plt.legend()\n\n    plt.savefig('../images/figure_12_10.png')\n    plt.close()\n\n# figure 12.11, summary comparison of Sarsa(lambda) algorithms\n# I use 8 tilings rather than 10 tilings\ndef figure_12_11():\n    traceTypes = [dutch_trace, replacing_trace, replacing_trace_with_clearing, accumulating_trace]\n    alphas = np.arange(0.2, 2.2, 0.2)\n    episodes = 20\n    runs = 30\n    lam = 0.9\n    rewards = np.zeros((len(traceTypes), len(alphas), runs, episodes))\n\n    for traceInd, trace in enumerate(traceTypes):\n        for alphaInd, alpha in enumerate(alphas):\n            for run in tqdm(range(runs)):\n                evaluator = Sarsa(alpha, lam, trace)\n                for ep in range(episodes):\n                    if trace == accumulating_trace and alpha > 0.6:\n                        steps = STEP_LIMIT\n                    else:\n                        steps = play(evaluator)\n                    rewards[traceInd, alphaInd, run, ep] = -steps\n\n    # average
over episodes\n    rewards = np.mean(rewards, axis=3)\n\n    # average over runs\n    rewards = np.mean(rewards, axis=2)\n\n    for traceInd, trace in enumerate(traceTypes):\n        plt.plot(alphas, rewards[traceInd, :], label=trace.__name__)\n    plt.xlabel('alpha * # of tilings (8)')\n    plt.ylabel('averaged rewards per episode')\n    plt.ylim([-550, -150])\n    plt.legend()\n\n    plt.savefig('../images/figure_12_11.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_12_10()\n    figure_12_11()\n"
  },
  {
    "path": "chapter12/random_walk.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #\n# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\n\n# all states\nN_STATES = 19\n\n# all states but terminal states\nSTATES = np.arange(1, N_STATES + 1)\n\n# start from the middle state\nSTART_STATE = 10\n\n# two terminal states\n# an action leading to the left terminal state has reward -1\n# an action leading to the right terminal state has reward 1\nEND_STATES = [0, N_STATES + 1]\n\n# true state values from Bellman equation\nTRUE_VALUE = np.arange(-20, 22, 2) / 20.0\nTRUE_VALUE[0] = TRUE_VALUE[N_STATES + 1] = 0.0\n\n# base class for lambda-based algorithms in this chapter\n# In this example, we use the simplest linear feature function, state aggregation.\n# And we use exact 19 groups, so the weights for each group is exact the value for that state\nclass ValueFunction:\n    # @rate: lambda, as it's a keyword in python, so I call it rate\n    # @stepSize: alpha, step size for update\n    def __init__(self, rate, step_size):\n        self.rate = rate\n        self.step_size = step_size\n        self.weights = np.zeros(N_STATES + 2)\n\n    # the state value is just the weight\n    def value(self, state):\n        return self.weights[state]\n\n    # feed the algorithm with new observation\n    # derived class should override this function\n    def learn(self, state, reward):\n        return\n\n    # initialize some variables at the beginning of each episode\n    # must be called at 
the very beginning of each episode\n    # derived class should override this function\n    def new_episode(self):\n        return\n\n# Off-line lambda-return algorithm\nclass OffLineLambdaReturn(ValueFunction):\n    def __init__(self, rate, step_size):\n        ValueFunction.__init__(self, rate, step_size)\n        # To accelerate learning, truncate the lambda-return once the power of lambda falls below this value\n        self.rate_truncate = 1e-3\n\n    def new_episode(self):\n        # initialize the trajectory\n        self.trajectory = [START_STATE]\n        # only need to track the last reward in one episode, as all others are 0\n        self.reward = 0.0\n\n    def learn(self, state, reward):\n        # add the new state to the trajectory\n        self.trajectory.append(state)\n        if state in END_STATES:\n            # start off-line learning once the episode ends\n            self.reward = reward\n            self.T = len(self.trajectory) - 1\n            self.off_line_learn()\n\n    # get the n-step return from the given time\n    def n_step_return_from_time(self, n, time):\n        # gamma is always 1 and rewards are zero except for the last reward\n        # the formula can be simplified\n        end_time = min(time + n, self.T)\n        returns = self.value(self.trajectory[end_time])\n        if end_time == self.T:\n            returns += self.reward\n        return returns\n\n    # get the lambda-return from the given time\n    def lambda_return_from_time(self, time):\n        returns = 0.0\n        lambda_power = 1\n        for n in range(1, self.T - time):\n            returns += lambda_power * self.n_step_return_from_time(n, time)\n            lambda_power *= self.rate\n            if lambda_power < self.rate_truncate:\n                # If the power of lambda has become too small, discard all the remaining terms\n                break\n        returns *= 1 - self.rate\n        if lambda_power >= self.rate_truncate:\n            returns += lambda_power * self.reward\n
return returns\n\n    # perform off-line learning at the end of an episode\n    def off_line_learn(self):\n        for time in range(self.T):\n            # update for each state in the trajectory\n            state = self.trajectory[time]\n            delta = self.lambda_return_from_time(time) - self.value(state)\n            delta *= self.step_size\n            self.weights[state] += delta\n\n# TD(lambda) algorithm\nclass TemporalDifferenceLambda(ValueFunction):\n    def __init__(self, rate, step_size):\n        ValueFunction.__init__(self, rate, step_size)\n        self.new_episode()\n\n    def new_episode(self):\n        # initialize the eligibility trace\n        self.eligibility = np.zeros(N_STATES + 2)\n        # initialize the beginning state\n        self.last_state = START_STATE\n\n    def learn(self, state, reward):\n        # update the eligibility trace and weights\n        self.eligibility *= self.rate\n        self.eligibility[self.last_state] += 1\n        delta = reward + self.value(state) - self.value(self.last_state)\n        delta *= self.step_size\n        self.weights += delta * self.eligibility\n        self.last_state = state\n\n# True online TD(lambda) algorithm\nclass TrueOnlineTemporalDifferenceLambda(ValueFunction):\n    def __init__(self, rate, step_size):\n        ValueFunction.__init__(self, rate, step_size)\n\n    def new_episode(self):\n        # initialize the eligibility trace\n        self.eligibility = np.zeros(N_STATES + 2)\n        # initialize the beginning state\n        self.last_state = START_STATE\n        # initialize the old state value\n        self.old_state_value = 0.0\n\n    def learn(self, state, reward):\n        # update the eligibility trace and weights\n        last_state_value = self.value(self.last_state)\n        state_value = self.value(state)\n        dutch = 1 - self.step_size * self.rate * self.eligibility[self.last_state]\n        self.eligibility *= self.rate\n        self.eligibility[self.last_state] 
+= dutch\n        delta = reward + state_value - last_state_value\n        self.weights += self.step_size * (delta + last_state_value - self.old_state_value) * self.eligibility\n        self.weights[self.last_state] -= self.step_size * (last_state_value - self.old_state_value)\n        self.old_state_value = state_value\n        self.last_state = state\n\n# 19-state random walk\ndef random_walk(value_function):\n    value_function.new_episode()\n    state = START_STATE\n    while state not in END_STATES:\n        next_state = state + np.random.choice([-1, 1])\n        if next_state == 0:\n            reward = -1\n        elif next_state == N_STATES + 1:\n            reward = 1\n        else:\n            reward = 0\n        value_function.learn(next_state, reward)\n        state = next_state\n\n# general plot framework\n# @value_function_generator: generates an instance of a value function from (rate, step_size)\n# @runs: specify the number of independent runs\n# @lambdas: a series of different lambda values\n# @alphas: a sequence of step sizes for each lambda\ndef parameter_sweep(value_function_generator, runs, lambdas, alphas):\n    # play for 10 episodes for each run\n    episodes = 10\n    # track the rms errors\n    errors = [np.zeros(len(alphas_)) for alphas_ in alphas]\n    for run in tqdm(range(runs)):\n        for lambdaIndex, rate in enumerate(lambdas):\n            for alphaIndex, alpha in enumerate(alphas[lambdaIndex]):\n                valueFunction = value_function_generator(rate, alpha)\n                for episode in range(episodes):\n                    random_walk(valueFunction)\n                    stateValues = [valueFunction.value(state) for state in STATES]\n                    errors[lambdaIndex][alphaIndex] += np.sqrt(np.mean(np.power(stateValues - TRUE_VALUE[1: -1], 2)))\n\n    # average over runs and episodes\n    for error in errors:\n        error /= episodes * runs\n\n    for i in range(len(lambdas)):\n        plt.plot(alphas[i], errors[i], label='lambda = ' + 
str(lambdas[i]))\n    plt.xlabel('alpha')\n    plt.ylabel('RMS error')\n    plt.legend()\n\n# Figure 12.3: Off-line lambda-return algorithm\ndef figure_12_3():\n    lambdas = [0.0, 0.4, 0.8, 0.9, 0.95, 0.975, 0.99, 1]\n    alphas = [np.arange(0, 1.1, 0.1),\n              np.arange(0, 1.1, 0.1),\n              np.arange(0, 1.1, 0.1),\n              np.arange(0, 1.1, 0.1),\n              np.arange(0, 1.1, 0.1),\n              np.arange(0, 0.55, 0.05),\n              np.arange(0, 0.22, 0.02),\n              np.arange(0, 0.11, 0.01)]\n    parameter_sweep(OffLineLambdaReturn, 50, lambdas, alphas)\n\n    plt.savefig('../images/figure_12_3.png')\n    plt.close()\n\n# Figure 12.6: TD(lambda) algorithm\ndef figure_12_6():\n    lambdas = [0.0, 0.4, 0.8, 0.9, 0.95, 0.975, 0.99, 1]\n    alphas = [np.arange(0, 1.1, 0.1),\n              np.arange(0, 1.1, 0.1),\n              np.arange(0, 0.99, 0.09),\n              np.arange(0, 0.55, 0.05),\n              np.arange(0, 0.33, 0.03),\n              np.arange(0, 0.22, 0.02),\n              np.arange(0, 0.11, 0.01),\n              np.arange(0, 0.044, 0.004)]\n    parameter_sweep(TemporalDifferenceLambda, 50, lambdas, alphas)\n\n    plt.savefig('../images/figure_12_6.png')\n    plt.close()\n\n# Figure 12.8: True online TD(lambda) algorithm\ndef figure_12_8():\n    lambdas = [0.0, 0.4, 0.8, 0.9, 0.95, 0.975, 0.99, 1]\n    alphas = [np.arange(0, 1.1, 0.1),\n              np.arange(0, 1.1, 0.1),\n              np.arange(0, 1.1, 0.1),\n              np.arange(0, 1.1, 0.1),\n              np.arange(0, 1.1, 0.1),\n              np.arange(0, 0.88, 0.08),\n              np.arange(0, 0.44, 0.04),\n              np.arange(0, 0.11, 0.01)]\n    parameter_sweep(TrueOnlineTemporalDifferenceLambda, 50, lambdas, alphas)\n\n    plt.savefig('../images/figure_12_8.png')\n    plt.close()\n\nif __name__ == '__main__':\n    figure_12_3()\n    figure_12_6()\n    figure_12_8()\n"
  },
  {
    "path": "chapter13/short_corridor.py",
    "content": "#######################################################################\n# Copyright (C)                                                       #\n# 2018 Sergii Bondariev (sergeybondarev@gmail.com)                    #\n# 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)                  #\n# Permission given to modify the code as long as you keep this        #\n# declaration at the top                                              #\n#######################################################################\n\nimport numpy as np\nimport matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\n\ndef true_value(p):\n    \"\"\" True value of the first state\n    Args:\n        p (float): probability of the action 'right'.\n    Returns:\n        True value of the first state.\n        The expression is obtained by manually solving the easy linear system\n        of Bellman equations using known dynamics.\n    \"\"\"\n    return (2 * p - 4) / (p * (1 - p))\n\nclass ShortCorridor:\n    \"\"\"\n    Short corridor environment, see Example 13.1\n    \"\"\"\n    def __init__(self):\n        self.reset()\n\n    def reset(self):\n        self.state = 0\n\n    def step(self, go_right):\n        \"\"\"\n        Args:\n            go_right (bool): chosen action\n        Returns:\n            tuple of (reward, episode terminated?)\n        \"\"\"\n        if self.state == 0 or self.state == 2:\n            if go_right:\n                self.state += 1\n            else:\n                self.state = max(0, self.state - 1)\n        else:\n            if go_right:\n                self.state -= 1\n            else:\n                self.state += 1\n\n        if self.state == 3:\n            # terminal state\n            return 0, True\n        else:\n            return -1, False\n\ndef softmax(x):\n    t = np.exp(x - np.max(x))\n    return t / np.sum(t)\n\nclass ReinforceAgent:\n    \"\"\"\n    ReinforceAgent that follows algorithm\n    
'REINFORCE Monte-Carlo Policy-Gradient Control (episodic)'\n    \"\"\"\n    def __init__(self, alpha, gamma):\n        # set values such that initial conditions correspond to left-epsilon greedy\n        self.theta = np.array([-1.47, 1.47])\n        self.alpha = alpha\n        self.gamma = gamma\n        # first column - left, second - right\n        self.x = np.array([[0, 1],\n                           [1, 0]])\n        self.rewards = []\n        self.actions = []\n\n    def get_pi(self):\n        h = np.dot(self.theta, self.x)\n        t = np.exp(h - np.max(h))\n        pmf = t / np.sum(t)\n        # never become deterministic,\n        # guarantees episode finish\n        imin = np.argmin(pmf)\n        epsilon = 0.05\n\n        if pmf[imin] < epsilon:\n            pmf[:] = 1 - epsilon\n            pmf[imin] = epsilon\n\n        return pmf\n\n    def get_p_right(self):\n        return self.get_pi()[1]\n\n    def choose_action(self, reward):\n        if reward is not None:\n            self.rewards.append(reward)\n\n        pmf = self.get_pi()\n        go_right = np.random.uniform() <= pmf[1]\n        self.actions.append(go_right)\n\n        return go_right\n\n    def episode_end(self, last_reward):\n        self.rewards.append(last_reward)\n\n        # learn theta\n        G = np.zeros(len(self.rewards))\n        G[-1] = self.rewards[-1]\n\n        for i in range(2, len(G) + 1):\n            G[-i] = self.gamma * G[-i + 1] + self.rewards[-i]\n\n        gamma_pow = 1\n\n        for i in range(len(G)):\n            j = 1 if self.actions[i] else 0\n            pmf = self.get_pi()\n            grad_ln_pi = self.x[:, j] - np.dot(self.x, pmf)\n            update = self.alpha * gamma_pow * G[i] * grad_ln_pi\n\n            self.theta += update\n            gamma_pow *= self.gamma\n\n        self.rewards = []\n        self.actions = []\n\nclass ReinforceBaselineAgent(ReinforceAgent):\n    def __init__(self, alpha, gamma, alpha_w):\n        super(ReinforceBaselineAgent, 
self).__init__(alpha, gamma)\n        self.alpha_w = alpha_w\n        self.w = 0\n\n    def episode_end(self, last_reward):\n        self.rewards.append(last_reward)\n\n        # learn theta\n        G = np.zeros(len(self.rewards))\n        G[-1] = self.rewards[-1]\n\n        for i in range(2, len(G) + 1):\n            G[-i] = self.gamma * G[-i + 1] + self.rewards[-i]\n\n        gamma_pow = 1\n\n        for i in range(len(G)):\n            self.w += self.alpha_w * gamma_pow * (G[i] - self.w)\n\n            j = 1 if self.actions[i] else 0\n            pmf = self.get_pi()\n            grad_ln_pi = self.x[:, j] - np.dot(self.x, pmf)\n            update = self.alpha * gamma_pow * (G[i] - self.w) * grad_ln_pi\n\n            self.theta += update\n            gamma_pow *= self.gamma\n\n        self.rewards = []\n        self.actions = []\n\ndef trial(num_episodes, agent_generator):\n    env = ShortCorridor()\n    agent = agent_generator()\n\n    rewards = np.zeros(num_episodes)\n    for episode_idx in range(num_episodes):\n        rewards_sum = 0\n        reward = None\n        env.reset()\n\n        while True:\n            go_right = agent.choose_action(reward)\n            reward, episode_end = env.step(go_right)\n            rewards_sum += reward\n\n            if episode_end:\n                agent.episode_end(reward)\n                break\n\n        rewards[episode_idx] = rewards_sum\n\n    return rewards\n\ndef example_13_1():\n    epsilon = 0.05\n    fig, ax = plt.subplots(1, 1)\n\n    # Plot a graph\n    p = np.linspace(0.01, 0.99, 100)\n    y = true_value(p)\n    ax.plot(p, y, color='red')\n\n    # Find a maximum point, can also be done analytically by taking a derivative\n    imax = np.argmax(y)\n    pmax = p[imax]\n    ymax = y[imax]\n    ax.plot(pmax, ymax, color='green', marker=\"*\", label=\"optimal point: f({0:.2f}) = {1:.2f}\".format(pmax, ymax))\n\n    # Plot points of two epsilon-greedy policies\n    ax.plot(epsilon, true_value(epsilon), 
color='magenta', marker=\"o\", label=\"epsilon-greedy left\")\n    ax.plot(1 - epsilon, true_value(1 - epsilon), color='blue', marker=\"o\", label=\"epsilon-greedy right\")\n\n    ax.set_ylabel(\"Value of the first state\")\n    ax.set_xlabel(\"Probability of the action 'right'\")\n    ax.set_title(\"Short corridor with switched actions\")\n    ax.set_ylim(bottom=-105.0, top=5)\n    ax.legend()\n\n    plt.savefig('../images/example_13_1.png')\n    plt.close()\n\ndef figure_13_1():\n    num_trials = 100\n    num_episodes = 1000\n    gamma = 1\n    agent_generators = [lambda : ReinforceAgent(alpha=2e-4, gamma=gamma),\n                        lambda : ReinforceAgent(alpha=2e-5, gamma=gamma),\n                        lambda : ReinforceAgent(alpha=2e-3, gamma=gamma)]\n    labels = ['alpha = 2e-4',\n              'alpha = 2e-5',\n              'alpha = 2e-3']\n\n    rewards = np.zeros((len(agent_generators), num_trials, num_episodes))\n\n    for agent_index, agent_generator in enumerate(agent_generators):\n        for i in tqdm(range(num_trials)):\n            reward = trial(num_episodes, agent_generator)\n            rewards[agent_index, i, :] = reward\n\n    plt.plot(np.arange(num_episodes) + 1, -11.6 * np.ones(num_episodes), ls='dashed', color='red', label='-11.6')\n    for i, label in enumerate(labels):\n        plt.plot(np.arange(num_episodes) + 1, rewards[i].mean(axis=0), label=label)\n    plt.ylabel('total reward on episode')\n    plt.xlabel('episode')\n    plt.legend(loc='lower right')\n\n    plt.savefig('../images/figure_13_1.png')\n    plt.close()\n\ndef figure_13_2():\n    num_trials = 100\n    num_episodes = 1000\n    alpha = 2e-4\n    gamma = 1\n    agent_generators = [lambda : ReinforceAgent(alpha=alpha, gamma=gamma),\n                        lambda : ReinforceBaselineAgent(alpha=alpha*10, gamma=gamma, alpha_w=alpha*100)]\n    labels = ['Reinforce without baseline',\n              'Reinforce with baseline']\n\n    rewards = np.zeros((len(agent_generators), 
num_trials, num_episodes))\n\n    for agent_index, agent_generator in enumerate(agent_generators):\n        for i in tqdm(range(num_trials)):\n            reward = trial(num_episodes, agent_generator)\n            rewards[agent_index, i, :] = reward\n\n    plt.plot(np.arange(num_episodes) + 1, -11.6 * np.ones(num_episodes), ls='dashed', color='red', label='-11.6')\n    for i, label in enumerate(labels):\n        plt.plot(np.arange(num_episodes) + 1, rewards[i].mean(axis=0), label=label)\n    plt.ylabel('total reward on episode')\n    plt.xlabel('episode')\n    plt.legend(loc='lower right')\n\n    plt.savefig('../images/figure_13_2.png')\n    plt.close()\n\nif __name__ == '__main__':\n    example_13_1()\n    figure_13_1()\n    figure_13_2()\n"
  },
  {
    "path": "requirements.txt",
    "content": "numpy\nmatplotlib\nseaborn\ntqdm\nscipy\n"
  }
]