[
  {
    "path": ".github/ISSUE_TEMPLATE/bug-in-code-solution.md",
    "content": "---\nname: Bug in code solution\nabout: You have found a mistake in one of the code solutions, please fill in this issue\n  report\n\n---\n\n<!--- Provide a general summary of the issue in the Title above -->\n\n## Expected Behavior\n<!--- Tell us what should happen -->\n\n## Current Behavior\n<!--- Tell us what happens instead of the expected behavior -->\n\n## Steps to Reproduce\n<!--- Provide an unambiguous set of steps to reproduce this bug.-->\n<!--- Include code to reproduce, if relevant -->\n1.\n2.\n3.\n4.\n\n## Possible Solution (Optional)\n<!--- Not obligatory, but suggest an idea for solving the issue -->\n\n## Detailed Description (Optional)\n<!--- Provide a more detailed description of the solution you are proposing.-->\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/missing-exercise-or-outdated-statement.md",
    "content": "---\nname: Missing Exercise or Outdated Statement\nabout: 'As the book continues to be updated, there can be missing exercises or outdated\n  statements.'\n\n---\n\n## Exercises missing\nExercise <exercise_number> is missing.\n\n## Obviously...\nCheck [the book online](http://incompleteideas.net/book/the-book-2nd.html) for the exercises and add statements and solutions.\n"
  },
  {
    "path": ".gitignore",
    "content": ".DS_Store\n__pycache__\n"
  },
  {
    "path": "Chapter 1/Exercise 1.1.md",
    "content": "# Exercise 1.1: Self-Play\n\n## Question:\nSuppose, instead of playing against a random opponent, the reinforcement learning algorithm described above\nplayed against itself. What do you think would happen in this case? Would it learn a different way of playing?\n\n## Answer:\nThe algorithm will continue to adapt until it reaches an equilibrium, which may be either fixed (always making\nthe same moves) or cyclical. It is possible to reach a higher skill level than against a fixed opponent,\nbut it may also miss paths toward stronger overall play because it is only ever reacting to itself.\n"
  },
  {
    "path": "Chapter 1/Exercise 1.2.md",
    "content": "# Exercise 1.2: Symmetries\n\n## Question:\nMany tic-tac-toe positions appear different but are really the same because of symmetries.\nHow might we amend the reinforcement learning algorithm described above to take advantage of this?\nIn what ways would this improve it? Now think again. Suppose the opponent did not take advantage of symmetries.\nIn that case, should we?\nIs it true, then, that symmetrically equivalent positions should necessarily have the same value?\n\n## Answer:\nFor tic-tac-toe it is possible to use the 4 axes of symmetry to essentially fold the board down to a quarter of its size.\nThis would dramatically increase the learning speed and reduce the memory required.\nIf the opponent did not take advantage of symmetries, then exploiting them could result in worse overall performance. For example,\nif the opponent always played correctly except in one corner, then using symmetries would mean you never take advantage\nof that information. This means symmetrically equivalent positions don't always hold the same value in a\nmulti-player game.\n"
  },
  {
    "path": "Chapter 1/Exercise 1.3.md",
    "content": "# Exercise 1.3: Greedy Play\n\n## Question\nSuppose the reinforcement learning player was greedy, that is, it always played the\nmove that brought it to the position that it rated the best. Would it learn to play better, or worse,\nthan a non-greedy player? What problems might occur?\n\n## Answer:\nIn general it would play worse. The chance that the first action to return a positive reward is also the\ncorrect action in the long run is pretty slim, particularly if there are a large number of actions available.\nIt would also be unable to adapt to opponents that slowly altered their behaviour over time.\n"
  },
  {
    "path": "Chapter 1/Exercise 1.4.md",
    "content": "# Exercise 1.4: Learning from Exploration\n\n## Question\nSuppose learning updates occurred after all moves, including exploratory\nmoves. If the step-size parameter is appropriately reduced over time, then the state values would converge to a set\nof probabilities. What are the two sets of probabilities computed when we do, and when we do not, learn from\nexploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be\nbetter to learn? Which would result in more wins?\n\n## Answer:\nWith the step-size parameter appropriately reduced, and assuming the exploration rate is fixed, the probability set\nwith no learning from exploration is the value of each state given that the optimal action is taken from then on, whereas\nwith learning from exploration it is the expected value of each state under the active exploration policy.\nThe former is better to learn, as it reduces variance from sub-optimal future states (e.g. if you can win a\ngame of chess in one move, but if you perform another move your opponent wins, that doesn't make it a bad state).\nThe former would also result in more wins, all other things being equal.\n"
  },
  {
    "path": "Chapter 1/Exercise 1.5.md",
    "content": "# Exercise 1.5: Other Improvements\n\n## Question:\nCan you think of other ways to improve the reinforcement learning player?\nCan you think of any better way to solve the tic-tac-toe problem as posed?\n\n## Answer:\nIf the opponent was adapting over time, decaying the old updates could speed up the improvement.\nAnother option is altering the exploration rate and learning rate based on the variance in the opponent's actions. If the opponent is\nalways making the same moves and you are winning as a result, then ε-greedy with 10% exploration is just going to\nlose you games. Even though tic-tac-toe is a solvable game, playing the game-theoretically optimal strategy may not yield\nthe highest reward against a sub-optimal opponent.\n"
  },
  {
    "path": "Chapter 2/Exercise 2.1.md",
    "content": "# Exercise 2.1\n\n## Question:\nIn ε-greedy action selection, for the case of two actions and ε = 0.5,\nwhat is the probability that the greedy action is selected?\n\n## Answer:\n0.5 + 0.5 \\* 0.5 = 0.75\n\nHalf of the time the greedy action is selected deliberately (because it is the best choice), and in the other\nhalf, when an action is selected randomly, the greedy action is picked by chance half of the time.\n"
  },
  {
    "path": "Chapter 2/Exercise 2.2.md",
    "content": "# Exercise 2.2\n\n## Question:\n_Bandit example_ Consider a k-armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4.\nConsider applying to this problem a bandit algorithm using ε-greedy action selection,\nsample average action-value estimates, and initial estimates of Q<sub>1</sub>(a) = 0, for all a. Suppose\nthe initial sequence of actions and rewards is A<sub>1</sub> = 1, R<sub>1</sub> = -1, A<sub>2</sub> = 2, R<sub>2</sub> = 1, A<sub>3</sub> = 2, R<sub>3</sub> = -2,\nA<sub>4</sub> = 2, R<sub>4</sub> = 2, A<sub>5</sub> = 3, R<sub>5</sub> = 0. On some of these time steps the ε case may have occurred,\ncausing an action to be selected at random. On which time steps did this definitely occur? On which\ntime steps could this possibly have occurred?\n\n## Answer:\nLet's build a table of Q_t(a) for each time step t:\n\n|      |a=1      |a=2     |a=3     |a=4     |\n|:----:|:-------:|:------:|:------:|:------:|\n|t=1   |0.00     |0.00    |0.00    |0.00    |\n|t=2   |-1.00    |0.00    |0.00    |0.00    |\n|t=3   |-1.00    |1.00    |0.00    |0.00    |\n|t=4   |-1.00    |-0.50   |0.00    |0.00    |\n|t=5   |-1.00    |0.33    |0.00    |0.00    |\n\n- A_1 = 1: random selection or greedy selection (all estimates are tied).\n- A_2 = 2: random selection or greedy selection (a=2, a=3 and a=4 are tied for greedy).\n- A_3 = 2: random selection or greedy selection (a=2 is the greedy action).\n- A_4 = 2: the ε case definitely occurred (exploration), since Q_4(2) = -0.50 while Q_4(3) = Q_4(4) = 0.00.\n- A_5 = 3: the ε case definitely occurred (exploration), since Q_5(2) = 0.33 is the highest estimate.\n"
  },
  {
    "path": "Chapter 2/Exercise 2.3.md",
    "content": "# Exercise 2.3\n\n## Question:\nIn the comparison shown in Figure 2.2, which method will perform best in the long run in\nterms of cumulative reward and cumulative probability of selecting the best action? How much better will it be?\n\n## Answer:\nIn the long run ε=0.01 will perform best: it will select the best action 99.1% (0.99 + 0.01\\*0.1) of the time,\nversus ε=0.1, which will select it 91% (0.9 + 0.1\\*0.1) of the time.\n"
  },
  {
    "path": "Chapter 2/Exercise 2.4.md",
    "content": "# Exercise 2.4\n\n## Question:\nIf the step-size parameters, α<sub>n</sub>, are not constant, then the estimate\nQ<sub>n</sub> is a weighted average of previously received rewards with a weighting different\nfrom that given by (2.6). What is the weighting on each prior reward for the general\ncase, analogous to (2.6), in terms of the sequence of step-size parameters?\n\n## Answer:\nThe equation Q<sub>(n+1)</sub> = (1-α)<sup>n</sup>\\*Q<sub>1</sub> + Σ<sup>n</sup><sub>(i=1)</sub>[α\\*(1-α)<sup>(n-i)</sup>\\*R<sub>i</sub>] would become\nQ<sub>(n+1)</sub> = Π<sup>n</sup><sub>(i=1)</sub>[1-α<sub>i</sub>]\\*Q<sub>1</sub> + Σ<sup>n</sup><sub>(i=1)</sub>[α<sub>i</sub>\\*Π<sup>n</sup><sub>(j=i+1)</sub>[1-α<sub>j</sub>]\\*R<sub>i</sub>]\n\nEssentially, the weight on R<sub>i</sub> is its own step size α<sub>i</sub> shrunk by every (1-α<sub>j</sub>) factor applied after step i,\nso you need to keep track of the α's used and update the products accordingly.\n"
  },
  {
    "path": "Chapter 2/Exercise 2.5.md",
    "content": "# Exercise 2.5\n\n## Question:\nDesign and conduct an experiment to demonstrate the difficulties that\nsample-average methods have for non-stationary problems. Use a modified\nversion of the 10-armed testbed in which all the q\\*(a) start out equal and\nthen take independent random walks (say by adding a normally distributed\nincrement with mean zero and standard deviation 0.01 to all the q\\*(a)\non each step). Prepare plots like Figure 2.2 for an\naction-value method using sample averages, incrementally computed,\nand another action-value method using a constant step-size\nparameter, α = 0.1. Use ε = 0.1 and longer runs, say of 10,000 steps.\n\n## Answer:\nRun code/Exercise 2.5.py\n\n![Average Reward](images/average_reward.png)\n\n![Action Optimality](images/action_optimality.png)\n"
  },
  {
    "path": "Chapter 2/Exercise 2.6.md",
    "content": "# Exercise 2.6\n\n## Question:\nThe results shown in Figure 2.3 should be quite reliable because they\nare averages over 2000 individual, randomly chosen 10-armed bandit\ntasks. Why, then, are there oscillations and spikes in the early part\nof the curve for the optimistic method? In other words, what might\nmake this method perform particularly better or worse, on average,\non particular early steps?\n\n## Answer:\nThe oscillations in the early stage are likely caused by the\nalgorithm reducing the Q values of the poorly performing options.\nTo start with, it believes all of the bad options are good\nand has to try each one multiple times before it realises they're actually\nbad. How long this takes depends on how quickly the bad options drop below the\nbest option (once that occurs, only exploration will reduce them to their\ncorrect values over time, at a much slower rate).\n"
  },
  {
    "path": "Chapter 2/Exercise 2.7.md",
    "content": "# Exercise 2.7\n\n## Question:\nShow that in the case of two actions, the soft-max distribution is the same as\nthat given by the logistic, or sigmoid, function often used in statistics and\nartificial neural networks.\n\n## Answer:\n\n![Exercise 2.7 solution](images/Exercise_2_7.png)\n\n"
  },
  {
    "path": "Chapter 2/Exercise 2.8.md",
    "content": "# Exercise 2.8\n\n## Question:\nSuppose you face a 2-armed bandit task whose true action values change randomly from time step\nto time step. Specifically, suppose that, for any time step, the true values of actions 1 and 2\nare respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5\n(case B). If you are not able to tell which case you face at any step, what is the best expectation\nof success you can achieve and how should you behave to achieve it? Now suppose that on each step\nyou are told whether you are facing case A or case B (although you still don't know the true action\nvalues). This is an associative search task. What is the best expectation of success you can\nachieve in this task, and how should you behave to achieve it?\n\n## Answer:\n\nIn the first scenario, you cannot hold individual estimates for cases A and B. Therefore,\nthe best approach is to select the action with the best value estimate over both cases combined. In this case,\nthe estimates of both actions are the same, so the best expectation of success is 0.5 and it can be\nachieved by selecting either action (for example, randomly) at each step.\n\nA<sub>1</sub> = 0.5 \\* 0.1 + 0.5 \\* 0.9 = 0.5\n\nA<sub>2</sub> = 0.5 \\* 0.2 + 0.5 \\* 0.8 = 0.5\n\nIn the second scenario, you can hold independent estimates for cases A and B, so we can learn\nthe best action for each one, treating them as independent bandit problems. The best expectation\nof success is 0.55, obtained by selecting A<sub>2</sub> in case A and A<sub>1</sub> in case B.\n\n0.5 \\* 0.2 + 0.5 \\* 0.9 = 0.55\n"
  },
  {
    "path": "Chapter 2/Exercise 2.9.md",
    "content": "# Exercise 2.9\n\n## Question:\n(programming) Make a figure analogous to Figure 2.6 for the non-stationary case outlined\nin Exercise 2.5. Include the constant step-size ε-greedy algorithm with α = 0.1. Use runs\nof 200,000 steps and, as performance measure for each algorithm and parameter setting, use\nthe average reward over the last 100,000 steps.\n\n## Answer:\nRun code/Exercise 2.9.py\n\n![Average Reward](images/average_reward_per_parameter_conf.png)\n"
  },
  {
    "path": "Chapter 2/code/Exercise 2.5.py",
    "content": "import numpy as np\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\nfrom estimators import SampleAverageEstimator, WeightedEstimator\nfrom testbed import K_armed_testbed\n\nnp.random.seed(250)\n\n\ndef plot_performance(estimator_names, rewards, action_optimality):\n    for i, estimator_name in enumerate(estimator_names):\n        average_run_rewards = np.average(rewards[i], axis=0)\n        plt.plot(average_run_rewards, label=estimator_name)\n\n    plt.legend()\n    plt.xlabel(\"Steps\")\n    plt.ylabel(\"Average reward\")\n    plt.show()\n\n    for i, estimator_name in enumerate(estimator_names):\n        average_run_optimality = np.average(action_optimality[i], axis=0)\n        plt.plot(average_run_optimality, label=estimator_name)\n    plt.legend()\n    plt.xlabel(\"Steps\")\n    plt.ylabel(\"% Optimal action\")\n    plt.show()\n\n\nif __name__ == \"__main__\":\n    K = 10\n    N_STEPS = 10000\n    N_RUNS = 2000\n    N_ESTIMATORS = 2\n\n    rewards = np.full((N_ESTIMATORS, N_RUNS, N_STEPS), fill_value=0.)\n    optimal_selections = np.full((N_ESTIMATORS, N_RUNS, N_STEPS), fill_value=0.)\n\n    for run_i in tqdm(range(N_RUNS)):\n\n        testbed = K_armed_testbed(k_actions=K)\n\n        action_value_estimates = np.full(K, fill_value=0.0)\n        sample_average_estimator = SampleAverageEstimator(action_value_estimates.copy(), epsilon=0.1)\n        weighted_estimator = WeightedEstimator(action_value_estimates.copy(), epsilon=0.1, alpha=0.1)\n\n        estimators = [sample_average_estimator, weighted_estimator]\n\n        for step_i in range(N_STEPS):\n            for estimator_i, estimator in enumerate(estimators):\n                action_selected = estimator.select_action()\n                is_optimal = testbed.is_optimal_action(action_selected)\n                reward = testbed.sample_action(action_selected)\n                estimator.update_estimates(action_selected, reward)\n\n                rewards[estimator_i][run_i][step_i] = reward\n                optimal_selections[estimator_i][run_i][step_i] = is_optimal\n\n            testbed.random_walk_action_values()\n\n    plot_performance([\"Ɛ=0.1\", \"Ɛ=0.1 α=0.1\"], np.array(rewards), np.array(optimal_selections))\n"
  },
  {
    "path": "Chapter 2/code/Exercise 2.9.py",
    "content": "import numpy as np\nimport matplotlib.pyplot as plt\nfrom tqdm import tqdm\nfrom testbed import K_armed_testbed\nfrom estimators import SampleAverageEstimator, WeightedEstimator, GradientBandit, UCBEstimator\n\nnp.random.seed(250)\n\n\ndef plot_performance_of_parameter_settings(parameter_settings, estimator_names, performance_results):\n\n    for estimator_i, estimator_results in enumerate(performance_results):\n        average_parameter_results = []\n        for parameter_setting_results in estimator_results:\n            average_run_results = np.average(parameter_setting_results, axis=0)\n            average_step_rewards = np.average(average_run_results)\n            average_parameter_results.append(average_step_rewards)\n\n        plt.plot(parameter_settings, average_parameter_results, label=estimator_names[estimator_i])\n\n    plt.legend()\n    plt.xlabel(\"ε, α, c, Q0\")\n    plt.xscale(\"log\", basex=2)\n    plt.ylabel(\"Average reward over last %d steps\" % AVERAGE_OVER_LAST_N_STEPS)\n    plt.show()\n\n\nif __name__ == \"__main__\":\n    with open(\"runs.csv\", \"w+\") as csvfile:\n        csvfile.write(\"algorithm_id,parameter_setting,run_i\\n\")\n    K = 10\n    N_STEPS = 200000\n    N_RUNS = 10\n    N_ESTIMATORS = 4\n    AVERAGE_OVER_LAST_N_STEPS = 100000\n\n    starting_index = N_STEPS - AVERAGE_OVER_LAST_N_STEPS\n\n    parameter_settings = [1.0/128, 1.0/64, 1.0/32, 1.0/16, 1.0/8, 1.0/4, 1.0/2, 1.0, 2.0, 4.0]\n\n    rewards = np.full((N_ESTIMATORS, len(parameter_settings), N_RUNS, AVERAGE_OVER_LAST_N_STEPS), fill_value=0.)\n\n    for parameter_setting_i, parameter_setting in tqdm(enumerate(parameter_settings), total=len(parameter_settings)):\n        for run_i in tqdm(range(N_RUNS)):\n\n            testbed = K_armed_testbed(k_actions=K)\n\n            action_value_estimates = np.full(K, fill_value=0.0)\n            sample_average_estimator = SampleAverageEstimator(action_value_estimates.copy(), epsilon=parameter_setting)\n            weighted_estimator = WeightedEstimator(action_value_estimates.copy(), epsilon=0.1, alpha=parameter_setting)\n            ucb = UCBEstimator(action_value_estimates.copy(), epsilon=0.1, alpha=0.1, c=parameter_setting)\n            gradient_bandit = GradientBandit(action_value_estimates.copy(), alpha=parameter_setting)\n\n            estimators = [sample_average_estimator, weighted_estimator, ucb, gradient_bandit]\n\n            for step_i in tqdm(range(N_STEPS)):\n                for estimator_i, estimator in enumerate(estimators):\n                    action_selected = estimator.select_action()\n                    reward = testbed.sample_action(action_selected)\n                    estimator.update_estimates(action_selected, reward)\n\n                    if step_i >= starting_index:\n                        rewards[estimator_i][parameter_setting_i][run_i][step_i - starting_index] = reward\n\n                testbed.random_walk_action_values()\n\n    estimator_names = [\"Sample Average Estimator\", \"Constant Step-size Estimator\", \"UCB\", \"Gradient Bandit\"]\n    plot_performance_of_parameter_settings(parameter_settings, estimator_names, rewards)\n"
  },
  {
    "path": "Chapter 2/code/estimators.py",
    "content": "import numpy as np\n\n\nclass Estimator(object):\n    def __init__(self, action_value_initial_estimates):\n        self.action_value_estimates = action_value_initial_estimates\n        self.k_actions = len(action_value_initial_estimates)\n        self.action_selected_count = np.full(self.k_actions, fill_value=0, dtype=\"int64\")\n\n    def select_action(self):\n        raise NotImplementedError(\"Need to implement a method to select actions\")\n\n    def update_estimates(self, action_selected, r):\n        raise NotImplementedError(\"Need to implement a method to update action value estimates\")\n\n    def select_greedy_action(self):\n        return np.argmax(self.action_value_estimates)\n\n    def select_action_randomly(self):\n        return np.random.choice(self.k_actions)\n\n\nclass SampleAverageEstimator(Estimator):\n    def __init__(self, action_value_initial_estimates, epsilon):\n        super(SampleAverageEstimator, self).__init__(action_value_initial_estimates)\n        self.epsilon = epsilon\n\n    def update_estimates(self, action_selected, r):\n        self.action_selected_count[action_selected] += 1\n\n        qn = self.action_value_estimates[action_selected]\n        n = self.action_selected_count[action_selected]\n\n        self.action_value_estimates[action_selected] = qn + (1.0 / n) * (r - qn)\n\n    def select_action(self):\n        probability = np.random.rand()\n        if probability >= self.epsilon:\n            return self.select_greedy_action()\n\n        return self.select_action_randomly()\n\n\nclass WeightedEstimator(SampleAverageEstimator):\n    def __init__(self, action_value_initial_estimates, epsilon=0, alpha=0.5):\n        super(WeightedEstimator, self).__init__(action_value_initial_estimates, epsilon)\n        self.alpha = alpha\n\n    def update_estimates(self, action_selected, r):\n        # Keep visit counts up to date so subclasses (e.g. UCB) can use them\n        self.action_selected_count[action_selected] += 1\n\n        qn = self.action_value_estimates[action_selected]\n\n        self.action_value_estimates[action_selected] = qn + self.alpha * (r - qn)\n\n\nclass UCBEstimator(WeightedEstimator):\n    def __init__(self, action_value_initial_estimates, epsilon=0, alpha=0.5, c=2):\n        super(UCBEstimator, self).__init__(action_value_initial_estimates, epsilon, alpha)\n        self.c = c\n        self.t = 0\n\n    def select_action(self):\n        self.t += 1\n        probability = np.random.rand()\n        if probability >= self.epsilon:\n            return self.select_greedy_action()\n\n        return self.select_ucb_action()\n\n    def calculate_action_potential(self, action_i):\n        q_t = self.action_value_estimates[action_i]\n        ln_t = np.log(self.t)\n        n_t = self.action_selected_count[action_i]\n\n        return q_t + self.c * np.sqrt(ln_t / n_t)\n\n    def select_ucb_action(self):\n        # Actions never selected so far are considered maximizing actions\n        if 0 in self.action_selected_count:\n            actions_never_selected = [action_i for action_i in range(self.k_actions)\n                                      if self.action_selected_count[action_i] == 0]\n            return np.random.choice(actions_never_selected)\n\n        action_potential = [self.calculate_action_potential(action_i) for action_i in range(self.k_actions)]\n        action_potential[self.select_greedy_action()] = -np.inf  # this branch explores, so exclude the greedy action\n\n        return np.argmax(action_potential)\n\n\nclass GradientBandit(Estimator):\n    def __init__(self, action_value_initial_estimates, alpha):\n        super(GradientBandit, self).__init__(action_value_initial_estimates)\n        self.average_reward = 0\n        self.numerical_preference = np.full(self.k_actions, fill_value=0.)\n        self.alpha = alpha\n\n    def update_average_reward(self, r):\n        qn = self.average_reward\n        self.average_reward = qn + self.alpha * (r - qn)\n\n    def update_estimates(self, action_selected, r):\n        self.update_average_reward(r)\n\n        P = self.get_actions_probabilities()\n        baseline = self.average_reward\n\n        ht = self.numerical_preference\n        htp1 = ht - self.alpha * (r - baseline) * P\n        htp1[action_selected] = ht[action_selected] + self.alpha * (r - baseline) * (1 - P[action_selected])\n\n        self.numerical_preference = htp1\n\n    def get_actions_probabilities(self):\n        exp_numerical_preference = np.exp(self.numerical_preference)\n        return exp_numerical_preference / np.sum(exp_numerical_preference)\n\n    def select_action(self):\n        return np.random.choice(a=self.k_actions, p=self.get_actions_probabilities())\n"
  },
  {
    "path": "Chapter 2/code/testbed.py",
    "content": "import numpy as np\nfrom numpy.random import normal as GaussianDistribution\n\n\nclass K_armed_testbed():\n    # Q*-values for each one of the k possible actions start out equal\n    # and then take independent random walks\n\n    def __init__(self, k_actions):\n        self.k = k_actions\n        # self.action_values = GaussianDistribution(loc=0, scale=1, size=self.k)\n        self.action_values = np.full(self.k, fill_value=0.0)\n\n    def random_walk_action_values(self):\n        increment = GaussianDistribution(loc=0, scale=0.01, size=self.k)\n        self.action_values += increment\n\n    def sample_action(self, action_i):\n        return GaussianDistribution(loc=self.action_values[action_i], scale=1, size=1)[0]\n\n    def get_optimal_action(self):\n        return np.argmax(self.action_values)\n\n    def get_optimal_action_value(self):\n        return self.action_values[self.get_optimal_action()]\n\n    def is_optimal_action(self, action_i):\n        return float(self.get_optimal_action_value() == self.action_values[action_i])\n\n    def __str__(self):\n        return \"\\t\".join([\"A%d: %.2f\" % (action_i, self.action_values[action_i]) for action_i in range(self.k)])\n"
  },
  {
    "path": "Chapter 2/tex_files/exercise2.7.tex",
    "content": "\\documentclass[12pt]{article}\n\n\\usepackage[margin=1in]{geometry}\n\n\\begin{document}\n\\thispagestyle{empty}\n\n\\noindent Using the definition of the \\textbf{sigmoid function} we would have:\n\n$$Pr\\{A_t = a\\} = \\frac{e^{H_t(a)}}{e^{H_t(a)} + 1}$$\n$$Pr\\{A_t = b\\} = 1 - Pr\\{A_t = a\\} = \\frac{1}{e^{H_t(a)} + 1}$$\n\n\\noindent Extending the definition of the \\textbf{soft-max distribution} for the case of two actions (k=2) we get the following:\n\n$$Pr\\{A_t=a\\} = \\frac{e^{H_t(a)}}{e^{H_t(a)} + e^{H_t(b)}}$$\n$$Pr\\{A_t=b\\} = \\frac{e^{H_t(b)}}{e^{H_t(a)} + e^{H_t(b)}}$$\n\n\\noindent According to the definition of the \\textit{numerical preference}, if we subtract the same amount from each of the preferences\nit does not affect the probabilities. So we can redefine $H_t(a)$ and $H_t(b)$ as:\n\n$$H_t(b) \\leftarrow  H_t(b) - H_t(b) = 0$$\n$$H_t(a) \\leftarrow  H_t(a) - H_t(b) = H_t(a)$$\n\n\\noindent and we get:\n\n$$Pr\\{A_t=a\\} = \\frac{e^{H_t(a)}}{e^{H_t(a)} + e^0} = \\frac{e^{H_t(a)}}{e^{H_t(a)} + 1} $$\n$$Pr\\{A_t=b\\} = \\frac{e^{0}}{e^{H_t(a)} + e^{0}} = \\frac{1}{e^{H_t(a)} + 1}$$\n\n\\end{document}\n"
  },
  {
    "path": "Chapter 3/Chapter 3 Exercises.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.1 \\n\",\n    \"  \\n\",\n    \"## Question:\\n\",\n    \"Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions,\\n\",\n    \"and rewards. Make the three examples as _different_ from each other as possible. The framework is abstract and\\n\",\n    \"flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.\\n\",\n    \"\\n\",\n    \"## Answer:\\n\",\n    \"Example 1: Hairdresser agent\\n\",\n    \"\\n\",\n    \"- States: the state of the hair and the desired cut of the client. The state of the hair and desired cut could be\\n\",\n    \"encoded as arrays of the length of the hair in a set of predetermined areas the head is divided in.\\n\",\n    \"- Actions: using the scissors or the clipper (and what accessory) and the area to be cut.\\n\",\n    \"- Rewards: negative rewards for the client complaints or imprecisions in the cut and positive rewards for tips.\\n\",\n    \"\\n\",\n    \"Example 2: DJ agent\\n\",\n    \"\\n\",\n    \"- States: a measure of how much people are dancing and singing to the song being played and the song currently playing.\\n\",\n    \"- Actions: given a setlist of 5000 songs, selecting the next song to be played (or a combination of them) and a type of\\n\",\n    \"transition between the songs.\\n\",\n    \"- Rewards: negative rewards given by people leaving the club faster than expected, positive rewards given by the level\\n\",\n    \"of danciness/_singiness_ in the club.\\n\",\n    \"\\n\",\n    \"Example 3: Texas Hold'em Poker player agent\\n\",\n    \"\\n\",\n    \"- States: The two cards in its hand and the cards showing in the table.\\n\",\n    \"- Actions: Check, call, raise, or fold.\\n\",\n    \"- Rewards: The money obtained or lost after playing one hand.\\n\"\n   ]\n  },\n  {\n   
\"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.2 \\n\",\n    \"  \\n\",\n    \"## Question:\\n\",\n    \"Is the MDP framework adequate to usefully represent *all* goal-directed learning tasks?\\n\",\n    \"Can you think of any clear exceptions?\\n\",\n    \"\\n\",\n    \"## Answer:\\n\",\n    \"We can try to think about it in terms of the limitations a finite MDP imposes in a problem definition and whether any\\n\",\n    \"approximations could exist within the framework.\\n\",\n    \"\\n\",\n    \"1) The Markov property, if not only the previous state and action selected influence which will be the next state. An\\n\",\n    \"example of this could be a modified game of chess where the order in which the pieces were moved affects the possible\\n\",\n    \"moves. This could be encoded within the MDP, an state would be composed of the position of the pieces and when\\n\",\n    \"they have been moved (skyrocketing the amount of available states making it a more difficult learning environment).\\n\",\n    \"If the sequence information wasn't available (or only partially available) at the time of making a decision,\\n\",\n    \"information from the past that can affect the next state would be missing, breaking the Markov property.\\n\",\n    \"\\n\",\n    \"2) The action and state sets must be finite. Any problem with infinite available actions or states would need\\n\",\n    \"alternative representations, such as grouping them into subsets and use these sets. An example of this could be a\\n\",\n    \"problem where the states are the natural numbers and we have to define intervals (e.g. negative numbers, numbers in the\\n\",\n    \"range 0-25, 25-200, and 200-∞) or where the actions can be a string of any size and we restrict it to a discrete number\\n\",\n    \"of lengths (e.g. only generate strings with length 3, 5, 8, 13, 21 and 34).\\n\",\n    \"\\n\",\n    \"3) Rewards must be numerical. 
If the rewards the environment gives back are not numerical we would need to encode them\\n\",\n    \"as numbers. This can be a highly difficult task as the rewards may not translate naturally to numbers. For example,\\n\",\n    \"if the rewards were your family's verbal feedback on how good the meal you prepared was, it would be difficult to\\n\",\n    \"convert it into a number and capture correctly its intensity (e.g. What's a better feedback? \\\"It was really good\\\" or\\n\",\n    \"\\\"I have enjoyed the meal a lot\\\"). If we opt for a simpler reward encoding, distinguishing only negative, neutral and\\n\",\n    \"positive comments, this may not capture perfectly the information given.\\n\",\n    \"\\n\",\n    \"The first example is, as far as I understand, a clear exception of the MDP-framework not being an appropriate\\n\",\n    \"representation. Apart from this, some of the points mentioned (or their combination) may difficult largely the task for\\n\",\n    \"an agent, making it really hard to learn anything valuable from the environment.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.3\\n\",\n    \"  \\n\",\n    \"## Question:\\n\",\n    \"Consider the problem of driving. You could define the actions in\\n\",\n    \"terms of the accelerator, steering wheel, and brake, that is, where your body meets\\n\",\n    \"the machine. Or you could define them farther out - say, where the rubber meets the\\n\",\n    \"road, considering your actions to be tire torques. Or you could define them farther\\n\",\n    \"in - say, where your brain meets your body, the actions being muscle twitches to\\n\",\n    \"control your limbs. Or you could go to a really high level and say that your actions\\n\",\n    \"are your choices of where to drive. What is the right level, the right place to draw\\n\",\n    \"the line between agent and environment? 
On what basis is one location of the line\\n\",\n    \"to be preferred over another? Is there any fundamental reason for preferring one\\n\",\n    \"location over another, or is it a free choice?\\n\",\n    \"\\n\",\n    \"## Answer:\\n\",\n    \"The line should be drawn at the functional point: the\\n\",\n    \"point at which, once the agent makes the decision to take an action,\\n\",\n    \"the action always occurs in the same way every time. Above that point\\n\",\n    \"it is better to think of the agent as having sub-goals and goals, where\\n\",\n    \"something like walking to the door is a goal with the sub-goals of\\n\",\n    \"moving the legs, with the actions being things like applying hydraulic pressure to\\n\",\n    \"particular points.\\n\",\n    \"\\n\",\n    \"However, this does depend on the reliability of the system, and on what\\n\",\n    \"reliability you require. If 99% of sub-goals are successful, then it may\\n\",\n    \"be easy to convert them into actions. Abstracting in this way makes\\n\",\n    \"reaching larger goals easier, as it dramatically reduces the\\n\",\n    \"action space that needs to be explored.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.4\\n\",\n    \"\\n\",\n    \"## Question:\\n\",\n    \"Give a table analogous to that in Example 3.3, but for p(s',r|s, a). 
It should have columns for s, a, s', r, and p(s', r|s, a), and a row for every 4-tuple for which p(s',r|s, a) > 0.\\n\",\n    \"\\n\",\n    \"## Answer\\n\",\n    \"Since there is a single reward defined for each triplet (s, a, s'), the table is the same as in Example 3.3 after filtering out the rows with p(s'|s, a) = 0.\\n\",\n    \"\\n\",\n    \"This fulfills formula (3.4):\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"p(s'|s, a) = \\\\sum_{r \\\\in R} p(s',r|s, a)\\n\",\n    \"\\\\end{equation*}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.5\\n\",\n    \"\\n\",\n    \"### Question:\\n\",\n    \"The equations in Section 3.1 are for the continuing case and need to be modified (very slightly) to\\n\",\n    \"apply to episodic tasks. Show that you know the modifications needed by giving the modified version\\n\",\n    \"of (3.3).\\n\",\n    \"\\n\",\n    \"### Answer:\\n\",\n    \"The original formula is:\\n\",\n    \"    \\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"        \\\\sum_{s' \\\\in S} \\\\sum_{r \\\\in R} p(s', r|s, a) = 1, \\\\forall s \\\\in S, a \\\\in A(s)\\n\",\n    \"\\\\end{equation*}\\n\",\n    \"\\n\",\n    \"According to the definitions in 3.3, for episodic tasks the set of terminal and non-terminal states is denoted S+. The next state s' may now be the terminal state, so the sum over next states must run over S+, and the formula changes to:\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"        \\\\sum_{s' \\\\in S^+} \\\\sum_{r \\\\in R} p(s', r|s, a) = 1, \\\\forall s \\\\in S, a \\\\in A(s)\\n\",\n    \"\\\\end{equation*}\\n\",\n    \"since the dynamics of an episodic task include transitions that end in the terminal state.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.6\\n\",\n    \"  \\n\",\n    \"### Question:\\n\",\n    \"Suppose you treated pole-balancing as an episodic task but also used\\n\",\n    \"discounting, with all rewards zero except for -1 upon failure. 
What then would the\\n\",\n    \"return be at each time? How does this return differ from that in the discounted,\\n\",\n    \"continuing formulation of this task?\\n\",\n    \"\\n\",\n    \"### Answer:\\n\",\n    \"\\n\",\n    \"The formula would change to:\\n\",\n    \"\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"G_t = \\\\sum_{k=0}^{T-t-1} \\\\gamma^k R_{t+k+1}\\n\",\n    \"\\\\end{equation*}\\n\",\n    \"\\n\",\n    \"Since all rewards are zero except R_T = -1, this reduces to:\\n\",\n    \"\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"G_t = -\\\\gamma^{T-t-1}\\n\",\n    \"\\\\end{equation*}\\n\",\n    \"\\n\",\n    \"where T is the time of failure. In the discounted, continuing formulation the sum does not stop at the first\\n\",\n    \"failure: every future failure contributes another discounted -1, so the continuing return sums a term of this\\n\",\n    \"form for each future failure, of which the episodic return keeps only the first.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.7\\n\",\n    \"  \\n\",\n    \"## Question:\\n\",\n    \"Imagine that you are designing a robot to run a maze. You decide\\n\",\n    \"to give it a reward of +1 for escaping from the maze and a reward of zero at all\\n\",\n    \"other times. The task seems to break down naturally into episodes - the successive\\n\",\n    \"runs through the maze - so you decide to treat it as an episodic task, where the goal\\n\",\n    \"is to maximize expected total reward (3.1). After running the learning agent for a\\n\",\n    \"while, you find that it is showing no improvement in escaping from the maze. What\\n\",\n    \"is going wrong? Have you effectively communicated to the agent what you want it\\n\",\n    \"to achieve?\\n\",\n    \"\\n\",\n    \"## Answer:\\n\",\n    \"This is likely an exploration issue where the agent is unable to find\\n\",\n    \"the exit the first time and therefore doesn't know there's anything\\n\",\n    \"better than 0 as a reward. Potential solutions include giving a\\n\",\n    \"reward of -1 at every non-goal time step, and/or extending the episode length. 
This\\n\",\n    \"would mean states the agent visits a lot (particularly around the start)\\n\",\n    \"will get worse and worse values, so it will want to move away from there\\n\",\n    \"and eventually find the goal (essentially, reaching the goal stops it\\n\",\n    \"being in pain).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.8\\n\",\n    \"\\n\",\n    \"## Question\\n\",\n    \"Suppose gamma = 0.5 and the following sequence of rewards is received: R_1 = -1, R_2 = 2, R_3 = 6, R_4 = 3 and R_5 = 2, with T = 5. What are G_0, G_1, ..., G_5? Hint: Work backwards.\\n\",\n    \"\\n\",\n    \"## Answer\\n\",\n    \"G_5 = 0, since no rewards arrive after the terminal time T = 5. The remaining returns are computed in the cell below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"G_0 = 2.0\\n\",\n      \"G_1 = 6.0\\n\",\n      \"G_2 = 8.0\\n\",\n      \"G_3 = 4.0\\n\",\n      \"G_4 = 2.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"r = [-1, 2, 6, 3, 2]\\n\",\n    \"gamma = 0.5\\n\",\n    \"\\n\",\n    \"# G_t = sum_k gamma^k * R_{t+k+1}, computed directly for each t\\n\",\n    \"for g_i in range(len(r)):\\n\",\n    \"    ret = sum([gamma**i * r_i for i, r_i in enumerate(r[g_i:])])\\n\",\n    \"    print(\\\"G_\\\" + str(g_i) + \\\" = \\\" + str(ret))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.9\\n\",\n    \"\\n\",\n    \"## Question\\n\",\n    \"Suppose gamma=0.9 and the reward sequence is R_1 = 2 followed by an infinite sequence of 7s. What are G_1 and G_0?\\n\",\n    \"\\n\",\n    \"## Answer\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"G_0 = 2 + \\\\gamma\\\\ 7 + \\\\gamma^2\\\\ 7 + ... 
= 2 + \\\\gamma\\\\ ( \\\\sum_{k=0}^{\\\\infty} \\\\gamma^k\\\\ 7 ) = 2 + \\\\gamma\\\\ 7\\\\ ( \\\\sum_{k=0}^{\\\\infty} \\\\gamma^k) = 2 + \\\\gamma\\\\ 7\\\\ (\\\\frac{1}{1 - \\\\gamma}) = 2 + 0.9 \\\\times 7 \\\\times 10 = 65\\n\",\n    \"\\\\end{equation*}\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"G_1 = 7 + \\\\gamma\\\\ 7 + \\\\gamma^2\\\\ 7 + ... = 7\\\\ ( \\\\sum_{k=0}^{\\\\infty} \\\\gamma^k) = 7 \\\\frac{1}{1 - \\\\gamma} = 70\\n\",\n    \"\\\\end{equation*}\\n\",\n    \"\\n\",\n    \"This result fulfills:\\n\",\n    \"\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"G_0 = R_1 + \\\\gamma G_1 = 2 + 0.9 \\\\times 70 = 65\\n\",\n    \"\\\\end{equation*}\\n\",\n    \"\\n\",\n    \"This problem can also be solved through the equations below (note that G_n = G_1 for every n > 0):\\n\",\n    \"\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"G_0 = R_1 + 0.9\\\\ G_1\\n\",\n    \"\\\\end{equation*}\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"G_1 = R_2 + 0.9\\\\ G_1\\n\",\n    \"\\\\end{equation*}\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.10\\n\",\n    \"\\n\",\n    \"## Question\\n\",\n    \"Prove (3.10).\\n\",\n    \"\\n\",\n    \"## Answer\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"G_t = \\\\sum_{k=0}^{\\\\infty} \\\\gamma^k = \\\\lim_{n \\\\rightarrow \\\\infty} (1 + \\\\gamma + \\\\gamma^2 + ... + \\\\gamma^n) = \\\\lim_{n \\\\rightarrow \\\\infty} \\\\frac{(1 + \\\\gamma + \\\\gamma^2 + ... 
+ \\\\gamma^n) (1 - \\\\gamma)}{(1 - \\\\gamma)} = \\\\lim_{n \\\\rightarrow \\\\infty} \\\\frac{1 - \\\\gamma^{n+1}}{1 - \\\\gamma} = \\\\frac{1}{1 - \\\\gamma}\\n\",\n    \"\\\\end{equation*}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.11\\n\",\n    \"\\n\",\n    \"## Question\\n\",\n    \"If the current state is S_t, and actions are selected according to stochastic policy pi, then what is the expectation of R_t+1 in terms of pi and the four-argument function p (3.2)?\\n\",\n    \"\\n\",\n    \"## Answer\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"\\\\mathbb{E} [R_{t+1} | S_t=s, \\\\pi] = \\\\sum_a \\\\pi(a|s) \\\\sum_{s',r} r\\\\ p(s', r|s, a)\\n\",\n    \"\\\\end{equation*}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.12\\n\",\n    \"## Question\\n\",\n    \"Give an equation for v_pi in terms of q_pi and pi.\\n\",\n    \"\\n\",\n    \"## Answer\\n\",\n    \"\\\\begin{equation*}\\n\",\n    \"v_\\\\pi(s) = \\\\sum_{a\\\\in A(s)} q_\\\\pi(s,a) \\\\pi(a|s)\\n\",\n    \"\\\\end{equation*}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.13\\n\",\n    \"## Question\\n\",\n    \"Give an equation for q_pi in terms of v_pi and the four-argument p.\\n\",\n    \"## Answer\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.14\\n\",\n    \"## Question\\n\",\n    \"The Bellman equation (3.14) must hold for each state for the value function v_pi shown in Figure 3.2 (right) of Example 3.5. Show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, -0.4 and +0.7. 
(These numbers are accurate only to one decimal place.)\\n\",\n    \"## Answer\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.15\\n\",\n    \"## Question\\n\",\n    \"In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant c to all the rewards adds a constant, v_c, to the values of all states, and thus\\n\",\n    \"does not affect the relative values of any states under any policies. What is v_c in terms of c and gamma?\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Answer\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.16\\n\",\n    \"## Question\\n\",\n    \"Now consider adding a constant c to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.\\n\",\n    \"\\n\",\n    \"## Answer\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.17\\n\",\n    \"## Question\\n\",\n    \"What is the Bellman equation for action values, that is, for q_pi? It must give the action value q_pi(s, a) in terms of the action values, q_pi(s',a'), of possible successors to the state–action pair (s, a). Hint: the backup diagram to the right corresponds to this equation. 
Show the sequence of equations analogous to (3.14), but for action values.\\n\",\n    \"\\n\",\n    \"## Answer\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.18\\n\",\n    \"## Question\\n\",\n    \"The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:\\n\",\n    \"\\n\",\n    \"Give the equation corresponding to this intuition and diagram for the value at the root node, v_pi(s), in terms of the value at the expected leaf node, q_pi(s, a), given S_t=s. This equation should include an expectation conditioned on following the policy, \\\\pi. Then give a second equation in which the expected value is written out explicitly in terms of \\\\pi(a|s) such that no expected value notation appears in the equation.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.19\\n\",\n    \"## Question\\n\",\n    \"The value of an action, q_pi(s, a), depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:\\n\",\n    \"\\n\",\n    \"Give the equation corresponding to this intuition and diagram for the action value, q_pi(s, a), in terms of the expected next reward, R_t+1, and the expected next state value, v_pi(S_t+1), given that S_t=s and A_t=a. This equation should include an expectation but not one conditioned on following the policy. 
Then give a second equation, writing out the expected value explicitly in terms of p(s',r|s, a) defined by (3.2), such that no expected value notation appears in the equation.\\n\",\n    \"\\n\",\n    \"## Answer\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise 3.20\\n\",\n    \"Draw or describe the optimal state-value function for the golf example.\\n\",\n    \"\\n\",\n    \"# Exercise 3.21\\n\",\n    \"Draw or describe the contours of the optimal action-value function for\\n\",\n    \"putting, q_*(s, putter), for the golf example.\\n\",\n    \"\\n\",\n    \"# Exercise 3.22\\n\",\n    \"Consider the continuing MDP shown to the right. The only decision to be made is that in the top state,\\n\",\n    \"where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, pi_left and pi_right. What policy is optimal if\\n\",\n    \"gamma=0? If gamma=0.9? If gamma=0.5?\\n\",\n    \"\\n\",\n    \"# Exercise 3.23\\n\",\n    \"Give the Bellman equation for q_pi for the recycling robot.\\n\",\n    \"\\n\",\n    \"# Exercise 3.24\\n\",\n    \"Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. 
Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.\\n\",\n    \"\\n\",\n    \"# Exercise 3.25\\n\",\n    \"Give an equation for v_* in terms of q_*.\\n\",\n    \"\\n\",\n    \"# Exercise 3.26\\n\",\n    \"Give an equation for q_* in terms of v_* and the four-argument p.\\n\",\n    \"\\n\",\n    \"# Exercise 3.27\\n\",\n    \"Give an equation for \\\\pi_* in terms of q_*.\\n\",\n    \"\\n\",\n    \"# Exercise 3.28\\n\",\n    \"Give an equation for pi_* in terms of v_* and the four-argument p.\\n\",\n    \"\\n\",\n    \"# Exercise 3.29\\n\",\n    \"Rewrite the four Bellman equations for the four value functions (v_pi, v_*, q_pi, and q_*) in terms of the three-argument function p (3.4) and the two-argument function r (3.5).\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.7.6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "README.md",
    "content": "## Exercise Solutions for \"Reinforcement Learning: An Introduction\" 2nd Edition \n### A book by Richard S. Sutton and Andrew G. Barto.\n\nYou can find an online version of the book [HERE](http://incompleteideas.net/book/the-book-2nd.html).\n\nThere is no guarantee that any of these solutions are correct. If you spot a mistake, think a solution is incomplete, or simply want to start a discussion about one, please feel free to let me know by submitting an issue or a pull request.\n\n**NOTE:** Exercises 1.1-1.5, 2.3, 2.4, 2.6, 3.3-3.11 come from [JKCooper2's repository](https://github.com/JKCooper2/rlai-exercises)\n"
  }
]