Repository: iamhectorotero/rlai-exercises
Branch: master
Commit: 95a7438b396e
Files: 24
Total size: 43.5 KB

Directory structure:
gitextract_tc2u3k5u/
├── .github/
│   └── ISSUE_TEMPLATE/
│       ├── bug-in-code-solution.md
│       └── missing-exercise-or-outdated-statement.md
├── .gitignore
├── Chapter 1/
│   ├── Exercise 1.1.md
│   ├── Exercise 1.2.md
│   ├── Exercise 1.3.md
│   ├── Exercise 1.4.md
│   └── Exercise 1.5.md
├── Chapter 2/
│   ├── Exercise 2.1.md
│   ├── Exercise 2.2.md
│   ├── Exercise 2.3.md
│   ├── Exercise 2.4.md
│   ├── Exercise 2.5.md
│   ├── Exercise 2.6.md
│   ├── Exercise 2.7.md
│   ├── Exercise 2.8.md
│   ├── Exercise 2.9.md
│   ├── code/
│   │   ├── Exercise 2.5.py
│   │   ├── Exercise 2.9.py
│   │   ├── estimators.py
│   │   └── testbed.py
│   └── tex_files/
│       └── exercise2.7.tex
├── Chapter 3/
│   └── Chapter 3 Exercises.ipynb
└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/ISSUE_TEMPLATE/bug-in-code-solution.md
================================================
---
name: Bug in code solution
about: You have found a mistake in one of the code solutions; please fill in this issue report
---

## Expected Behavior

## Current Behavior

## Steps to Reproduce
1.
2.
3.
4.

## Possible Solution (Optional)

## Detailed Description (Optional)

================================================
FILE: .github/ISSUE_TEMPLATE/missing-exercise-or-outdated-statement.md
================================================
---
name: Missing Exercise or Outdated Statement
about: 'As the book continues to be updated, there can be missing exercises or outdated statements.'
---

## Exercise missing
The Exercise is missing.

## Obviously...
Check [the book online](http://incompleteideas.net/book/the-book-2nd.html) for the exercises and add statements and solutions.

================================================
FILE: .gitignore
================================================
.DS_Store
__pycache__

================================================
FILE: Chapter 1/Exercise 1.1.md
================================================
# Exercise 1.1: Self-Play

## Question:
Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself. What do you think would happen in this case? Would it learn a different way of playing?

## Answer:
The algorithm will continue to adapt until it reaches an equilibrium, which may be either fixed (always making the same moves) or cyclical. It is possible to reach a higher skill level than against a fixed opponent, but because it is only ever reacting to itself it may also miss paths toward stronger overall play.

================================================
FILE: Chapter 1/Exercise 1.2.md
================================================
# Exercise 1.2: Symmetries

## Question
Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the reinforcement learning algorithm described above to take advantage of this? In what ways would this improve it? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

## Answer:
For tic-tac-toe it is possible to use its four axes of symmetry (together with the rotations they generate) to fold the state space down to roughly an eighth of its size, for example by mapping every position to a canonical representative before looking up its value (see the sketch below). This would dramatically increase the speed and reduce the memory required.
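As a rough illustration (not part of the original answer), the sketch below maps a board to a canonical representative under the eight board symmetries; the 3x3 tuple encoding of the board is an assumption made for the example:

```python
# Minimal sketch: canonicalise a tic-tac-toe board under the 8 board
# symmetries (4 rotations, each optionally reflected), so that all
# symmetrically equivalent positions share one value-table entry.
# The board encoding (a 3x3 tuple of "X"/"O"/" ") is an assumption
# made for illustration, not part of this repository's code.

def rotate(board):
    """Rotate a 3x3 board 90 degrees clockwise."""
    return tuple(tuple(board[2 - c][r] for c in range(3)) for r in range(3))

def reflect(board):
    """Mirror a 3x3 board horizontally."""
    return tuple(tuple(reversed(row)) for row in board)

def canonical(board):
    """Return the lexicographically smallest equivalent board."""
    variants = []
    b = board
    for _ in range(4):
        b = rotate(b)
        variants.extend([b, reflect(b)])
    return min(variants)

values = {}  # value table keyed by canonical position

board = (("X", " ", " "),
         (" ", "O", " "),
         (" ", " ", " "))
values[canonical(board)] = 0.5  # all 8 equivalent positions share this entry
```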
If the opponent did not take advantage of symmetries then exploiting them could result in worse overall performance. For example, if the opponent played correctly everywhere except in one corner, using symmetries would mean we never take advantage of that information. This means symmetrically equivalent positions do not always hold the same value in a multi-player game.

================================================
FILE: Chapter 1/Exercise 1.3.md
================================================
# Exercise 1.3: Greedy Play

## Question
Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Would it learn to play better, or worse, than a non-greedy player? What problems might occur?

## Answer:
In general it would play worse. The chance that the first action to return a positive reward is also the best action for that situation in the long run is pretty slim, particularly if there is a large number of actions available. A greedy player would also be unable to adapt to opponents that slowly altered their behaviour over time.

================================================
FILE: Chapter 1/Exercise 1.4.md
================================================
# Exercise 1.4: Learning from Exploration

## Question
Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time, then the state values would converge to a set of probabilities. What are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

## Answer:
With the step-size parameter appropriately reduced, and assuming the exploration rate is fixed, the set of probabilities learned without updating on exploratory moves is the value of each state given that the optimal action is taken from then on, whereas with learning from exploration it is the expected value of each state under the active exploration policy. The former is the better one to learn, as it removes the variance introduced by sub-optimal exploratory moves (e.g. if you can win a game of chess in one move, but playing a different move would let your opponent win, that doesn't make it a bad state). The former would also result in more wins, all other things being equal.

================================================
FILE: Chapter 1/Exercise 1.5.md
================================================
# Exercise 1.5: Other Improvements

## Question:
Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?

## Answer:
If the opponent was adapting over time, decaying the old updates could speed up the improvement. Another improvement is altering the exploration rate and learning rate based on the variance in the opponent's actions: if the opponent is always making the same moves and you are winning, then ε-greedy with 10% exploration is just going to lose you games. Even though tic-tac-toe is a solvable game, playing the game-theoretic solution may not yield the highest reward against a sub-optimal opponent.

================================================
FILE: Chapter 2/Exercise 2.1.md
================================================
# Exercise 2.1

## Question:
In ε-greedy action selection, for the case of two actions and ε = 0.5, what is the probability that the greedy action is selected?

## Answer:
0.5 + 0.5 \* 0.5 = 0.75

50% of the time the greedy action is selected outright (because it is the best choice), and in the other 50% of cases the action is chosen at random, picking the greedy one half of the time by chance. A quick simulation confirming this is sketched below.
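A minimal check (illustrative, not part of the original solution): simulate ε-greedy selection with two actions and ε = 0.5 and measure how often the greedy action is chosen.

```python
# Minimal check: with two actions and epsilon = 0.5, the greedy action
# should be chosen with probability 0.5 + 0.5 * 0.5 = 0.75.
import random

random.seed(0)
epsilon = 0.5
n_trials = 100_000
greedy_action = 0

picks = 0
for _ in range(n_trials):
    if random.random() >= epsilon:   # exploit
        action = greedy_action
    else:                            # explore uniformly over both actions
        action = random.choice([0, 1])
    picks += action == greedy_action

print(picks / n_trials)  # ~0.75
```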
================================================
FILE: Chapter 2/Exercise 2.2.md
================================================
# Exercise 2.2

## Question:
_Bandit example_ Consider a k-armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using ε-greedy action selection, sample-average action-value estimates, and initial estimates of Q1(a) = 0, for all a. Suppose the initial sequence of actions and rewards is A1 = 1, R1 = -1, A2 = 2, R2 = 1, A3 = 2, R3 = -2, A4 = 2, R4 = 2, A5 = 3, R5 = 0. On some of these time steps the ε case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?

## Answer:
Let's build a table of Q_t(a) for each time step t:

|      |a=1      |a=2     |a=3     |a=4     |
|:----:|:-------:|:------:|:------:|:------:|
|t=1   |0.00     |0.00    |0.00    |0.00    |
|t=2   |-1.00    |0.00    |0.00    |0.00    |
|t=3   |-1.00    |1.00    |0.00    |0.00    |
|t=4   |-1.00    |-0.50   |0.00    |0.00    |
|t=5   |-1.00    |0.33    |0.00    |0.00    |

- A_1 = 1: random selection or greedy selection (all estimates are tied).
- A_2 = 2: random selection or greedy selection (a=2, a=3 and a=4 are tied for the greedy choice).
- A_3 = 2: random selection or greedy selection (a=2 is the greedy choice).
- A_4 = 2: definitely non-greedy selection (exploration), since Q_4(2) = -0.50 while a=3 and a=4 are valued at 0.
- A_5 = 3: definitely non-greedy selection (exploration), since Q_5(2) = 0.33 is the greedy choice.

================================================
FILE: Chapter 2/Exercise 2.3.md
================================================
# Exercise 2.3

## Question:
In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and cumulative probability of selecting the best action? How much better will it be?

## Answer:
In the long run both methods' value estimates converge, so ε=0.01 will select the best action 99.1% of the time (0.99 + 0.01 \* 0.1) versus ε=0.1, which will select it 91% of the time (0.9 + 0.1 \* 0.1), where 0.1 = 1/10 is the chance of picking the best action at random among the 10 arms.

================================================
FILE: Chapter 2/Exercise 2.4.md
================================================
# Exercise 2.4

## Question:
If the step-size parameters, αn, are not constant, then the estimate Qn is a weighted average of previously received rewards with a weighting different from that given by (2.6). What is the weighting on each prior reward for the general case, analogous to (2.6), in terms of the sequence of step-size parameters?

## Answer:
The constant step-size equation (2.6),

Q_{n+1} = (1-α)^n Q_1 + Σ_{i=1}^{n} α (1-α)^{n-i} R_i

becomes, for a general sequence of step sizes α_i,

Q_{n+1} = Π_{i=1}^{n} (1-α_i) Q_1 + Σ_{i=1}^{n} [α_i Π_{j=i+1}^{n} (1-α_j)] R_i

so the weighting on the prior reward R_i is α_i Π_{j=i+1}^{n} (1-α_j). Essentially this means you need to keep track of the α's used and adjust the products accordingly. A numerical check of this identity is sketched below.
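A minimal numerical check (illustrative, not part of the repository's code): unroll the incremental update with varying step sizes and compare it against the closed-form weighting above.

```python
# Minimal check: the closed-form weighting with varying step sizes
# matches the incremental update Q_{n+1} = Q_n + alpha_n * (R_n - Q_n).
import math
import random

random.seed(0)
n = 10
q1 = 0.0
alphas = [random.uniform(0.05, 0.5) for _ in range(n)]
rewards = [random.gauss(0, 1) for _ in range(n)]

# Incremental form.
q = q1
for alpha, r in zip(alphas, rewards):
    q += alpha * (r - q)

# Closed form: Q_{n+1} = prod(1 - a_i) * Q_1 + sum_i a_i * prod_{j>i}(1 - a_j) * R_i.
closed = math.prod(1 - a for a in alphas) * q1
for i in range(n):
    weight = alphas[i] * math.prod(1 - a for a in alphas[i + 1:])
    closed += weight * rewards[i]

print(abs(q - closed) < 1e-12)  # True
```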
================================================
FILE: Chapter 2/Exercise 2.5.md
================================================
# Exercise 2.5

## Question:
Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for non-stationary problems. Use a modified version of the 10-armed testbed in which all the q\*(a) start out equal and then take independent random walks (say by adding a normally distributed increment with mean zero and standard deviation 0.01 to all the q\*(a) on each step). Prepare plots like Figure 2.2 for an action-value method using sample averages, incrementally computed, and another action-value method using a constant step-size parameter, α = 0.1. Use ε = 0.1 and longer runs, say of 10,000 steps.

## Answer:
Run code/Exercise 2.5.py

![Average Reward](images/average_reward.png)
![Action Optimality](images/action_optimality.png)

================================================
FILE: Chapter 2/Exercise 2.6.md
================================================
# Exercise 2.6

## Question:
The results shown in Figure 2.3 should be quite reliable because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve for the optimistic method? In other words, what might make this method perform particularly better or worse, on average, on particular early steps?

## Answer:
The oscillations in the early stage are likely caused by the algorithm reducing the Q values of the poorly performing options. To start with, it thinks all of the bad options are good and has to try each of them multiple times before realising they are actually bad. How long this takes depends on how quickly the bad options' estimates drop below the best option's (once that occurs, only exploration will continue to drive them down to their correct values over time, but at a much reduced rate).

================================================
FILE: Chapter 2/Exercise 2.7.md
================================================
# Exercise 2.7

## Question:
Show that in the case of two actions, the soft-max distribution is the same as that given by the logistic, or sigmoid, function often used in statistics and artificial neural networks.

## Answer:
![Exercise 2.7 solution](images/Exercise_2_7.png)

================================================
FILE: Chapter 2/Exercise 2.8.md
================================================
# Exercise 2.8

## Question:
Suppose you face a 2-armed bandit task whose true action values change randomly from time step to time step. Specifically, suppose that, for any time step, the true values of actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5 (case B). If you are not able to tell which case you face at any step, what is the best expectation of success you can achieve and how should you behave to achieve it? Now suppose that on each step you are told whether you are facing case A or case B (although you still don't know the true action values). This is an associative search task. What is the best expectation of success you can achieve in this task, and how should you behave to achieve it?

## Answer:
In the first scenario you cannot hold individual estimates for cases A and B, so the best approach is to select the action with the best value estimate averaged over both cases. Here both actions have the same expected value, 0.5, so any fixed choice (or selecting an action at random on each step) achieves the best expectation of success, 0.5:

A1 = 0.5 \* 0.1 + 0.5 \* 0.9 = 0.5

A2 = 0.5 \* 0.2 + 0.5 \* 0.8 = 0.5

In the second scenario you can hold independent estimates for cases A and B, so we can learn the best action for each one, treating them as independent bandit problems. The best expectation of success is 0.55, obtained by selecting action 2 in case A and action 1 in case B:

0.5 \* 0.2 + 0.5 \* 0.9 = 0.55

A small simulation of both scenarios is sketched below.
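A minimal simulation (illustrative, not part of the repository's code) comparing the expected success of the best context-blind policy against the best context-aware one:

```python
# Minimal simulation: expected success of the best context-blind
# policy (~0.50) versus the best context-aware policy (~0.55).
import random

random.seed(0)
n_steps = 100_000
values = {"A": (0.1, 0.2), "B": (0.9, 0.8)}

blind_total = 0.0
aware_total = 0.0
for _ in range(n_steps):
    case = random.choice("AB")
    blind_total += values[case][random.randint(0, 1)]      # any fixed/random pick
    aware_total += values[case][1 if case == "A" else 0]   # best action per case

print(blind_total / n_steps)  # ~0.50
print(aware_total / n_steps)  # ~0.55
```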
================================================
FILE: Chapter 2/Exercise 2.9.md
================================================
# Exercise 2.9

## Question:
(programming) Make a figure analogous to Figure 2.6 for the non-stationary case outlined in Exercise 2.5. Include the constant step-size ε-greedy algorithm with α = 0.1. Use runs of 200,000 steps and, as a performance measure for each algorithm and parameter setting, use the average reward over the last 100,000 steps.

## Answer:
Run code/Exercise 2.9.py

![Average Reward](images/average_reward_per_parameter_conf.png)

================================================
FILE: Chapter 2/code/Exercise 2.5.py
================================================
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

from estimators import SampleAverageEstimator, WeightedEstimator
from testbed import K_armed_testbed

np.random.seed(250)


def plot_performance(estimator_names, rewards, action_optimality):
    # Average reward per step, averaged across runs.
    for i, estimator_name in enumerate(estimator_names):
        average_run_rewards = np.average(rewards[i], axis=0)
        plt.plot(average_run_rewards, label=estimator_name)
    plt.legend()
    plt.xlabel("Steps")
    plt.ylabel("Average reward")
    plt.show()

    # Fraction of optimal-action selections per step, averaged across runs.
    for i, estimator_name in enumerate(estimator_names):
        average_run_optimality = np.average(action_optimality[i], axis=0)
        plt.plot(average_run_optimality, label=estimator_name)
    plt.legend()
    plt.xlabel("Steps")
    plt.ylabel("% Optimal action")
    plt.show()


if __name__ == "__main__":
    K = 10
    N_STEPS = 10000
    N_RUNS = 2000
    N_ESTIMATORS = 2

    rewards = np.full((N_ESTIMATORS, N_RUNS, N_STEPS), fill_value=0.)
    optimal_selections = np.full((N_ESTIMATORS, N_RUNS, N_STEPS), fill_value=0.)

    for run_i in tqdm(range(N_RUNS)):
        testbed = K_armed_testbed(k_actions=K)
        action_value_estimates = np.full(K, fill_value=0.0)
        sample_average_estimator = SampleAverageEstimator(action_value_estimates.copy(), epsilon=0.1)
        weighted_estimator = WeightedEstimator(action_value_estimates.copy(), epsilon=0.1, alpha=0.1)
        estimators = [sample_average_estimator, weighted_estimator]

        for step_i in range(N_STEPS):
            for estimator_i, estimator in enumerate(estimators):
                action_selected = estimator.select_action()
                is_optimal = testbed.is_optimal_action(action_selected)
                reward = testbed.sample_action(action_selected)
                estimator.update_estimates(action_selected, reward)

                rewards[estimator_i][run_i][step_i] = reward
                optimal_selections[estimator_i][run_i][step_i] = is_optimal

            # The q*(a) values drift after every step (non-stationary testbed).
            testbed.random_walk_action_values()

    plot_performance(["ε=0.1", "ε=0.1 α=0.1"], np.array(rewards), np.array(optimal_selections))
================================================
FILE: Chapter 2/code/Exercise 2.9.py
================================================
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

from testbed import K_armed_testbed
from estimators import SampleAverageEstimator, WeightedEstimator, GradientBandit, UCBEstimator

np.random.seed(250)


def plot_performance_of_parameter_settings(parameter_settings, estimator_names, performance_results):
    # One curve per algorithm: average reward over the measured steps,
    # as a function of the swept parameter (ε, α, c or Q0).
    for estimator_i, estimator_results in enumerate(performance_results):
        average_parameter_results = []
        for parameter_setting_results in estimator_results:
            average_run_results = np.average(parameter_setting_results, axis=0)
            average_step_rewards = np.average(average_run_results)
            average_parameter_results.append(average_step_rewards)
        plt.plot(parameter_settings, average_parameter_results, label=estimator_names[estimator_i])
    plt.legend()
    plt.xlabel("ε, α, c, Q0")
    plt.xscale("log", basex=2)  # on matplotlib >= 3.3 this keyword is "base"
    plt.ylabel("Average reward over last %d steps" % AVERAGE_OVER_LAST_N_STEPS)
    plt.show()


if __name__ == "__main__":
    # Placeholder CSV header for logging runs (nothing is appended to it
    # elsewhere in the script).
    with open("runs.csv", "w+") as csvfile:
        csvfile.write("algorithm_id,parameter_setting,run_i\n")

    K = 10
    N_STEPS = 200000
    N_RUNS = 10
    N_ESTIMATORS = 4
    AVERAGE_OVER_LAST_N_STEPS = 100000
    starting_index = N_STEPS - AVERAGE_OVER_LAST_N_STEPS
    parameter_settings = [1.0/128, 1.0/64, 1.0/32, 1.0/16, 1.0/8, 1.0/4, 1.0/2, 1.0, 2.0, 4.0]

    rewards = np.full((N_ESTIMATORS, len(parameter_settings), N_RUNS, AVERAGE_OVER_LAST_N_STEPS), fill_value=0.)

    for parameter_setting_i, parameter_setting in tqdm(enumerate(parameter_settings), total=len(parameter_settings)):
        for run_i in tqdm(range(N_RUNS)):
            testbed = K_armed_testbed(k_actions=K)
            action_value_estimates = np.full(K, fill_value=0.0)
            sample_average_estimator = SampleAverageEstimator(action_value_estimates.copy(), epsilon=parameter_setting)
            weighted_estimator = WeightedEstimator(action_value_estimates.copy(), epsilon=0.1, alpha=parameter_setting)
            ucb = UCBEstimator(action_value_estimates.copy(), epsilon=0.1, alpha=0.1, c=parameter_setting)
            gradient_bandit = GradientBandit(action_value_estimates.copy(), alpha=parameter_setting)
            estimators = [sample_average_estimator, weighted_estimator, ucb, gradient_bandit]

            for step_i in tqdm(range(N_STEPS)):
                for estimator_i, estimator in enumerate(estimators):
                    action_selected = estimator.select_action()
                    reward = testbed.sample_action(action_selected)
                    estimator.update_estimates(action_selected, reward)

                    # Only the last AVERAGE_OVER_LAST_N_STEPS rewards are kept.
                    if step_i >= starting_index:
                        rewards[estimator_i][parameter_setting_i][run_i][step_i - starting_index] = reward

                testbed.random_walk_action_values()

    estimator_names = ["Sample Average Estimator", "Constant Step-size Estimator", "UCB", "Gradient Bandit"]
    plot_performance_of_parameter_settings(parameter_settings, estimator_names, rewards)
================================================
FILE: Chapter 2/code/estimators.py
================================================
import numpy as np


class Estimator(object):
    """Base class holding action-value estimates and selection counts."""

    def __init__(self, action_value_initial_estimates):
        self.action_value_estimates = action_value_initial_estimates
        self.k_actions = len(action_value_initial_estimates)
        self.action_selected_count = np.full(self.k_actions, fill_value=0, dtype="int64")

    def select_action(self):
        raise NotImplementedError("Need to implement a method to select actions")

    def update_estimates(self, action_selected, r):
        raise NotImplementedError("Need to implement a method to update action value estimates")

    def select_greedy_action(self):
        return np.argmax(self.action_value_estimates)

    def select_action_randomly(self):
        return np.random.choice(self.k_actions)


class SampleAverageEstimator(Estimator):
    """ε-greedy selection with incrementally computed sample averages."""

    def __init__(self, action_value_initial_estimates, epsilon):
        super(SampleAverageEstimator, self).__init__(action_value_initial_estimates)
        self.epsilon = epsilon

    def update_estimates(self, action_selected, r):
        self.action_selected_count[action_selected] += 1
        qn = self.action_value_estimates[action_selected]
        n = self.action_selected_count[action_selected]
        self.action_value_estimates[action_selected] = qn + (1.0 / n) * (r - qn)

    def select_action(self):
        probability = np.random.rand()
        if probability >= self.epsilon:
            return self.select_greedy_action()
        return self.select_action_randomly()


class WeightedEstimator(SampleAverageEstimator):
    """ε-greedy selection with a constant step size (exponential recency-weighted average)."""

    def __init__(self, action_value_initial_estimates, epsilon=0, alpha=0.5):
        super(WeightedEstimator, self).__init__(action_value_initial_estimates, epsilon)
        self.alpha = alpha

    def update_estimates(self, action_selected, r):
        # Keep the selection counts up to date as well: subclasses
        # (e.g. UCBEstimator) rely on them.
        self.action_selected_count[action_selected] += 1
        qn = self.action_value_estimates[action_selected]
        self.action_value_estimates[action_selected] = qn + self.alpha * (r - qn)


class UCBEstimator(WeightedEstimator):
    """Constant step-size estimates; on the ε fraction of steps, explores via
    an upper-confidence-bound criterion (2.10) instead of uniformly."""

    def __init__(self, action_value_initial_estimates, epsilon=0, alpha=0.5, c=2):
        super(UCBEstimator, self).__init__(action_value_initial_estimates, epsilon, alpha)
        self.c = c
        self.t = 0

    def select_action(self):
        self.t += 1
        probability = np.random.rand()
        if probability >= self.epsilon:
            return self.select_greedy_action()
        return self.select_ucb_action()

    def calculate_action_potential(self, action_i):
        q_t = self.action_value_estimates[action_i]
        ln_t = np.log(self.t)
        n_t = self.action_selected_count[action_i]
        return q_t + self.c * np.sqrt(ln_t / n_t)

    def select_ucb_action(self):
        # Actions never selected are considered maximally uncertain and are
        # tried first (their counts are updated in update_estimates).
        if 0 in self.action_selected_count:
            actions_never_selected = [action_i for action_i in range(self.k_actions)
                                      if self.action_selected_count[action_i] == 0]
            return np.random.choice(actions_never_selected)

        # Pick the highest-potential action, excluding the current greedy
        # one (this branch is only reached when exploring).
        action_potential = [self.calculate_action_potential(action_i)
                            for action_i in range(self.k_actions)]
        action_potential[self.select_greedy_action()] = -np.inf
        return np.argmax(action_potential)


class GradientBandit(Estimator):
    """Gradient bandit algorithm with soft-max action preferences (2.12)."""

    def __init__(self, action_value_initial_estimates, alpha):
        super(GradientBandit, self).__init__(action_value_initial_estimates)
        self.average_reward = 0
        self.numerical_preference = np.full(self.k_actions, fill_value=0.)
        self.alpha = alpha

    def update_average_reward(self, r):
        # Exponential recency-weighted baseline, suitable for the
        # non-stationary testbed.
        qn = self.average_reward
        self.average_reward = qn + self.alpha * (r - qn)

    def update_estimates(self, action_selected, r):
        self.update_average_reward(r)
        P = self.get_actions_probabilities()
        baseline = self.average_reward

        # Preference updates from (2.12): raise the preference of the
        # selected action, lower all the others.
        ht = self.numerical_preference
        htp1 = ht - self.alpha * (r - baseline) * P
        htp1[action_selected] = ht[action_selected] + self.alpha * (r - baseline) * (1 - P[action_selected])
        self.numerical_preference = htp1

    def get_actions_probabilities(self):
        exp_numerical_preference = np.exp(self.numerical_preference)
        return exp_numerical_preference / np.sum(exp_numerical_preference)

    def select_action(self):
        return np.random.choice(a=self.k_actions, p=self.get_actions_probabilities())

================================================
FILE: Chapter 2/code/testbed.py
================================================
import numpy as np
from numpy.random import normal as GaussianDistribution


class K_armed_testbed():
    """Non-stationary k-armed testbed: the q*(a) values for the k possible
    actions start out equal and then take independent random walks."""

    def __init__(self, k_actions):
        self.k = k_actions
        # Stationary alternative:
        # self.action_values = GaussianDistribution(loc=0, scale=1, size=self.k)
        self.action_values = np.full(self.k, fill_value=0.0)

    def random_walk_action_values(self):
        increment = GaussianDistribution(loc=0, scale=0.01, size=self.k)
        self.action_values += increment

    def sample_action(self, action_i):
        # Rewards are drawn from a unit-variance Gaussian centred on q*(a).
        return GaussianDistribution(loc=self.action_values[action_i], scale=1, size=1)[0]

    def get_optimal_action(self):
        return np.argmax(self.action_values)

    def get_optimal_action_value(self):
        return self.action_values[self.get_optimal_action()]

    def is_optimal_action(self, action_i):
        return float(self.get_optimal_action_value() == self.action_values[action_i])

    def __str__(self):
        return "\t".join(["A%d: %.2f" % (action_i, self.action_values[action_i])
                          for action_i in range(self.k)])

================================================
FILE: Chapter 2/tex_files/exercise2.7.tex
================================================
\documentclass[12pt]{article}
\usepackage[margin=1in]{geometry}

\begin{document}
\thispagestyle{empty}

\noindent Using the definition of the \textbf{sigmoid function} we would have:
$$Pr\{A_t = a\} = \frac{e^{H_t(a)}}{e^{H_t(a)} + 1}$$
$$Pr\{A_t = b\} = 1 - Pr\{A_t = a\} = \frac{1}{e^{H_t(a)} + 1}$$

\noindent Extending the definition of the \textbf{soft-max distribution} for the case of two actions (k=2) we get the following:
$$Pr\{A_t=a\} = \frac{e^{H_t(a)}}{e^{H_t(a)} + e^{H_t(b)}}$$
$$Pr\{A_t=b\} = \frac{e^{H_t(b)}}{e^{H_t(a)} + e^{H_t(b)}}$$

\noindent By the definition of the \textit{numerical preferences}, subtracting the same amount from each of the preferences does not affect the probabilities, so we can redefine $H_t(a)$ and $H_t(b)$ as:
$$H_t(b) \leftarrow H_t(b) - H_t(b) = 0$$
$$H_t(a) \leftarrow H_t(a) - H_t(b)$$

\noindent and we get:
$$Pr\{A_t=a\} = \frac{e^{H_t(a)}}{e^{H_t(a)} + e^0} = \frac{e^{H_t(a)}}{e^{H_t(a)} + 1}$$
$$Pr\{A_t=b\} = \frac{e^{0}}{e^{H_t(a)} + e^{0}} = \frac{1}{e^{H_t(a)} + 1}$$

\end{document}
================================================
FILE: Chapter 3/Chapter 3 Exercises.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.1 \n",
    " \n",
    "## Question:\n",
    "Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions,\n",
    "and rewards. Make the three examples as _different_ from each other as possible. The framework is abstract and\n",
    "flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.\n",
    "\n",
    "## Answer:\n",
    "Example 1: Hairdresser agent\n",
    "\n",
    "- States: the state of the hair and the desired cut of the client. The state of the hair and desired cut could be\n",
    "encoded as arrays of the length of the hair in a set of predetermined areas the head is divided into.\n",
    "- Actions: using the scissors or the clipper (and what accessory) and the area to be cut.\n",
    "- Rewards: negative rewards for client complaints or imprecisions in the cut and positive rewards for tips.\n",
    "\n",
    "Example 2: DJ agent\n",
    "\n",
    "- States: a measure of how much people are dancing and singing to the song being played and the song currently playing.\n",
    "- Actions: given a setlist of 5000 songs, selecting the next song to be played (or a combination of them) and a type of\n",
    "transition between the songs.\n",
    "- Rewards: negative rewards given by people leaving the club faster than expected, positive rewards given by the level\n",
    "of danciness/_singiness_ in the club.\n",
    "\n",
    "Example 3: Texas Hold'em Poker player agent\n",
    "\n",
    "- States: The two cards in its hand and the cards showing on the table.\n",
    "- Actions: Check, call, raise, or fold.\n",
    "- Rewards: The money obtained or lost after playing one hand.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.2 \n",
    " \n",
    "## Question:\n",
    "Is the MDP framework adequate to usefully represent *all* goal-directed learning tasks?\n",
    "Can you think of any clear exceptions?\n",
    "\n",
    "## Answer:\n",
    "We can try to think about it in terms of the limitations a finite MDP imposes on a problem definition and whether any\n",
    "approximations could exist within the framework.\n",
    "\n",
    "1) The Markov property may not hold, i.e. more than just the previous state and the action selected may influence\n",
    "which state comes next. An example of this could be a modified game of chess where the order in which the pieces were\n",
    "moved affects the possible moves. This could be encoded within the MDP: a state would be composed of the positions of\n",
    "the pieces and when they have been moved, skyrocketing the number of available states and making it a more difficult\n",
    "learning environment (see the sketch in the cell below). If the sequence information wasn't available (or was only\n",
    "partially available) at the time of making a decision, information from the past that can affect the next state would\n",
    "be missing, breaking the Markov property.\n",
    "\n",
    "2) The action and state sets must be finite. Any problem with infinitely many available actions or states would need\n",
    "alternative representations, such as grouping them into subsets and using those sets. An example of this could be a\n",
    "problem where the states are the natural numbers and we have to define intervals (e.g. negative numbers, numbers in the\n",
    "range 0-25, 25-200, and 200-∞) or where the actions can be a string of any size and we restrict it to a discrete number\n",
    "of lengths (e.g. only generate strings with length 3, 5, 8, 13, 21 and 34).\n",
    "\n",
    "3) Rewards must be numerical. If the rewards the environment gives back are not numerical we would need to encode them\n",
    "as numbers. This can be a highly difficult task as the rewards may not translate naturally to numbers. For example,\n",
    "if the rewards were your family's verbal feedback on how good the meal you prepared was, it would be difficult to\n",
    "convert it into a number and capture its intensity correctly (e.g. What's better feedback? \"It was really good\" or\n",
    "\"I have enjoyed the meal a lot\"). If we opt for a simpler reward encoding, distinguishing only negative, neutral and\n",
    "positive comments, this may not capture perfectly the information given.\n",
    "\n",
    "The first example is, as far as I understand, a clear exception where the MDP framework is not an appropriate\n",
    "representation. Apart from it, some of the points mentioned (or their combination) may make the task far more\n",
    "difficult for an agent, making it really hard to learn anything valuable from the environment."
   ]
  },
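  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch (not part of the original answer): restoring the\n",
    "# Markov property by augmenting the state with the history that matters.\n",
    "# The toy rule below (a move may not repeat the previous move) is an\n",
    "# assumption made for illustration: (position, last_move) is Markov,\n",
    "# while position alone is not.\n",
    "from typing import NamedTuple, Optional\n",
    "\n",
    "class AugmentedState(NamedTuple):\n",
    "    position: str\n",
    "    last_move: Optional[str]\n",
    "\n",
    "def legal_moves(state, all_moves=(\"a\", \"b\", \"c\")):\n",
    "    # Depends only on the current (augmented) state.\n",
    "    return [m for m in all_moves if m != state.last_move]\n",
    "\n",
    "print(legal_moves(AugmentedState(position=\"start\", last_move=\"a\")))  # ['b', 'c']"
   ]
  },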
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.3\n",
    " \n",
    "## Question:\n",
    "Consider the problem of driving. You could define the actions in\n",
    "terms of the accelerator, steering wheel, and brake, that is, where your body meets\n",
    "the machine. Or you could define them farther out - say, where the rubber meets the\n",
    "road, considering your actions to be tire torques. Or you could define them farther\n",
    "in - say, where your brain meets your body, the actions being muscle twitches to\n",
    "control your limbs. Or you could go to a really high level and say that your actions\n",
    "are your choices of where to drive. What is the right level, the right place to draw\n",
    "the line between agent and environment? On what basis is one location of the line\n",
    "to be preferred over another? Is there any fundamental reason for preferring one\n",
    "location over another, or is it a free choice?\n",
    "\n",
    "## Answer:\n",
    "The line should be drawn at the functional point: the point at which,\n",
    "once the agent decides to take an action, the action occurs in the\n",
    "same way every time. Above that point it is better to think of the\n",
    "agent as having goals and sub-goals, where something like walking to\n",
    "the door is a goal with the sub-goals of moving the legs, with the\n",
    "action of applying hydraulic pressure to particular points.\n",
    "\n",
    "However, this does depend on the reliability of the system and on what\n",
    "reliability you require. If 99% of sub-goals are successful then it may\n",
    "be easy to convert them into actions. Abstracting in this way makes\n",
    "reaching larger goals easier as it dramatically reduces the\n",
    "action space that needs to be explored."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.4\n",
    "\n",
    "## Question:\n",
    "Give a table analogous to that in Example 3.3, but for p(s',r|s, a). It should have columns for s, a, s', r, and p(s', r|s, a), and a row for every 4-tuple for which p(s',r|s, a) > 0.\n",
    "\n",
    "## Answer\n",
    "Since there is a single reward defined for each triplet (s, a, s'), the table is the same as the one in Example 3.3 after filtering out the rows with p(s'|s, a) = 0.\n",
    "\n",
    "This fulfills formula (3.4):\n",
    "\\begin{equation*}\n",
    "p(s'|s, a) = \\sum_{r \\in R} p(s',r|s, a)\n",
    "\\end{equation*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.5\n",
    "\n",
    "## Question:\n",
    "The equations in Section 3.1 are for the continuing case and need to be modified (very slightly) to\n",
    "apply to episodic tasks. Show that you know the modifications needed by giving the modified version\n",
    "of (3.3)\n",
    "\n",
    "## Answer:\n",
    "The original formula is:\n",
    " \n",
    "\\begin{equation*}\n",
    " \\sum_{s' \\in S} \\sum_{r \\in R} p(s', r|s, a) = 1, \\forall s \\in S, a \\in A(s)\n",
    "\\end{equation*}\n",
    "\n",
    "According to the definitions in Section 3.3, for episodic tasks S+ denotes the set of all states, terminal and non-terminal. Since the dynamics of an episodic MDP include transitions that end in a terminal state, the next state s' must range over S+, and the formula changes to:\n",
    "\\begin{equation*}\n",
    " \\sum_{s' \\in S^+} \\sum_{r \\in R} p(s', r|s, a) = 1, \\forall s \\in S, a \\in A(s)\n",
    "\\end{equation*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.6\n",
    " \n",
    "## Question:\n",
    "Suppose you treated pole-balancing as an episodic task but also used\n",
    "discounting, with all rewards zero except for -1 upon failure. What then would the\n",
    "return be at each time? How does this return differ from that in the discounted,\n",
    "continuing formulation of this task?\n",
    "\n",
    "## Answer:\n",
    "\n",
    "The formula would change to:\n",
    "\n",
    "\\begin{equation*}\n",
    "G_t = \\sum_{k=0}^{T-t-1} \\gamma^k R_{t+k+1}\n",
    "\\end{equation*}\n",
    "\n",
    "Since all rewards are zero except for -1 upon failure at time T, this reduces to $G_t = -\\gamma^{T-t-1}$ (see the check in the cell below). In the discounted, continuing formulation the sum does not stop at T, so the return accumulates one such term for every future failure. In the limit (very large T), both returns would be the same."
   ]
  },
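  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick check of the claim above (not part of the original answer):\n",
    "# with all rewards zero except R_T = -1, the discounted episodic return\n",
    "# is G_t = -gamma**(T - t - 1).\n",
    "gamma = 0.9\n",
    "T = 10\n",
    "rewards = [0] * (T - 1) + [-1]  # rewards[i] holds R_{i+1}\n",
    "\n",
    "for t in range(T):\n",
    "    G_t = sum(gamma**k * rewards[t + k] for k in range(T - t))\n",
    "    assert abs(G_t - (-gamma**(T - t - 1))) < 1e-12\n",
    "print(\"All episodic returns match -gamma**(T-t-1)\")"
   ]
  },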
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.7\n",
    " \n",
    "## Question:\n",
    "Imagine that you are designing a robot to run a maze. You decide\n",
    "to give it a reward of +1 for escaping from the maze and a reward of zero at all\n",
    "other times. The task seems to break down naturally into episodes - the successive\n",
    "runs through the maze - so you decide to treat it as an episodic task, where the goal\n",
    "is to maximize expected total reward (3.1). After running the learning agent for a\n",
    "while, you find that it is showing no improvement in escaping from the maze. What\n",
    "is going wrong? Have you effectively communicated to the agent what you want it\n",
    "to achieve?\n",
    "\n",
    "## Answer:\n",
    "This is likely an exploration issue: the agent is unable to find the\n",
    "exit the first time around and therefore doesn't know there's anything\n",
    "better than 0 as a reward. Note also that, as posed, every policy that\n",
    "eventually escapes obtains the same total reward of +1, so nothing\n",
    "communicates to the agent that it should escape quickly. Potential\n",
    "solutions include having each non-goal state be worth -1 (which\n",
    "addresses both problems) and/or extending the episode length. States\n",
    "the agent visits a lot (particularly around the start) would then get\n",
    "worse and worse values, so it will want to move away from them and\n",
    "eventually find the goal (essentially, reaching the goal stops it\n",
    "being in pain)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.8\n",
    "\n",
    "## Question\n",
    "Suppose gamma = 0.5 and the following sequence of rewards is received R_1 = -1, R_2 = 2, R_3 = 6, R_4 = 3 and R_5 = 2, with T = 5. What are G_0, G_1, ..., G_5? Hint: Work backwards\n",
    "\n",
    "## Answer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "G_0 = 2.0\n",
      "G_1 = 6.0\n",
      "G_2 = 8.0\n",
      "G_3 = 4.0\n",
      "G_4 = 2.0\n",
      "G_5 = 0\n"
     ]
    }
   ],
   "source": [
    "r = [-1, 2, 6, 3, 2]\n",
    "gamma = 0.5\n",
    "\n",
    "# G_T = 0 by definition, so iterate up to and including g_i = T = 5.\n",
    "for g_i in range(len(r) + 1):\n",
    "    ret = sum([gamma**i * r_i for i, r_i in enumerate(r[g_i:])])\n",
    "    print(\"G_\" + str(g_i) + \" = \" + str(ret))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.9\n",
    "\n",
    "## Question\n",
    "Suppose gamma=0.9 and the reward sequence is R_1 = 2 followed by an infinite sequence of 7s. What are G_1 and G_0?\n",
    "\n",
    "## Answer\n",
    "\\begin{equation*}\n",
    "G_0 = 2 + \\gamma\\ 7 + \\gamma^2\\ 7 + ... = 2 + \\gamma\\ ( \\sum_{k=0}^{\\infty} \\gamma^k\\ 7 ) = 2 + \\gamma\\ 7\\ ( \\sum_{k=0}^{\\infty} \\gamma^k) = 2 + \\gamma\\ 7\\ (\\frac{1}{1 - \\gamma}) = 2 + 0.9 \\times 7 \\times 10 = 65\n",
    "\\end{equation*}\n",
    "\\begin{equation*}\n",
    "G_1 = 7 + \\gamma\\ 7 + \\gamma^2\\ 7 + ... = 7\\ ( \\sum_{k=0}^{\\infty} \\gamma^k) = 7 \\frac{1}{1 - \\gamma} = 70\n",
    "\\end{equation*}\n",
    "\n",
    "This result fulfills:\n",
    "\n",
    "\\begin{equation*}\n",
    "G_0 = R_1 + \\gamma G_1 = 2 + 0.9 \\times 70 = 65\n",
    "\\end{equation*}\n",
    "\n",
    "This problem can also be solved through the equations below (note that G_n = G_1 for all n > 0):\n",
    "\n",
    "\\begin{equation*}\n",
    "G_0 = R_1 + 0.9\\ G_1\n",
    "\\end{equation*}\n",
    "\\begin{equation*}\n",
    "G_1 = R_2 + 0.9\\ G_1\n",
    "\\end{equation*}\n",
    "\n",
    "A quick numerical check is in the cell below."
   ]
  },
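  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick numerical check of G_0 = 65 and G_1 = 70 (not part of the\n",
    "# original answer): truncate the infinite sum; with gamma = 0.9 the\n",
    "# terms vanish quickly, so 1000 terms are plenty.\n",
    "gamma = 0.9\n",
    "G_1 = sum(gamma**k * 7 for k in range(1000))\n",
    "G_0 = 2 + gamma * G_1\n",
    "print(G_0, G_1)  # ~65.0, ~70.0"
   ]
  },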
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.10\n",
    "\n",
    "## Question\n",
    "Prove (3.10)\n",
    "\n",
    "## Answer\n",
    "\\begin{equation*}\n",
    "G_t = \\sum_{k=0}^\\infty \\gamma^k = \\lim_{n \\rightarrow \\infty} (1 + \\gamma + \\gamma^2 + ... + \\gamma^n) = \\lim_{n \\rightarrow \\infty} \\frac{(1 + \\gamma + \\gamma^2 + ... + \\gamma^n) (1 - \\gamma)}{(1 - \\gamma)} = \\lim_{n \\rightarrow \\infty} \\frac{1 - \\gamma^{n+1}}{1 - \\gamma} = \\frac{1}{1 - \\gamma}\n",
    "\\end{equation*}\n",
    "\n",
    "where the last step uses that $\\gamma^{n+1} \\rightarrow 0$ since $0 \\leq \\gamma < 1$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.11\n",
    "\n",
    "## Question\n",
    "If the current state is S_t, and actions are selected according to stochastic policy pi, then what is the expectation of R_t+1 in terms of pi and the four-argument function p (3.2)?\n",
    "\n",
    "## Answer\n",
    "\\begin{equation*}\n",
    "\\mathbb{E} [R_{t+1} | S_t=s, \\pi] = \\sum_a \\pi(a|s) \\sum_{s',r} r\\ p(s', r|s, a)\n",
    "\\end{equation*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.12\n",
    "## Question\n",
    "Give an equation for v_pi in terms of q_pi and pi.\n",
    "\n",
    "## Answer\n",
    "\\begin{equation*}\n",
    "v_\\pi(s) = \\sum_{a\\in A(s)} q_\\pi(s,a) \\pi(a|s)\n",
    "\\end{equation*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.13\n",
    "## Question\n",
    "Give an equation for q_pi in terms of v_pi and the four-argument p\n",
    "## Answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.14\n",
    "## Question\n",
    "The Bellman equation (3.14) must hold for each state for the value function v_pi shown in Figure 3.2 (right) of Example 3.5. Show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, -0.4 and +0.7. (These numbers are accurate only to one decimal place.)\n",
    "## Answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.15\n",
    "## Question\n",
    "In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant c to all the rewards adds a constant, v_c, to the values of all states, and thus\n",
    "does not affect the relative values of any states under any policies. What is v_c in terms of c and gamma?\n",
    "\n",
    "\n",
    "## Answer\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.16\n",
    "## Question\n",
    "Now consider adding a constant c to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.\n",
    "\n",
    "## Answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.17\n",
    "## Question\n",
    "What is the Bellman equation for action values, that is, for q_pi? It must give the action value q_pi(s, a) in terms of the action values, q_pi(s',a'), of possible successors to the state–action pair (s, a). Hint: the backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.\n",
    "\n",
    "## Answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.18\n",
    "## Question\n",
    "The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:\n",
    "\n",
    "Give the equation corresponding to this intuition and diagram for the value at the root node, v_pi(s), in terms of the value at the expected leaf node, q_pi(s, a), given S_t=s. This equation should include an expectation conditioned on following the policy, \\pi. Then give a second equation in which the expected value is written out explicitly in terms of \\pi(a|s) such that no expected value notation appears in the equation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3.19\n",
    "The value of an action, q_pi(s, a), depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:\n",
    "\n",
    "Give the equation corresponding to this intuition and diagram for the action value, q_pi(s, a), in terms of the expected next reward, R_t+1, and the expected next state value, v_pi(S_t+1), given that S_t=s and A_t=a. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of p(s',r|s, a) defined by (3.2), such that no expected value notation appears in the equation.\n",
    "\n",
    "## Answer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exercise 3.20\n",
    "Draw or describe the optimal state-value function for the golf example.\n",
    "\n",
    "Exercise 3.21\n",
    "Draw or describe the contours of the optimal action-value function for\n",
    "putting, q\\*(s, putter), for the golf example.\n",
    "\n",
    "Exercise 3.22\n",
    "Consider the continuing MDP shown to the right. The only decision to be made is that in the top state,\n",
    "where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, pi_left and pi_right. What policy is optimal if\n",
    "gamma=0? If gamma=0.9? If gamma=0.5?\n",
    "\n",
    "Exercise 3.23\n",
    "Give the Bellman equation for q_pi for the recycling robot.\n",
    "\n",
    "Exercise 3.24\n",
    "Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.\n",
    "\n",
    "Exercise 3.25\n",
    "Give an equation for v_* in terms of q_*.\n",
    "\n",
    "Exercise 3.26\n",
    "Give an equation for q_* in terms of v_* and the four-argument p.\n",
    "\n",
    "Exercise 3.27\n",
    "Give an equation for pi_* in terms of q_*.\n",
    "\n",
    "Exercise 3.28\n",
    "Give an equation for pi_* in terms of v_* and the four-argument p.\n",
    "\n",
    "Exercise 3.29\n",
    "Rewrite the four Bellman equations for the four value functions (v_pi, v_*, q_pi and q_*) in terms of the three-argument function p (3.4) and the two-argument function r (3.5)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

================================================
FILE: README.md
================================================
## Exercise Solutions for "Reinforcement Learning: An Introduction" 2nd Edition
### A book by Richard S. Sutton and Andrew G. Barto.

You can find an online version of the book [HERE](http://incompleteideas.net/book/the-book-2nd.html).
I make no guarantees about the correctness of any of the solutions, so if you spot a mistake, think a solution is incomplete, or simply want to start a discussion on one of them, please let me know by submitting an issue or pull request.

**NOTE:** Exercises 1.1-1.5, 2.3, 2.4, 2.6, 3.3-3.11 come from [JKCooper2's repository](https://github.com/JKCooper2/rlai-exercises)