Full Code of openai/spinningup for AI

Repository: openai/spinningup
Branch: master
Commit: 038665d62d56
Files: 117
Total size: 643.1 KB

Directory structure:
gitextract_hmpqqyur/

├── .gitignore
├── .travis.yml
├── LICENSE
├── docs/
│   ├── Makefile
│   ├── _static/
│   │   └── css/
│   │       └── modify.css
│   ├── algorithms/
│   │   ├── ddpg.rst
│   │   ├── ppo.rst
│   │   ├── sac.rst
│   │   ├── td3.rst
│   │   ├── trpo.rst
│   │   └── vpg.rst
│   ├── conf.py
│   ├── docs_requirements.txt
│   ├── etc/
│   │   ├── acknowledgements.rst
│   │   └── author.rst
│   ├── images/
│   │   ├── rl_algorithms.xml
│   │   └── rl_algorithms_9_15.xml
│   ├── index.rst
│   ├── make.bat
│   ├── spinningup/
│   │   ├── bench.rst
│   │   ├── bench_ddpg.rst
│   │   ├── bench_ppo.rst
│   │   ├── bench_sac.rst
│   │   ├── bench_td3.rst
│   │   ├── bench_vpg.rst
│   │   ├── exercise2_1_soln.rst
│   │   ├── exercise2_2_soln.rst
│   │   ├── exercises.rst
│   │   ├── extra_pg_proof1.rst
│   │   ├── extra_pg_proof2.rst
│   │   ├── extra_tf_pg_implementation.rst
│   │   ├── keypapers.rst
│   │   ├── rl_intro.rst
│   │   ├── rl_intro2.rst
│   │   ├── rl_intro3.rst
│   │   ├── rl_intro4.rst
│   │   └── spinningup.rst
│   ├── user/
│   │   ├── algorithms.rst
│   │   ├── installation.rst
│   │   ├── introduction.rst
│   │   ├── plotting.rst
│   │   ├── running.rst
│   │   └── saving_and_loading.rst
│   └── utils/
│       ├── logger.rst
│       ├── mpi.rst
│       ├── plotter.rst
│       └── run_utils.rst
├── readme.md
├── readthedocs.yml
├── setup.py
├── spinup/
│   ├── __init__.py
│   ├── algos/
│   │   ├── __init__.py
│   │   ├── pytorch/
│   │   │   ├── ddpg/
│   │   │   │   ├── core.py
│   │   │   │   └── ddpg.py
│   │   │   ├── ppo/
│   │   │   │   ├── core.py
│   │   │   │   └── ppo.py
│   │   │   ├── sac/
│   │   │   │   ├── core.py
│   │   │   │   └── sac.py
│   │   │   ├── td3/
│   │   │   │   ├── core.py
│   │   │   │   └── td3.py
│   │   │   ├── trpo/
│   │   │   │   └── trpo.py
│   │   │   └── vpg/
│   │   │       ├── core.py
│   │   │       └── vpg.py
│   │   └── tf1/
│   │       ├── ddpg/
│   │       │   ├── __init__.py
│   │       │   ├── core.py
│   │       │   └── ddpg.py
│   │       ├── ppo/
│   │       │   ├── __init__.py
│   │       │   ├── core.py
│   │       │   └── ppo.py
│   │       ├── sac/
│   │       │   ├── __init__.py
│   │       │   ├── core.py
│   │       │   └── sac.py
│   │       ├── td3/
│   │       │   ├── __init__.py
│   │       │   ├── core.py
│   │       │   └── td3.py
│   │       ├── trpo/
│   │       │   ├── __init__.py
│   │       │   ├── core.py
│   │       │   └── trpo.py
│   │       └── vpg/
│   │           ├── __init__.py
│   │           ├── core.py
│   │           └── vpg.py
│   ├── examples/
│   │   ├── pytorch/
│   │   │   ├── bench_ppo_cartpole.py
│   │   │   └── pg_math/
│   │   │       ├── 1_simple_pg.py
│   │   │       └── 2_rtg_pg.py
│   │   └── tf1/
│   │       ├── bench_ppo_cartpole.py
│   │       ├── pg_math/
│   │       │   ├── 1_simple_pg.py
│   │       │   └── 2_rtg_pg.py
│   │       └── train_mnist.py
│   ├── exercises/
│   │   ├── common.py
│   │   ├── pytorch/
│   │   │   ├── problem_set_1/
│   │   │   │   ├── exercise1_1.py
│   │   │   │   ├── exercise1_2.py
│   │   │   │   ├── exercise1_2_auxiliary.py
│   │   │   │   └── exercise1_3.py
│   │   │   ├── problem_set_1_solutions/
│   │   │   │   ├── exercise1_1_soln.py
│   │   │   │   └── exercise1_2_soln.py
│   │   │   └── problem_set_2/
│   │   │       └── exercise2_2.py
│   │   └── tf1/
│   │       ├── problem_set_1/
│   │       │   ├── exercise1_1.py
│   │       │   ├── exercise1_2.py
│   │       │   └── exercise1_3.py
│   │       ├── problem_set_1_solutions/
│   │       │   ├── exercise1_1_soln.py
│   │       │   └── exercise1_2_soln.py
│   │       └── problem_set_2/
│   │           └── exercise2_2.py
│   ├── run.py
│   ├── user_config.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── logx.py
│   │   ├── mpi_pytorch.py
│   │   ├── mpi_tf.py
│   │   ├── mpi_tools.py
│   │   ├── plot.py
│   │   ├── run_entrypoint.py
│   │   ├── run_utils.py
│   │   ├── serialization_utils.py
│   │   └── test_policy.py
│   └── version.py
├── test/
│   └── test_ppo.py
└── travis_setup.sh

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.*~
__pycache__/
*.pkl
data/
**/*.egg-info
.python-version
.idea/
.vscode/
.DS_Store
_build/


================================================
FILE: .travis.yml
================================================
env:
 global:
 - LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/travis/.mujoco/mujoco200/bin

matrix:
    include:
        - os: linux
          language: python
          python: "3.6"

before_install:
    - ./travis_setup.sh

script:
    - pip3 install --upgrade -e .[mujoco]
    - python3 -c "import mujoco_py"
    - python3 -c "import spinup"
    - python3 -m pytest


================================================
FILE: LICENSE
================================================
The MIT License

Copyright (c) 2018 OpenAI (http://openai.com)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.


================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
SPHINXPROJ    = SpinningUp
SOURCEDIR     = .
BUILDDIR      = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

================================================
FILE: docs/_static/css/modify.css
================================================
:root {
    /* Colors */
    --color--white: #fff;
    --color--lightwash: #f7fbfb;
    --color--mediumwash: #eff7f8;
    --color--darkwash: #e6f3f3;
    --color--warmgraylight: #eeedee;
    --color--warmgraydark: #a3acb0;
    --color--coolgray1: #c5c5d2;
    --color--coolgray2: #8e8ea0;
    --color--coolgray3: #6e6e80;
    --color--coolgray4: #404452;
    --color--black: #050505;
    --color--pink: #e6a2e4;
    --color--magenta: #dd5ce5;
    --color--red: #bd1c5f;
    --color--brightred: #ef4146;
    --color--orange: #e86c09;
    --color--golden: #f4ac36;
    --color--yellow: #ebe93d;
    --color--lightgreen: #68de7a;
    --color--darkgreen: #10a37f;
    --color--teal: #2ff3ce;
    --color--lightblue: #27b5ea;
    --color--mediumblue: #2e95d3;
    --color--darkblue: #5436da;
    --color--navyblue: #1d0d4c;
    --color--lightpurple: #6b40d8;
    --color--darkpurple: #412991;
    --color--lightgrayishpurple: #cdc3cf;
    --color--mediumgrayishpurple: #9c88a3;
    --color--darkgrayishpurple: #562f5f;
}

body {
  color: var(--color--darkgray) !important;
  fill: var(--color--darkgray) !important;
}

h1, h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend {
  /* font-weight: 500;
  font-family: Colfax, sans-serif !important; */
  font-family: "Lato","proxima-nova","Helvetica Neue",Arial,sans-serif !important;
}

.wy-nav-top {
  background-color: var(--color--coolgray4) !important;
}

.rst-content .toc-backref {
    color: #404040 !important;
}

.footnote {
  padding-left: 0.75rem;
  background-color: var(--color--warmgraylight) !important;
}

.wy-nav-top a, .wy-nav-top a:visited {
 color: var(--color--white) !important;
}

.wy-menu-vertical header, .wy-menu-vertical p.caption {
  font-weight: 500 !important;
  letter-spacing: 1px;
  margin-top: 1.25rem;
}

.wy-side-nav-search {
  background-color: var(--color--warmgraylight) !important;
}

.wy-body-for-nav {
  background-color: var(--color--coolgray1) !important;
}

.wy-menu-vertical li span.toctree-expand {
  color:  var(--color--coolgray2) !important;
}

.wy-nav-side {
  color: var(--color--coolgray1) !important;
  background-color: var(--color--coolgray4) !important;
}

.wy-side-nav-search input[type=text] {
  border-color: var(--color--warmgraydark) !important;
}

a {
  color: var(--color--mediumblue) !important;
}

a:visited {
  color: #9B59B6 !important;
}

.wy-menu-vertical a {
  color: var(--color--coolgray2) !important;
}

.wy-menu-vertical li.current a {
  border-right: none !important;
  color: var(--color--coolgray4) !important;
}

.wy-menu-vertical li.current {
  background-color: var(--color--warmgraylight) !important;
}

.wy-menu-vertical li.toctree-l2.current>a {
  background-color: var(--color--coolgray1) !important;
}

.wy-menu-vertical a:hover, .wy-menu-vertical li.current a:hover, .wy-menu-vertical li.toctree-l2.current>a:hover {
  color: var(--color--warmgraylight) !important;
  background-color: var(--color--coolgray3) !important;
}

.wy-alert-title, .rst-content .admonition-title {
  background-color: var(--color--mediumblue) !important;
}

.wy-alert, .rst-content .note, .rst-content .attention, .rst-content .caution, .rst-content .danger, .rst-content .error, .rst-content .hint, .rst-content .important, .rst-content .tip, .rst-content .warning, .rst-content .seealso, .rst-content .admonition-todo, .rst-content .admonition {
  background-color: var(--color--warmgraylight) !important;
}

.rst-content dl:not(.docutils) dt {
  border-color: var(--color--mediumblue) !important;
  background-color: var(--color--warmgraylight) !important;
}

/* .rst-content pre.literal-block, .rst-content div[class^='highlight'] {
  background-color: var(--color--warmgraylight) !important;
} */

.wy-table-odd td, .wy-table-striped tr:nth-child(2n-1) td, .rst-content table.docutils:not(.field-list) tr:nth-child(2n-1) td {
  background-color: var(--color--warmgraylight) !important;
}

@media screen and (min-width: 1100px) {
  .wy-nav-content-wrap {
      background-color: var(--color--warmgraylight) !important;
  }
}

.wy-side-nav-search img {
  height: auto !important;
  width: 100% !important;
  padding: 0 !important;
  background-color: inherit !important;
  border-radius: 0 !important;
  margin: 0 !important;
}

.wy-side-nav-search>a, .wy-side-nav-search .wy-dropdown>a {
  margin-bottom: 0 !important;
}

.wy-menu-vertical li.toctree-l1.current>a {
  border: none !important;
}

.wy-side-nav-search>div.version {
 color: var(--color--coolgray2) !important;
}

================================================
FILE: docs/algorithms/ddpg.rst
================================================
==================================
Deep Deterministic Policy Gradient
==================================

.. contents:: Table of Contents

Background
==========

(Previously: `Introduction to RL Part 1: The Optimal Q-Function and the Optimal Action`_)

.. _`Introduction to RL Part 1: The Optimal Q-Function and the Optimal Action`: ../spinningup/rl_intro.html#the-optimal-q-function-and-the-optimal-action

Deep Deterministic Policy Gradient (DDPG) is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.

This approach is closely connected to Q-learning, and is motivated the same way: if you know the optimal action-value function :math:`Q^*(s,a)`, then in any given state, the optimal action :math:`a^*(s)` can be found by solving

.. math::
    
    a^*(s) = \arg \max_a Q^*(s,a).

DDPG interleaves learning an approximator to :math:`Q^*(s,a)` with learning an approximator to :math:`a^*(s)`, and it does so in a way which is specifically adapted for environments with continuous action spaces. But what does it mean that DDPG is adapted *specifically* for environments with continuous action spaces? It relates to how we compute the max over actions in :math:`\max_a Q^*(s,a)`. 

When there are a finite number of discrete actions, the max poses no problem, because we can just compute the Q-values for each action separately and directly compare them. (This also immediately gives us the action which maximizes the Q-value.) But when the action space is continuous, we can't exhaustively evaluate the space, and solving the optimization problem is highly non-trivial. Using a normal optimization algorithm would make calculating :math:`\max_a Q^*(s,a)` a painfully expensive subroutine. And since it would need to be run every time the agent wants to take an action in the environment, this is unacceptable.
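
To make the discrete case concrete, here is a minimal sketch of the "compute every Q-value and compare" step. The list ``q_values`` and the helper ``greedy_action`` are hypothetical names chosen for illustration; they are not part of Spinning Up.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q(s, a) estimates for three discrete actions in one state.
q_values = [0.1, 1.7, -0.4]
best = greedy_action(q_values)   # index of the maximizing action
best_value = q_values[best]      # the max over actions comes for free
```

This enumeration has no direct analogue when the action is a real-valued vector, which is the gap DDPG is designed to fill.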

Because the action space is continuous, the function :math:`Q^*(s,a)` is presumed to be differentiable with respect to the action argument. This allows us to set up an efficient, gradient-based learning rule for a policy :math:`\mu(s)` which exploits that fact. Then, instead of running an expensive optimization subroutine each time we wish to compute :math:`\max_a Q(s,a)`, we can approximate it with :math:`\max_a Q(s,a) \approx Q(s,\mu(s))`. See the Key Equations section for details.


Quick Facts
-----------

* DDPG is an off-policy algorithm.
* DDPG can only be used for environments with continuous action spaces.
* DDPG can be thought of as being deep Q-learning for continuous action spaces.
* The Spinning Up implementation of DDPG does not support parallelization.

Key Equations
-------------

Here, we'll explain the math behind the two parts of DDPG: learning a Q function, and learning a policy.

The Q-Learning Side of DDPG
^^^^^^^^^^^^^^^^^^^^^^^^^^^

First, let's recap the Bellman equation describing the optimal action-value function, :math:`Q^*(s,a)`. It's given by

.. math::

    Q^*(s,a) = \underset{s' \sim P}{{\mathrm E}}\left[r(s,a) + \gamma \max_{a'} Q^*(s', a')\right]

where :math:`s' \sim P` is shorthand for saying that the next state, :math:`s'`, is sampled by the environment from a distribution :math:`P(\cdot| s,a)`. 

This Bellman equation is the starting point for learning an approximator to :math:`Q^*(s,a)`. Suppose the approximator is a neural network :math:`Q_{\phi}(s,a)`, with parameters :math:`\phi`, and that we have collected a set :math:`{\mathcal D}` of transitions :math:`(s,a,r,s',d)` (where :math:`d` indicates whether state :math:`s'` is terminal). We can set up a **mean-squared Bellman error (MSBE)** function, which tells us roughly how closely :math:`Q_{\phi}` comes to satisfying the Bellman equation:

.. math::

    L(\phi, {\mathcal D}) = \underset{(s,a,r,s',d) \sim {\mathcal D}}{{\mathrm E}}\left[
        \Bigg( Q_{\phi}(s,a) - \left(r + \gamma (1 - d) \max_{a'} Q_{\phi}(s',a') \right) \Bigg)^2
        \right]

Here, in evaluating :math:`(1-d)`, we've used a Python convention of evaluating ``True`` to 1 and ``False`` to zero. Thus, when ``d==True``---which is to say, when :math:`s'` is a terminal state---the Q-function should show that the agent gets no additional rewards after the current state. (This choice of notation corresponds to what we later implement in code.)
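
As a rough illustration of the :math:`(1-d)` convention, the sketch below computes the target and the MSBE for two toy transitions. It assumes a small discrete set of next-state Q-values so that the max can be taken directly; the function names and numbers are made up for this example.

```python
def bellman_target(r, d, next_q_values, gamma=0.99):
    """Target r + gamma * (1 - d) * max_a' Q(s', a') for one transition."""
    # Python treats True as 1 and False as 0, so (1 - d) zeroes out the
    # bootstrap term when s' is terminal.
    return r + gamma * (1 - d) * max(next_q_values)


def msbe(q_values, targets):
    """Mean-squared Bellman error over a batch of (Q(s,a), target) pairs."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


# Two hypothetical transitions that differ only in the done flag.
y_nonterminal = bellman_target(r=1.0, d=False, next_q_values=[0.5, 2.0], gamma=0.5)
y_terminal = bellman_target(r=1.0, d=True, next_q_values=[0.5, 2.0], gamma=0.5)
loss = msbe(q_values=[1.5, 1.0], targets=[y_nonterminal, y_terminal])
```

For the terminal transition the target collapses to the reward alone, which is exactly the "no additional rewards after the current state" behavior described above.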

Q-learning algorithms for function approximators, such as DQN (and all its variants) and DDPG, are largely based on minimizing this MSBE loss function. There are two main tricks employed by all of them which are worth describing, and then a specific detail for DDPG.

**Trick One: Replay Buffers.** All standard algorithms for training a deep neural network to approximate :math:`Q^*(s,a)` make use of an experience replay buffer. This is the set :math:`{\mathcal D}` of previous experiences. In order for the algorithm to have stable behavior, the replay buffer should be large enough to contain a wide range of experiences, but it may not always be good to keep everything. If you only use the most recent data, you will overfit to it and learning will break down; if you use too much experience, you may slow down your learning. This may take some tuning to get right.

.. admonition:: You Should Know

    We've mentioned that DDPG is an off-policy algorithm: this is as good a point as any to highlight why and how. Observe that the replay buffer *should* contain old experiences, even though they might have been obtained using an outdated policy. Why are we able to use these at all? The reason is that the Bellman equation *doesn't care* which transition tuples are used, or how the actions were selected, or what happens after a given transition, because the optimal Q-function should satisfy the Bellman equation for *all* possible transitions. So any transitions that we've ever experienced are fair game when trying to fit a Q-function approximator via MSBE minimization.
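
A replay buffer can be sketched in a few lines. The class below is an illustrative stand-in built on ``collections.deque`` and is not the buffer used in the Spinning Up code; the names ``ReplayBuffer``, ``store``, and ``sample_batch`` are assumptions made for this example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s2, d) transitions."""

    def __init__(self, max_size):
        # A deque with maxlen silently drops the oldest item when full.
        self.storage = deque(maxlen=max_size)

    def store(self, s, a, r, s2, d):
        self.storage.append((s, a, r, s2, d))

    def sample_batch(self, batch_size, rng=random):
        # Uniform sampling without replacement; the sampled transitions may
        # have been generated by much older versions of the policy.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buf = ReplayBuffer(max_size=3)
for t in range(5):
    buf.store(s=t, a=0.0, r=1.0, s2=t + 1, d=False)
batch = buf.sample_batch(2)
```

The capacity ``max_size`` is the knob discussed above: too small and only recent data survives, too large and stale experience dominates.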

**Trick Two: Target Networks.** Q-learning algorithms make use of **target networks**. The term 

.. math::

    r + \gamma (1 - d) \max_{a'} Q_{\phi}(s',a')

is called the **target**, because when we minimize the MSBE loss, we are trying to make the Q-function be more like this target. Problematically, the target depends on the same parameters we are trying to train: :math:`\phi`. This makes MSBE minimization unstable. The solution is to use a set of parameters which comes close to :math:`\phi`, but with a time delay---that is to say, a second network, called the target network, which lags the first. The parameters of the target network are denoted :math:`\phi_{\text{targ}}`. 

In DQN-based algorithms, the target network is just copied over from the main network every fixed number of steps. In DDPG-style algorithms, the target network is updated once per main-network update by polyak averaging:

.. math::

    \phi_{\text{targ}} \leftarrow \rho \phi_{\text{targ}} + (1 - \rho) \phi,

where :math:`\rho` is a hyperparameter between 0 and 1 (usually close to 1). (This hyperparameter is called ``polyak`` in our code.)
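
The update rule can be sketched on plain Python lists, treating each list as a flattened parameter vector. The helper ``polyak_update`` is a hypothetical name, and the value ``polyak=0.5`` is chosen only so the arithmetic is easy to follow; it is far smaller than what would be used in practice.

```python
def polyak_update(target_params, main_params, polyak=0.995):
    """Return rho * target + (1 - rho) * main, elementwise."""
    return [polyak * p_targ + (1 - polyak) * p
            for p_targ, p in zip(target_params, main_params)]


targ = [0.0, 0.0]    # toy target-network parameters
main = [1.0, -2.0]   # toy main-network parameters (held fixed here)

targ = polyak_update(targ, main, polyak=0.5)   # halfway toward main
targ = polyak_update(targ, main, polyak=0.5)   # three quarters of the way
```

With :math:`\rho` close to 1 the target moves only a small fraction of the way per update, which is what makes it lag the main network.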


**DDPG Detail: Calculating the Max Over Actions in the Target.** As mentioned earlier: computing the maximum over actions in the target is a challenge in continuous action spaces. DDPG deals with this by using a **target policy network** to compute an action which approximately maximizes :math:`Q_{\phi_{\text{targ}}}`. The target policy network is found the same way as the target Q-function: by polyak averaging the policy parameters over the course of training. 

Putting it all together, Q-learning in DDPG is performed by minimizing the following MSBE loss with stochastic gradient descent:

.. math::

    L(\phi, {\mathcal D}) = \underset{(s,a,r,s',d) \sim {\mathcal D}}{{\mathrm E}}\left[
        \Bigg( Q_{\phi}(s,a) - \left(r + \gamma (1 - d) Q_{\phi_{\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s')) \right) \Bigg)^2
        \right],

where :math:`\mu_{\theta_{\text{targ}}}` is the target policy.
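
The only change from the generic target is that the max over actions is replaced by a call to the target policy. A minimal sketch, with toy scalar functions standing in for the two target networks (both lambdas are invented for illustration):

```python
def ddpg_target(r, s2, d, q_targ, mu_targ, gamma=0.99):
    """Target r + gamma * (1 - d) * Q_targ(s', mu_targ(s'))."""
    return r + gamma * (1 - d) * q_targ(s2, mu_targ(s2))


# Toy stand-ins for the target policy and target Q-function.
mu_targ = lambda s: 0.5 * s
q_targ = lambda s, a: s + 2.0 * a

y = ddpg_target(r=1.0, s2=2.0, d=False, q_targ=q_targ, mu_targ=mu_targ, gamma=0.5)
y_done = ddpg_target(r=1.0, s2=2.0, d=True, q_targ=q_targ, mu_targ=mu_targ, gamma=0.5)
```

Here ``mu_targ(2.0)`` is 1.0 and ``q_targ(2.0, 1.0)`` is 4.0, so the non-terminal target is 1.0 + 0.5 * 4.0.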


The Policy Learning Side of DDPG
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Policy learning in DDPG is fairly simple. We want to learn a deterministic policy :math:`\mu_{\theta}(s)` which gives the action that maximizes :math:`Q_{\phi}(s,a)`. Because the action space is continuous, and we assume the Q-function is differentiable with respect to action, we can just perform gradient ascent (with respect to policy parameters only) to solve

.. math::

    \max_{\theta} \underset{s \sim {\mathcal D}}{{\mathrm E}}\left[ Q_{\phi}(s, \mu_{\theta}(s)) \right].

Note that the Q-function parameters are treated as constants here.
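
To see the chain rule at work, consider a toy problem where everything is scalar: a fixed Q-function :math:`Q(s,a) = -(a - 2s)^2` and a linear policy :math:`\mu_{\theta}(s) = \theta s`. Both are assumptions made up for this sketch; the optimal parameter is clearly :math:`\theta = 2`.

```python
def dq_da(s, a):
    """Derivative of the toy Q(s, a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (a - 2.0 * s)


def policy_gradient(theta, states):
    """Batch mean of dQ/da * dmu/dtheta, with mu_theta(s) = theta * s."""
    # The Q-function is held fixed; only theta is differentiated.
    grads = [dq_da(s, theta * s) * s for s in states]
    return sum(grads) / len(grads)


theta = 0.0
states = [1.0, -1.0, 0.5]   # hypothetical batch of states from the buffer
for _ in range(200):
    theta += 0.1 * policy_gradient(theta, states)   # gradient *ascent*
```

In a real implementation the two derivatives are produced by automatic differentiation through the Q-network and the policy network, but the structure of the update is the same.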



Exploration vs. Exploitation
----------------------------

DDPG trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make DDPG policies explore better, we add noise to their actions at training time. The authors of the original DDPG paper recommended time-correlated `OU noise`_, but more recent results suggest that uncorrelated, mean-zero Gaussian noise works perfectly well. Since the latter is simpler, it is preferred. To facilitate getting higher-quality training data, you may reduce the scale of the noise over the course of training. (We do not do this in our implementation, and keep noise scale fixed throughout.)

At test time, to see how well the policy exploits what it has learned, we do not add noise to the actions.

.. _`OU noise`: https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process

.. admonition:: You Should Know

    Our DDPG implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the ``start_steps`` keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal DDPG exploration.
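
The two exploration regimes can be sketched as a single action-selection function. This example assumes a scalar action bounded symmetrically in ``[-act_limit, act_limit]``; the names ``act_limit`` and ``act_noise`` and the toy policy are assumptions for illustration, while ``start_steps`` plays the role described above.

```python
import random


def select_action(t, obs, policy, act_limit, start_steps, act_noise, rng):
    """Uniform random actions for the first start_steps, then noisy policy actions."""
    if t < start_steps:
        return rng.uniform(-act_limit, act_limit)
    # Uncorrelated, mean-zero Gaussian noise, then clip to the valid range.
    a = policy(obs) + act_noise * rng.gauss(0.0, 1.0)
    return max(-act_limit, min(act_limit, a))


rng = random.Random(0)
policy = lambda obs: 0.9 * obs   # toy deterministic policy

train_actions = [
    select_action(t, 1.0, policy, act_limit=1.0, start_steps=5,
                  act_noise=0.1, rng=rng)
    for t in range(20)
]
test_action = policy(1.0)   # at test time, no noise is added
```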


Pseudocode
----------

.. math::
    :nowrap:

    \begin{algorithm}[H]
        \caption{Deep Deterministic Policy Gradient}
        \label{alg1}
    \begin{algorithmic}[1]
        \STATE Input: initial policy parameters $\theta$, Q-function parameters $\phi$, empty replay buffer $\mathcal{D}$
        \STATE Set target parameters equal to main parameters $\theta_{\text{targ}} \leftarrow \theta$, $\phi_{\text{targ}} \leftarrow \phi$
        \REPEAT
            \STATE Observe state $s$ and select action $a = \text{clip}(\mu_{\theta}(s) + \epsilon, a_{Low}, a_{High})$, where $\epsilon \sim \mathcal{N}$
            \STATE Execute $a$ in the environment
            \STATE Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal
            \STATE Store $(s,a,r,s',d)$ in replay buffer $\mathcal{D}$
            \STATE If $s'$ is terminal, reset environment state.
            \IF{it's time to update}
                \FOR{however many updates}
                    \STATE Randomly sample a batch of transitions, $B = \{ (s,a,r,s',d) \}$ from $\mathcal{D}$
                    \STATE Compute targets
                    \begin{equation*}
                        y(r,s',d) = r + \gamma (1-d) Q_{\phi_{\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s'))
                    \end{equation*}
                    \STATE Update Q-function by one step of gradient descent using
                    \begin{equation*}
                        \nabla_{\phi} \frac{1}{|B|}\sum_{(s,a,r,s',d) \in B} \left( Q_{\phi}(s,a) - y(r,s',d) \right)^2
                    \end{equation*}
                    \STATE Update policy by one step of gradient ascent using
                    \begin{equation*}
                        \nabla_{\theta} \frac{1}{|B|}\sum_{s \in B}Q_{\phi}(s, \mu_{\theta}(s))
                    \end{equation*}
                    \STATE Update target networks with
                    \begin{align*}
                        \phi_{\text{targ}} &\leftarrow \rho \phi_{\text{targ}} + (1-\rho) \phi \\
                        \theta_{\text{targ}} &\leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta
                    \end{align*}
                \ENDFOR
            \ENDIF
        \UNTIL{convergence}
    \end{algorithmic}
    \end{algorithm}


Documentation
=============

.. admonition:: You Should Know

    In what follows, we give documentation for the PyTorch and Tensorflow implementations of DDPG in Spinning Up. They have nearly identical function calls and docstrings, except for details relating to model construction. However, we include both full docstrings for completeness.


Documentation: PyTorch Version
------------------------------

.. autofunction:: spinup.ddpg_pytorch

Saved Model Contents: PyTorch Version
-------------------------------------

The PyTorch saved model can be loaded with ``ac = torch.load('path/to/model.pt')``, yielding an actor-critic object (``ac``) that has the properties described in the docstring for ``ddpg_pytorch``. 

You can get actions from this model with

.. code-block:: python

    actions = ac.act(torch.as_tensor(obs, dtype=torch.float32))


Documentation: Tensorflow Version
---------------------------------

.. autofunction:: spinup.ddpg_tf1

Saved Model Contents: Tensorflow Version
----------------------------------------

The computation graph saved by the logger includes:

========  ====================================================================
Key       Value
========  ====================================================================
``x``     Tensorflow placeholder for state input.
``a``     Tensorflow placeholder for action input.
``pi``    | Deterministically computes an action from the agent, conditioned 
          | on states in ``x``.
``q``     Gives action-value estimate for states in ``x`` and actions in ``a``.
========  ====================================================================

This saved model can be accessed either by

* running the trained policy with the `test_policy.py`_ tool,
* or loading the whole saved graph into a program with `restore_tf_graph`_. 

.. _`test_policy.py`: ../user/saving_and_loading.html#loading-and-running-trained-policies
.. _`restore_tf_graph`: ../utils/logger.html#spinup.utils.logx.restore_tf_graph


References
==========

Relevant Papers
---------------

- `Deterministic Policy Gradient Algorithms`_, Silver et al. 2014
- `Continuous Control With Deep Reinforcement Learning`_, Lillicrap et al. 2016

.. _`Deterministic Policy Gradient Algorithms`: http://proceedings.mlr.press/v32/silver14.pdf
.. _`Continuous Control With Deep Reinforcement Learning`: https://arxiv.org/abs/1509.02971


Why These Papers?
-----------------

Silver 2014 is included because it establishes the theory underlying deterministic policy gradients (DPG). Lillicrap 2016 is included because it adapts the theoretically-grounded DPG algorithm to the deep RL setting, giving DDPG.



Other Public Implementations
----------------------------

- Baselines_
- rllab_ 
- `rllib (Ray)`_ 
- `TD3 release repo`_

.. _Baselines: https://github.com/openai/baselines/tree/master/baselines/ddpg
.. _rllab: https://github.com/rll/rllab/blob/master/rllab/algos/ddpg.py
.. _`rllib (Ray)`: https://github.com/ray-project/ray/tree/master/python/ray/rllib/agents/ddpg
.. _`TD3 release repo`: https://github.com/sfujim/TD3


================================================
FILE: docs/algorithms/ppo.rst
================================================
============================
Proximal Policy Optimization
============================

.. contents:: Table of Contents


Background
==========


(Previously: `Background for TRPO`_)

.. _`Background for TRPO`: ../algorithms/trpo.html#background

PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO.

There are two primary variants of PPO: PPO-Penalty and PPO-Clip. 

**PPO-Penalty** approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it's scaled appropriately. 

**PPO-Clip** doesn't have a KL-divergence term in the objective and doesn't have a constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy. 

Here, we'll focus only on PPO-Clip (the primary variant used at OpenAI).

Quick Facts
-----------

* PPO is an on-policy algorithm.
* PPO can be used for environments with either discrete or continuous action spaces.
* The Spinning Up implementation of PPO supports parallelization with MPI.

Key Equations
-------------

PPO-Clip updates policies via

.. math::

    \theta_{k+1} = \arg \max_{\theta} \underset{s,a \sim \pi_{\theta_k}}{{\mathrm E}}\left[
        L(s,a,\theta_k, \theta)\right],

typically taking multiple steps of (usually minibatch) SGD to maximize the objective. Here :math:`L` is given by

.. math::

    L(s,a,\theta_k,\theta) = \min\left(
    \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}  A^{\pi_{\theta_k}}(s,a), \;\;
    \text{clip}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, 1 - \epsilon, 1+\epsilon \right) A^{\pi_{\theta_k}}(s,a)
    \right),

in which :math:`\epsilon` is a (small) hyperparameter which roughly says how far away the new policy is allowed to go from the old.

This is a pretty complex expression, and it's hard to tell at first glance what it's doing, or how it helps keep the new policy close to the old policy. As it turns out, there's a considerably simplified version [1]_ of this objective which is a bit easier to grapple with (and is also the version we implement in our code):

.. math::

    L(s,a,\theta_k,\theta) = \min\left(
    \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}  A^{\pi_{\theta_k}}(s,a), \;\;
    g(\epsilon, A^{\pi_{\theta_k}}(s,a))
    \right),

where

.. math::

    g(\epsilon, A) = \left\{ 
        \begin{array}{ll}
        (1 + \epsilon) A & A \geq 0 \\
        (1 - \epsilon) A & A < 0.
        \end{array}
        \right.
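
A quick numerical check can make the equivalence of the two forms more believable. The sketch below evaluates both on a small grid of probability ratios and advantages for a single state-action pair; the function names and the grid are chosen for illustration only.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))


def ppo_clip_objective(ratio, adv, eps):
    """Original form: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    return min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)


def ppo_clip_simplified(ratio, adv, eps):
    """Simplified form: min(ratio * A, g(eps, A))."""
    g = (1 + eps) * adv if adv >= 0 else (1 - eps) * adv
    return min(ratio * adv, g)


eps = 0.2
ratios = [0.5, 0.8, 1.0, 1.2, 1.5]
advantages = [-2.0, -1.0, 0.0, 1.0, 2.0]
max_gap = max(
    abs(ppo_clip_objective(r, a, eps) - ppo_clip_simplified(r, a, eps))
    for r in ratios for a in advantages
)
```

On this grid ``max_gap`` is zero up to floating-point error. This is a spot check rather than a proof; the derivation is in the note cited in the footnote.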

To figure out what intuition to take away from this, let's look at a single state-action pair :math:`(s,a)`, and think of cases. 

**Advantage is positive**: Suppose the advantage for that state-action pair is positive, in which case its contribution to the objective reduces to

.. math::
    
    L(s,a,\theta_k,\theta) = \min\left(
    \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, (1 + \epsilon)
    \right)  A^{\pi_{\theta_k}}(s,a).

Because the advantage is positive, the objective will increase if the action becomes more likely---that is, if :math:`\pi_{\theta}(a|s)` increases. But the min in this term puts a limit to how *much* the objective can increase. Once :math:`\pi_{\theta}(a|s) > (1+\epsilon) \pi_{\theta_k}(a|s)`, the min kicks in and this term hits a ceiling of :math:`(1+\epsilon) A^{\pi_{\theta_k}}(s,a)`. Thus: *the new policy does not benefit by going far away from the old policy*.

**Advantage is negative**: Suppose the advantage for that state-action pair is negative, in which case its contribution to the objective reduces to

.. math::
    
    L(s,a,\theta_k,\theta) = \max\left(
    \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, (1 - \epsilon)
    \right)  A^{\pi_{\theta_k}}(s,a).

Because the advantage is negative, the objective will increase if the action becomes less likely---that is, if :math:`\pi_{\theta}(a|s)` decreases. But the max in this term puts a limit to how *much* the objective can increase. Once :math:`\pi_{\theta}(a|s) < (1-\epsilon) \pi_{\theta_k}(a|s)`, the max kicks in and this term hits a ceiling of :math:`(1-\epsilon) A^{\pi_{\theta_k}}(s,a)`. Thus, again: *the new policy does not benefit by going far away from the old policy*.

What we have seen so far is that clipping serves as a regularizer by removing incentives for the policy to change dramatically, and the hyperparameter :math:`\epsilon` corresponds to how far away the new policy can go from the old while still profiting the objective.
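
The two cases above can be checked numerically. What follows is a minimal sketch in plain Python, not the Spinning Up implementation: the function names ``g`` and ``ppo_clip_term`` and the value :math:`\epsilon = 0.2` are assumptions chosen for illustration.

.. code-block:: python

    def g(epsilon, adv):
        """Clipping bound g(epsilon, A) from the simplified PPO-Clip objective."""
        return (1 + epsilon) * adv if adv >= 0 else (1 - epsilon) * adv


    def ppo_clip_term(ratio, adv, epsilon=0.2):
        """Contribution of one (s, a) pair; ratio = pi_theta(a|s) / pi_theta_k(a|s)."""
        return min(ratio * adv, g(epsilon, adv))


    # Positive advantage: the term stops growing once ratio exceeds 1 + epsilon.
    pos = [ppo_clip_term(r, adv=2.0) for r in (0.5, 1.0, 1.2, 1.5, 3.0)]

    # Negative advantage: the term stops growing once ratio drops below 1 - epsilon.
    neg = [ppo_clip_term(r, adv=-2.0) for r in (1.5, 1.0, 0.8, 0.5, 0.1)]

In both lists the last entries are identical: once the ratio has left the interval :math:`[1-\epsilon, 1+\epsilon]` in the direction that improves the objective, moving further yields no additional gain.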

.. admonition:: You Should Know

    While this kind of clipping goes a long way towards ensuring reasonable policy updates, it is still possible to end up with a new policy which is too far from the old policy, and there are a bunch of tricks used by different PPO implementations to stave this off. In our implementation here, we use a particularly simple method: early stopping. If the mean KL-divergence of the new policy from the old grows beyond a threshold, we stop taking gradient steps. 

    When you feel comfortable with the basic math and implementation details, it's worth checking out other implementations to see how they handle this issue!
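
The early-stopping rule described above can be sketched as follows. This is a toy illustration under stated assumptions, not the Spinning Up code: the helper names are hypothetical, the KL-divergence is estimated by the sample mean of :math:`\log \pi_{\theta_k}(a|s) - \log \pi_{\theta}(a|s)`, and the per-iteration log-probabilities are passed in as plain lists instead of being recomputed from a policy network.

.. code-block:: python

    def approx_mean_kl(logp_old, logp_new):
        """Sample estimate of KL(old || new) from actions drawn under the old policy."""
        return sum(lo - ln for lo, ln in zip(logp_old, logp_new)) / len(logp_old)


    def run_policy_updates(logp_old, logp_new_per_iter, kl_threshold):
        """Count the gradient steps taken before the KL check stops the loop.

        logp_new_per_iter[i] stands in for the log-probs under the current
        parameters at the start of iteration i (hypothetical stand-in data).
        """
        steps_taken = 0
        for logp_new in logp_new_per_iter:
            if approx_mean_kl(logp_old, logp_new) > kl_threshold:
                break  # early stopping: the policy has drifted too far
            steps_taken += 1  # a real implementation takes one gradient step here
        return steps_taken


    logp_old = [-1.0, -2.0, -0.5]
    drift = [
        [-1.0, -2.0, -0.5],    # before any update: KL estimate is 0
        [-1.01, -2.0, -0.52],  # small drift
        [-1.05, -2.1, -0.6],   # larger drift, exceeds the threshold below
    ]
    steps = run_policy_updates(logp_old, drift, kl_threshold=0.015)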


.. [1] See `this note`_ for a derivation of the simplified form of the PPO-Clip objective.


.. _`this note`: https://drive.google.com/file/d/1PDzn9RPvaXjJFZkGeapMHbHGiWWW20Ey/view?usp=sharing


Exploration vs. Exploitation
----------------------------

PPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.


Pseudocode
----------

.. math::
    :nowrap:

    \begin{algorithm}[H]
        \caption{PPO-Clip}
        \label{alg1}
    \begin{algorithmic}[1]
        \STATE Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$
        \FOR{$k = 0,1,2,...$} 
        \STATE Collect set of trajectories ${\mathcal D}_k = \{\tau_i\}$ by running policy $\pi_k = \pi(\theta_k)$ in the environment.
        \STATE Compute rewards-to-go $\hat{R}_t$.
        \STATE Compute advantage estimates, $\hat{A}_t$ (using any method of advantage estimation) based on the current value function $V_{\phi_k}$.
        \STATE Update the policy by maximizing the PPO-Clip objective:
            \begin{equation*}
            \theta_{k+1} = \arg \max_{\theta} \frac{1}{|{\mathcal D}_k| T} \sum_{\tau \in {\mathcal D}_k} \sum_{t=0}^T \min\left(
                \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}  A^{\pi_{\theta_k}}(s_t,a_t), \;\;
                g(\epsilon, A^{\pi_{\theta_k}}(s_t,a_t))
            \right),
            \end{equation*}
            typically via stochastic gradient ascent with Adam.
        \STATE Fit value function by regression on mean-squared error:
            \begin{equation*}
            \phi_{k+1} = \arg \min_{\phi} \frac{1}{|{\mathcal D}_k| T} \sum_{\tau \in {\mathcal D}_k} \sum_{t=0}^T\left( V_{\phi} (s_t) - \hat{R}_t \right)^2,
            \end{equation*}
            typically via some gradient descent algorithm.
        \ENDFOR
    \end{algorithmic}
    \end{algorithm}
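
The "compute rewards-to-go" step of the pseudocode can be sketched for a single trajectory as below. This is an illustrative helper, not the Spinning Up buffer code; the discount factor ``gamma`` is an assumption on our part, since the pseudocode does not say whether the rewards-to-go are discounted (``gamma=1.0`` gives the undiscounted sum).

.. code-block:: python

    def rewards_to_go(rewards, gamma=1.0):
        """Return R_t = sum_{t' >= t} gamma^(t' - t) * r_t' for every step t."""
        out = [0.0] * len(rewards)
        running = 0.0
        # Sweep backwards so each entry reuses the sum already computed after it.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            out[t] = running
        return out


    undiscounted = rewards_to_go([1.0, 1.0, 1.0])
    discounted = rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)

These values serve as the regression targets :math:`\hat{R}_t` in the value function fit.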




Documentation
=============

.. admonition:: You Should Know

    In what follows, we give documentation for the PyTorch and Tensorflow implementations of PPO in Spinning Up. They have nearly identical function calls and docstrings, except for details relating to model construction. However, we include both full docstrings for completeness.


Documentation: PyTorch Version
------------------------------

.. autofunction:: spinup.ppo_pytorch

Saved Model Contents: PyTorch Version
-------------------------------------

The PyTorch saved model can be loaded with ``ac = torch.load('path/to/model.pt')``, yielding an actor-critic object (``ac``) that has the properties described in the docstring for ``ppo_pytorch``. 

You can get actions from this model with

.. code-block:: python

    actions = ac.act(torch.as_tensor(obs, dtype=torch.float32))


Documentation: Tensorflow Version
---------------------------------

.. autofunction:: spinup.ppo_tf1

Saved Model Contents: Tensorflow Version
----------------------------------------

The computation graph saved by the logger includes:

========  ====================================================================
Key       Value
========  ====================================================================
``x``     Tensorflow placeholder for state input.
``pi``    Samples an action from the agent, conditioned on states in ``x``.
``v``     Gives value estimate for states in ``x``. 
========  ====================================================================

This saved model can be accessed either by

* running the trained policy with the `test_policy.py`_ tool,
* or loading the whole saved graph into a program with `restore_tf_graph`_. 

.. _`test_policy.py`: ../user/saving_and_loading.html#loading-and-running-trained-policies
.. _`restore_tf_graph`: ../utils/logger.html#spinup.utils.logx.restore_tf_graph

References
==========

Relevant Papers
---------------

- `Proximal Policy Optimization Algorithms`_, Schulman et al. 2017
- `High Dimensional Continuous Control Using Generalized Advantage Estimation`_, Schulman et al. 2016
- `Emergence of Locomotion Behaviours in Rich Environments`_, Heess et al. 2017

.. _`Proximal Policy Optimization Algorithms`: https://arxiv.org/abs/1707.06347
.. _`High Dimensional Continuous Control Using Generalized Advantage Estimation`: https://arxiv.org/abs/1506.02438
.. _`Emergence of Locomotion Behaviours in Rich Environments`: https://arxiv.org/abs/1707.02286

Why These Papers?
-----------------

Schulman 2017 is included because it is the original paper describing PPO. Schulman 2016 is included because our implementation of PPO makes use of Generalized Advantage Estimation for computing the policy gradient. Heess 2017 is included because it presents a large-scale empirical analysis of behaviors learned by PPO agents in complex environments (although it uses PPO-penalty instead of PPO-clip). 



Other Public Implementations
----------------------------

- Baselines_
- ModularRL_ (Caution: this implements PPO-penalty instead of PPO-clip.)
- rllab_ (Caution: this implements PPO-penalty instead of PPO-clip.)
- `rllib (Ray)`_ 

.. _Baselines: https://github.com/openai/baselines/tree/master/baselines/ppo2
.. _ModularRL: https://github.com/joschu/modular_rl/blob/master/modular_rl/ppo.py
.. _rllab: https://github.com/rll/rllab/blob/master/rllab/algos/ppo.py
.. _`rllib (Ray)`: https://github.com/ray-project/ray/tree/master/python/ray/rllib/agents/ppo

================================================
FILE: docs/algorithms/sac.rst
================================================
=================
Soft Actor-Critic
=================

.. contents:: Table of Contents

Background
==========

(Previously: `Background for TD3`_)

.. _`Background for TD3`: ../algorithms/td3.html#background

Soft Actor Critic (SAC) is an algorithm that optimizes a stochastic policy in an off-policy way, forming a bridge between stochastic policy optimization and DDPG-style approaches. It isn't a direct successor to TD3 (having been published roughly concurrently), but it incorporates the clipped double-Q trick, and due to the inherent stochasticity of the policy in SAC, it also winds up benefiting from something like target policy smoothing. 

A central feature of SAC is **entropy regularization.** The policy is trained to maximize a trade-off between expected return and `entropy`_, a measure of randomness in the policy. This has a close connection to the exploration-exploitation trade-off: increasing entropy results in more exploration, which can accelerate learning later on. It can also prevent the policy from prematurely converging to a bad local optimum. 

.. _`entropy`: https://en.wikipedia.org/wiki/Entropy_(information_theory)

Quick Facts
-----------

* SAC is an off-policy algorithm.
* The version of SAC implemented here can only be used for environments with continuous action spaces.
* An alternate version of SAC, which slightly changes the policy update rule, can be implemented to handle discrete action spaces.
* The Spinning Up implementation of SAC does not support parallelization.

Key Equations
-------------

To explain Soft Actor Critic, we first have to introduce the entropy-regularized reinforcement learning setting. In entropy-regularized RL, there are slightly different equations for value functions.

Entropy-Regularized Reinforcement Learning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Entropy is a quantity which, roughly speaking, says how random a random variable is. If a coin is weighted so that it almost always comes up heads, it has low entropy; if it's evenly weighted and has a half chance of either outcome, it has high entropy. 

Let :math:`x` be a random variable with probability mass or density function :math:`P`. The entropy :math:`H` of :math:`x` is computed from its distribution :math:`P` according to

.. math::

    H(P) = \underE{x \sim P}{-\log P(x)}.
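
To make the coin example concrete, here is a small sketch that evaluates the formula for a discrete distribution. The function name ``entropy`` and the specific weighting (99% heads) are assumptions chosen for illustration; the result is in nats because the natural logarithm is used.

.. code-block:: python

    import math


    def entropy(probs):
        """H(P) = E_{x ~ P}[-log P(x)] for a discrete distribution, in nats."""
        return sum(-p * math.log(p) for p in probs if p > 0)


    fair = entropy([0.5, 0.5])        # evenly weighted coin
    weighted = entropy([0.99, 0.01])  # coin that almost always comes up heads

The fair coin attains the maximum possible entropy for two outcomes, :math:`\log 2`, while the heavily weighted coin has entropy close to zero.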

In entropy-regularized reinforcement learning, the agent gets a bonus reward at each time step proportional to the entropy of the policy at that timestep. This changes `the RL problem`_ to:

.. math::

    \pi^* = \arg \max_{\pi} \underE{\tau \sim \pi}{ \sum_{t=0}^{\infty} \gamma^t \bigg( R(s_t, a_t, s_{t+1}) + \alpha H\left(\pi(\cdot|s_t)\right) \bigg)},

where :math:`\alpha > 0` is the trade-off coefficient. (Note: we're assuming an infinite-horizon discounted setting here, and we'll do the same for the rest of this page.) We can now define the slightly different value functions in this setting. :math:`V^{\pi}` is changed to include the entropy bonuses from every timestep:

.. math::

    V^{\pi}(s) = \underE{\tau \sim \pi}{ \left. \sum_{t=0}^{\infty} \gamma^t \bigg( R(s_t, a_t, s_{t+1}) + \alpha H\left(\pi(\cdot|s_t)\right) \bigg) \right| s_0 = s}

:math:`Q^{\pi}` is changed to include the entropy bonuses from every timestep *except the first*:

.. math::

    Q^{\pi}(s,a) = \underE{\tau \sim \pi}{ \left. \sum_{t=0}^{\infty} \gamma^t  R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^t H\left(\pi(\cdot|s_t)\right)\right| s_0 = s, a_0 = a}

With these definitions, :math:`V^{\pi}` and :math:`Q^{\pi}` are connected by:

.. math::

    V^{\pi}(s) = \underE{a \sim \pi}{Q^{\pi}(s,a)} + \alpha H\left(\pi(\cdot|s)\right)

and the Bellman equation for :math:`Q^{\pi}` is

.. math::

    Q^{\pi}(s,a) &= \underE{s' \sim P \\ a' \sim \pi}{R(s,a,s') + \gamma\left(Q^{\pi}(s',a') + \alpha H\left(\pi(\cdot|s')\right) \right)} \\
    &= \underE{s' \sim P}{R(s,a,s') + \gamma V^{\pi}(s')}.

.. _`the RL problem`: ../spinningup/rl_intro.html#the-rl-problem

.. admonition:: You Should Know

    The way we've set up the value functions in the entropy-regularized setting is a little bit arbitrary, and actually we could have done it differently (e.g., by making :math:`Q^{\pi}` include the entropy bonus at the first timestep). The choice of definition may vary slightly across papers on the subject.


Soft Actor-Critic
^^^^^^^^^^^^^^^^^

SAC concurrently learns a policy :math:`\pi_{\theta}` and two Q-functions :math:`Q_{\phi_1}, Q_{\phi_2}`. There are two variants of SAC that are currently standard: one that uses a fixed entropy regularization coefficient :math:`\alpha`, and another that enforces an entropy constraint by varying :math:`\alpha` over the course of training. For simplicity, Spinning Up makes use of the version with a fixed entropy regularization coefficient, but the entropy-constrained variant is generally preferred by practitioners.

.. admonition:: You Should Know

    The SAC algorithm has changed a little bit over time. An older version of SAC also learns a value function :math:`V_{\psi}` in addition to the Q-functions; this page will focus on the modern version that omits the extra value function.



**Learning Q.** The Q-functions are learned in a similar way to TD3, but with a few key differences. 

First, what's similar?

1) Like in TD3, both Q-functions are learned with MSBE minimization, by regressing to a single shared target.

2) Like in TD3, the shared target is computed using target Q-networks, and the target Q-networks are obtained by polyak averaging the Q-network parameters over the course of training.

3) Like in TD3, the shared target makes use of the **clipped double-Q** trick.

What's different?

1) Unlike in TD3, the target also includes a term that comes from SAC's use of entropy regularization.

2) Unlike in TD3, the next-state actions used in the target come from the **current policy** instead of a target policy.

3) Unlike in TD3, there is no explicit target policy smoothing. TD3 trains a deterministic policy, and so it accomplishes smoothing by adding random noise to the next-state actions. SAC trains a stochastic policy, and so the noise from that stochasticity is sufficient to get a similar effect.

Before we give the final form of the Q-loss, let's take a moment to discuss how the contribution from entropy regularization comes in. We'll start by taking our recursive Bellman equation for the entropy-regularized :math:`Q^{\pi}` from earlier, and rewriting it a little bit by using the definition of entropy:

.. math::

    Q^{\pi}(s,a) &= \underE{s' \sim P \\ a' \sim \pi}{R(s,a,s') + \gamma\left(Q^{\pi}(s',a') + \alpha H\left(\pi(\cdot|s')\right) \right)} \\
    &= \underE{s' \sim P \\ a' \sim \pi}{R(s,a,s') + \gamma\left(Q^{\pi}(s',a') - \alpha \log \pi(a'|s') \right)} 

The RHS is an expectation over next states (which come from the replay buffer) and next actions (which come from the current policy, and **not** the replay buffer). Since it's an expectation, we can approximate it with samples:

.. math::

    Q^{\pi}(s,a) &\approx r + \gamma\left(Q^{\pi}(s',\tilde{a}') - \alpha \log \pi(\tilde{a}'|s') \right), \;\;\;\;\;  \tilde{a}' \sim \pi(\cdot|s').

.. admonition:: You Should Know

    We switch next action notation to :math:`\tilde{a}'`, instead of :math:`a'`, to highlight that the next actions have to be sampled fresh from the policy (whereas by contrast, :math:`r` and :math:`s'` should come from the replay buffer).

SAC sets up the MSBE loss for each Q-function using this kind of sample approximation for the target. The only thing still undetermined here is which Q-function gets used to compute the sample backup: like TD3, SAC uses the clipped double-Q trick, and takes the minimum Q-value between the two Q approximators. 

Putting it all together, the loss functions for the Q-networks in SAC are:

.. math::

    L(\phi_i, {\mathcal D}) = \underset{(s,a,r,s',d) \sim {\mathcal D}}{{\mathrm E}}\left[
        \Bigg( Q_{\phi_i}(s,a) - y(r,s',d) \Bigg)^2
        \right],

where the target is given by

.. math::

    y(r, s', d) = r + \gamma (1 - d) \left( \min_{j=1,2} Q_{\phi_{\text{targ},j}}(s', \tilde{a}') - \alpha \log \pi_{\theta}(\tilde{a}'|s') \right), \;\;\;\;\; \tilde{a}' \sim \pi_{\theta}(\cdot|s').
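
For a single transition, the target above reduces to a few arithmetic operations. The sketch below assumes the two target Q-values and the log-probability of the freshly sampled next action have already been computed; the function name ``sac_target`` and the default values of ``gamma`` and ``alpha`` are illustrative assumptions.

.. code-block:: python

    def sac_target(r, d, q1_targ, q2_targ, logp_next, gamma=0.99, alpha=0.2):
        """y = r + gamma * (1 - d) * (min_j Q_targ_j(s', a') - alpha * log pi(a'|s')).

        q1_targ, q2_targ: target-network Q-values at (s', a'), with a' ~ pi(.|s')
        logp_next:        log pi(a'|s') for that same sampled action
        """
        return r + gamma * (1 - d) * (min(q1_targ, q2_targ) - alpha * logp_next)


    y_nonterminal = sac_target(r=1.0, d=0, q1_targ=10.0, q2_targ=8.0, logp_next=-1.5)
    y_terminal = sac_target(r=1.0, d=1, q1_targ=10.0, q2_targ=8.0, logp_next=-1.5)

Note that a low log-probability for the next action (here ``-1.5``) raises the target, which is how the entropy bonus enters the backup. When ``d = 1``, the bootstrap term vanishes and the target is just the reward.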


**Learning the Policy.** The policy should, in each state, act to maximize the expected future return plus expected future entropy. That is, it should maximize :math:`V^{\pi}(s)`, which we expand out into

.. math::

    V^{\pi}(s) &= \underE{a \sim \pi}{Q^{\pi}(s,a)} + \alpha H\left(\pi(\cdot|s)\right) \\
    &= \underE{a \sim \pi}{Q^{\pi}(s,a) - \alpha \log \pi(a|s)}.

The way we optimize the policy makes use of the **reparameterization trick**, in which a sample from :math:`\pi_{\theta}(\cdot|s)` is drawn by computing a deterministic function of state, policy parameters, and independent noise. To illustrate: following the authors of the SAC paper, we use a squashed Gaussian policy, which means that samples are obtained according to

.. math::

    \tilde{a}_{\theta}(s, \xi) = \tanh\left( \mu_{\theta}(s) + \sigma_{\theta}(s) \odot \xi \right), \;\;\;\;\; \xi \sim \mathcal{N}(0, I).
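
The sampling rule above can be sketched in plain Python as follows. This is a minimal illustration, not the Spinning Up implementation: ``mu`` and ``log_std`` are passed in as lists instead of being produced by a network, the function name is hypothetical, and the small constant ``1e-6`` inside the logarithm is an assumption added for numerical safety. The log-probability uses the standard change-of-variables correction for the :math:`\tanh`, which subtracts :math:`\log(1 - \tanh^2(u))` per action dimension.

.. code-block:: python

    import math
    import random


    def squashed_gaussian_sample(mu, log_std, rng):
        """Sample a = tanh(mu + sigma * xi), xi ~ N(0, I); return (a, log pi(a|s))."""
        action, logp = [], 0.0
        for m, ls in zip(mu, log_std):
            sigma = math.exp(ls)
            xi = rng.gauss(0.0, 1.0)  # noise that does not depend on policy parameters
            u = m + sigma * xi        # pre-squash Gaussian sample
            a = math.tanh(u)          # squash into (-1, 1)
            # Log-density of u under N(m, sigma^2), written in terms of xi.
            logp += -0.5 * xi * xi - ls - 0.5 * math.log(2.0 * math.pi)
            # Change of variables for tanh: subtract log(1 - tanh(u)^2).
            logp -= math.log(1.0 - a * a + 1e-6)
            action.append(a)
        return action, logp


    rng = random.Random(0)
    a, logp = squashed_gaussian_sample([0.0, 3.0], [0.0, -1.0], rng)

Because the noise ``xi`` is drawn independently of ``mu`` and ``log_std``, the action is a deterministic function of those parameters once ``xi`` is fixed, which is what lets gradients flow through the sample.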

.. admonition:: You Should Know

    This policy has two key differences from the policies we use in the other policy optimization algorithms:

    **1. The squashing function.** The :math:`\tanh` in the SAC policy ensures that actions are bounded to a finite range. This is absent in the VPG, TRPO, and PPO policies. It also changes the distribution: before the :math:`\tanh` the SAC policy is a factored Gaussian like the other algorithms' policies, but after the :math:`\tanh` it is not. (You can still compute the log-probabilities of actions in closed form, though: see the paper appendix for details.)

    **2. The way standard deviations are parameterized.** In VPG, TRPO, and PPO, we represent the log std devs with state-independent parameter vectors. In SAC, we represent the log std devs as outputs from the neural network, meaning that they depend on state in a complex way. SAC with state-independent log std devs, in our experience, did not work. (Can you think of why? Or better yet: run an experiment to verify?)

The reparameterization trick allows us to rewrite the expectation over actions (which contains a pain point: the distribution depends on the policy parameters) into an expectation over noise (which removes the pain point: the distribution now has no dependence on parameters):

.. math::

    \underE{a \sim \pi_{\theta}}{Q^{\pi_{\theta}}(s,a) - \alpha \log \pi_{\theta}(a|s)} = \underE{\xi \sim \mathcal{N}}{Q^{\pi_{\theta}}(s,\tilde{a}_{\theta}(s,\xi)) - \alpha \log \pi_{\theta}(\tilde{a}_{\theta}(s,\xi)|s)}

To get the policy loss, the final step is that we need to substitute :math:`Q^{\pi_{\theta}}` with one of our function approximators. Unlike in TD3, which uses :math:`Q_{\phi_1}` (just the first Q approximator), SAC uses :math:`\min_{j=1,2} Q_{\phi_j}` (the minimum of the two Q approximators). The policy is thus optimized according to

.. math::

    \max_{\theta} \underE{s \sim \mathcal{D} \\ \xi \sim \mathcal{N}}{\min_{j=1,2} Q_{\phi_j}(s,\tilde{a}_{\theta}(s,\xi)) - \alpha \log \pi_{\theta}(\tilde{a}_{\theta}(s,\xi)|s)},

which is almost the same as the DDPG and TD3 policy optimization, except for the min-double-Q trick, the stochasticity, and the entropy term.


Exploration vs. Exploitation
----------------------------

SAC trains a stochastic policy with entropy regularization, and explores by sampling actions from its current policy (that is, exploration is on-policy even though the updates are off-policy). The entropy regularization coefficient :math:`\alpha` explicitly controls the explore-exploit tradeoff, with higher :math:`\alpha` corresponding to more exploration, and lower :math:`\alpha` corresponding to more exploitation. The right coefficient (the one which leads to the most stable / highest-reward learning) may vary from environment to environment, and could require careful tuning.

At test time, to see how well the policy exploits what it has learned, we remove stochasticity and use the mean action instead of a sample from the distribution. This tends to improve performance over the original stochastic policy.

.. admonition:: You Should Know

    Our SAC implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the ``start_steps`` keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal SAC exploration.
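
The ``start_steps`` trick can be sketched as below. This is a hedged illustration and not the Spinning Up code: the function ``select_action``, its arguments, and the exact boundary condition (``t < start_steps``) are assumptions, and the policy is represented by a hypothetical zero-argument callable.

.. code-block:: python

    import random


    def select_action(t, start_steps, policy_sample, act_low, act_high, rng):
        """Uniform random actions for the first start_steps steps, then the policy."""
        if t < start_steps:
            # Sample each action dimension uniformly over its valid range.
            return [rng.uniform(lo, hi) for lo, hi in zip(act_low, act_high)]
        return policy_sample()


    rng = random.Random(0)
    policy = lambda: [0.0, 0.0]  # stand-in for sampling from the learned policy
    early = select_action(0, 10, policy, [-1.0, -2.0], [1.0, 2.0], rng)
    later = select_action(10, 10, policy, [-1.0, -2.0], [1.0, 2.0], rng)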


Pseudocode
----------


.. math::
    :nowrap:

    \begin{algorithm}[H]
        \caption{Soft Actor-Critic}
        \label{alg1}
    \begin{algorithmic}[1]
        \STATE Input: initial policy parameters $\theta$, Q-function parameters $\phi_1$, $\phi_2$, empty replay buffer $\mathcal{D}$
        \STATE Set target parameters equal to main parameters $\phi_{\text{targ},1} \leftarrow \phi_1$, $\phi_{\text{targ},2} \leftarrow \phi_2$
        \REPEAT
            \STATE Observe state $s$ and select action $a \sim \pi_{\theta}(\cdot|s)$
            \STATE Execute $a$ in the environment
            \STATE Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal
            \STATE Store $(s,a,r,s',d)$ in replay buffer $\mathcal{D}$
            \STATE If $s'$ is terminal, reset environment state.
            \IF{it's time to update}
                \FOR{$j$ in range(however many updates)}
                    \STATE Randomly sample a batch of transitions, $B = \{ (s,a,r,s',d) \}$ from $\mathcal{D}$
                    \STATE Compute targets for the Q functions:
                    \begin{align*}
                        y (r,s',d) &= r + \gamma (1-d) \left(\min_{i=1,2} Q_{\phi_{\text{targ}, i}} (s', \tilde{a}') - \alpha \log \pi_{\theta}(\tilde{a}'|s')\right), && \tilde{a}' \sim \pi_{\theta}(\cdot|s')
                    \end{align*}
                    \STATE Update Q-functions by one step of gradient descent using
                    \begin{align*}
                        & \nabla_{\phi_i} \frac{1}{|B|}\sum_{(s,a,r,s',d) \in B} \left( Q_{\phi_i}(s,a) - y(r,s',d) \right)^2 && \text{for } i=1,2
                    \end{align*}
                    \STATE Update policy by one step of gradient ascent using
                    \begin{equation*}
                        \nabla_{\theta} \frac{1}{|B|}\sum_{s \in B} \Big(\min_{i=1,2} Q_{\phi_i}(s, \tilde{a}_{\theta}(s)) - \alpha \log \pi_{\theta} \left(\left. \tilde{a}_{\theta}(s) \right| s\right) \Big),
                    \end{equation*}
                    where $\tilde{a}_{\theta}(s)$ is a sample from $\pi_{\theta}(\cdot|s)$ which is differentiable with respect to $\theta$ via the reparameterization trick.
                    \STATE Update target networks with
                    \begin{align*}
                        \phi_{\text{targ},i} &\leftarrow \rho \phi_{\text{targ}, i} + (1-\rho) \phi_i && \text{for } i=1,2
                    \end{align*}
                \ENDFOR
            \ENDIF
        \UNTIL{convergence}
    \end{algorithmic}
    \end{algorithm}
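
The target network update in the last step of the pseudocode is a polyak (exponential moving) average. A minimal sketch on flat lists of parameters follows; the function name and the value of ``rho`` are illustrative assumptions.

.. code-block:: python

    def polyak_update(targ_params, params, rho=0.995):
        """Return rho * phi_targ + (1 - rho) * phi, elementwise."""
        return [rho * pt + (1.0 - rho) * p for pt, p in zip(targ_params, params)]


    one_step = polyak_update([0.0], [1.0], rho=0.9)

    # Repeated updates move the target slowly toward the (here fixed) main parameters.
    targ = [0.0]
    for _ in range(2000):
        targ = polyak_update(targ, [1.0])

With :math:`\rho` close to 1 the target parameters change slowly, which keeps the regression target for the Q-functions relatively stable between updates.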


Documentation
=============

.. admonition:: You Should Know

    In what follows, we give documentation for the PyTorch and Tensorflow implementations of SAC in Spinning Up. They have nearly identical function calls and docstrings, except for details relating to model construction. However, we include both full docstrings for completeness.



Documentation: PyTorch Version
------------------------------

.. autofunction:: spinup.sac_pytorch

Saved Model Contents: PyTorch Version
-------------------------------------

The PyTorch saved model can be loaded with ``ac = torch.load('path/to/model.pt')``, yielding an actor-critic object (``ac``) that has the properties described in the docstring for ``sac_pytorch``. 

You can get actions from this model with

.. code-block:: python

    actions = ac.act(torch.as_tensor(obs, dtype=torch.float32))


Documentation: Tensorflow Version
---------------------------------

.. autofunction:: spinup.sac_tf1

Saved Model Contents: Tensorflow Version
----------------------------------------

The computation graph saved by the logger includes:

========  ====================================================================
Key       Value
========  ====================================================================
``x``     Tensorflow placeholder for state input.
``a``     Tensorflow placeholder for action input.
``mu``    Deterministically computes mean action from the agent, given states in ``x``. 
``pi``    Samples an action from the agent, conditioned on states in ``x``.
``q1``    Gives one action-value estimate for states in ``x`` and actions in ``a``.
``q2``    Gives the other action-value estimate for states in ``x`` and actions in ``a``.
``v``     Gives the value estimate for states in ``x``. 
========  ====================================================================

This saved model can be accessed either by

* running the trained policy with the `test_policy.py`_ tool,
* or loading the whole saved graph into a program with `restore_tf_graph`_. 

Note: for SAC, the correct evaluation policy is given by ``mu`` and not by ``pi``. The policy ``pi`` may be thought of as the exploration policy, while ``mu`` is the exploitation policy.

.. _`test_policy.py`: ../user/saving_and_loading.html#loading-and-running-trained-policies
.. _`restore_tf_graph`: ../utils/logger.html#spinup.utils.logx.restore_tf_graph


References
==========

Relevant Papers
---------------

- `Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor`_, Haarnoja et al. 2018
- `Soft Actor-Critic Algorithms and Applications`_, Haarnoja et al. 2018
- `Learning to Walk via Deep Reinforcement Learning`_, Haarnoja et al. 2018

.. _`Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor`: https://arxiv.org/abs/1801.01290
.. _`Soft Actor-Critic Algorithms and Applications`: https://arxiv.org/abs/1812.05905
.. _`Learning to Walk via Deep Reinforcement Learning`: https://arxiv.org/abs/1812.11103

Other Public Implementations
----------------------------

- `SAC release repo`_ (original "official" codebase)
- `Softlearning repo`_ (current "official" codebase)
- `Yarats and Kostrikov repo`_

.. _`SAC release repo`: https://github.com/haarnoja/sac
.. _`Softlearning repo`: https://github.com/rail-berkeley/softlearning
.. _`Yarats and Kostrikov repo`: https://github.com/denisyarats/pytorch_sac

================================================
FILE: docs/algorithms/td3.rst
================================================
=================
Twin Delayed DDPG
=================

.. contents:: Table of Contents

Background
==========

(Previously: `Background for DDPG`_)

.. _`Background for DDPG`: ../algorithms/ddpg.html#background

While DDPG can achieve great performance sometimes, it is frequently brittle with respect to hyperparameters and other kinds of tuning. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. Twin Delayed DDPG (TD3) is an algorithm that addresses this issue by introducing three critical tricks:

**Trick One: Clipped Double-Q Learning.** TD3 learns *two* Q-functions instead of one (hence "twin"), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.

**Trick Two: "Delayed" Policy Updates.** TD3 updates the policy (and target networks) less frequently than the Q-function. The paper recommends one policy update for every two Q-function updates.

**Trick Three: Target Policy Smoothing.** TD3 adds noise to the target action, to make it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.

Together, these three tricks result in substantially improved performance over baseline DDPG.

Quick Facts
-----------

* TD3 is an off-policy algorithm.
* TD3 can only be used for environments with continuous action spaces.
* The Spinning Up implementation of TD3 does not support parallelization.

Key Equations
-------------

TD3 concurrently learns two Q-functions, :math:`Q_{\phi_1}` and :math:`Q_{\phi_2}`, by mean square Bellman error minimization, in almost the same way that DDPG learns its single Q-function. To show exactly how TD3 does this and how it differs from normal DDPG, we'll work from the innermost part of the loss function outwards.

First: **target policy smoothing**. Actions used to form the Q-learning target are based on the target policy, :math:`\mu_{\theta_{\text{targ}}}`, but with clipped noise added on each dimension of the action. After adding the clipped noise, the target action is then clipped to lie in the valid action range (all valid actions, :math:`a`, satisfy :math:`a_{Low} \leq a \leq a_{High}`). The target actions are thus: 

.. math::

    a'(s') = \text{clip}\left(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon,-c,c), a_{Low}, a_{High}\right), \;\;\;\;\; \epsilon \sim \mathcal{N}(0, \sigma)

Target policy smoothing essentially serves as a regularizer for the algorithm. It addresses a particular failure mode that can happen in DDPG: if the Q-function approximator develops an incorrect sharp peak for some actions, the policy will quickly exploit that peak and then have brittle or incorrect behavior. This can be averted by smoothing out the Q-function over similar actions, which target policy smoothing is designed to do. 
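
The target action formula can be sketched per action dimension as follows. This is an illustration in plain Python, not the Spinning Up implementation: the helper names are hypothetical, and we assume that :math:`\sigma` in :math:`\mathcal{N}(0, \sigma)` denotes the standard deviation of the noise.

.. code-block:: python

    import random


    def clip(x, lo, hi):
        """Clamp x to the interval [lo, hi]."""
        return max(lo, min(x, hi))


    def smoothed_target_action(mu_targ, act_low, act_high, sigma, c, rng):
        """a'(s') = clip(mu_targ(s') + clip(eps, -c, c), a_Low, a_High)."""
        out = []
        for m, lo, hi in zip(mu_targ, act_low, act_high):
            eps = clip(rng.gauss(0.0, sigma), -c, c)  # clipped noise
            out.append(clip(m + eps, lo, hi))         # keep action in valid range
        return out


    rng = random.Random(0)
    a_targ = smoothed_target_action([0.0, 0.95], [-1.0, -1.0], [1.0, 1.0],
                                    sigma=0.2, c=0.5, rng=rng)

The inner clip bounds how far the target action can move from the target policy's output, and the outer clip guarantees the result is still a valid action.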

Next: **clipped double-Q learning**. Both Q-functions use a single target, calculated using whichever of the two Q-functions gives a smaller target value:

.. math::

    y(r,s',d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', a'(s')),

and then both are learned by regressing to this target:

.. math::

    L(\phi_1, {\mathcal D}) = \underE{(s,a,r,s',d) \sim {\mathcal D}}{
        \Bigg( Q_{\phi_1}(s,a) - y(r,s',d) \Bigg)^2
        },

.. math::

    L(\phi_2, {\mathcal D}) = \underE{(s,a,r,s',d) \sim {\mathcal D}}{
        \Bigg( Q_{\phi_2}(s,a) - y(r,s',d) \Bigg)^2
        }.

Using the smaller Q-value for the target, and regressing towards that, helps fend off overestimation in the Q-function.
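
To show how one shared target feeds both losses, here is a small sketch over a batch of precomputed Q-values. The function name and the dictionary layout of a transition are assumptions made for illustration; in a real implementation the Q-values would come from the networks.

.. code-block:: python

    def td3_q_losses(batch, gamma=0.99):
        """Mean squared Bellman error for both Q-functions, using one shared target.

        Each transition is a dict with keys:
          r, d             reward and done signal
          q1, q2           current estimates Q_phi_1(s, a), Q_phi_2(s, a)
          q1_targ, q2_targ target-network estimates at (s', a'(s'))
        """
        loss1 = loss2 = 0.0
        for tr in batch:
            # Clipped double-Q: bootstrap from the smaller target value.
            y = tr["r"] + gamma * (1 - tr["d"]) * min(tr["q1_targ"], tr["q2_targ"])
            loss1 += (tr["q1"] - y) ** 2
            loss2 += (tr["q2"] - y) ** 2
        n = len(batch)
        return loss1 / n, loss2 / n


    batch = [
        {"r": 1.0, "d": 0, "q1": 5.0, "q2": 4.0, "q1_targ": 5.0, "q2_targ": 4.0},
        {"r": 0.0, "d": 1, "q1": 1.0, "q2": -1.0, "q1_targ": 100.0, "q2_targ": 100.0},
    ]
    loss1, loss2 = td3_q_losses(batch, gamma=0.5)

In the second transition the done signal zeroes out the bootstrap term, so the large target Q-values have no effect on the loss.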

Lastly: the policy is learned just by maximizing :math:`Q_{\phi_1}`:

.. math::

    \max_{\theta} \underset{s \sim {\mathcal D}}{{\mathrm E}}\left[ Q_{\phi_1}(s, \mu_{\theta}(s)) \right],

which is pretty much unchanged from DDPG. However, in TD3, the policy is updated less frequently than the Q-functions are. This helps damp the volatility that normally arises in DDPG because of how a policy update changes the target.
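
The delayed update schedule can be sketched as below. This only records which updates would happen at each iteration; the function name is hypothetical, and ``policy_delay=2`` follows the one-policy-update-per-two-Q-updates recommendation mentioned in the background section.

.. code-block:: python

    def update_schedule(num_updates, policy_delay=2):
        """List which components are updated at each iteration j."""
        schedule = []
        for j in range(num_updates):
            steps = ["q"]  # both Q-functions are updated every iteration
            if j % policy_delay == 0:
                # Policy and target networks are updated together, less often.
                steps += ["policy", "targets"]
            schedule.append(steps)
        return schedule


    schedule = update_schedule(4)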


Exploration vs. Exploitation
----------------------------

TD3 trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make TD3 policies explore better, we add noise to their actions at training time, typically uncorrelated mean-zero Gaussian noise. To facilitate getting higher-quality training data, you may reduce the scale of the noise over the course of training. (We do not do this in our implementation, and keep noise scale fixed throughout.)

At test time, to see how well the policy exploits what it has learned, we do not add noise to the actions.

.. admonition:: You Should Know

    Our TD3 implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the ``start_steps`` keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal TD3 exploration.


Pseudocode
----------


.. math::
    :nowrap:

    \begin{algorithm}[H]
        \caption{Twin Delayed DDPG}
        \label{alg1}
    \begin{algorithmic}[1]
        \STATE Input: initial policy parameters $\theta$, Q-function parameters $\phi_1$, $\phi_2$, empty replay buffer $\mathcal{D}$
        \STATE Set target parameters equal to main parameters $\theta_{\text{targ}} \leftarrow \theta$, $\phi_{\text{targ},1} \leftarrow \phi_1$, $\phi_{\text{targ},2} \leftarrow \phi_2$
        \REPEAT
            \STATE Observe state $s$ and select action $a = \text{clip}(\mu_{\theta}(s) + \epsilon, a_{Low}, a_{High})$, where $\epsilon \sim \mathcal{N}$
            \STATE Execute $a$ in the environment
            \STATE Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal
            \STATE Store $(s,a,r,s',d)$ in replay buffer $\mathcal{D}$
            \STATE If $s'$ is terminal, reset environment state.
            \IF{it's time to update}
                \FOR{$j$ in range(however many updates)}
                    \STATE Randomly sample a batch of transitions, $B = \{ (s,a,r,s',d) \}$ from $\mathcal{D}$
                    \STATE Compute target actions
                    \begin{equation*}
                        a'(s') = \text{clip}\left(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon,-c,c), a_{Low}, a_{High}\right), \;\;\;\;\; \epsilon \sim \mathcal{N}(0, \sigma)
                    \end{equation*}
                    \STATE Compute targets
                    \begin{equation*}
                        y(r,s',d) = r + \gamma (1-d) \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', a'(s'))
                    \end{equation*}
                    \STATE Update Q-functions by one step of gradient descent using
                    \begin{align*}
                        & \nabla_{\phi_i} \frac{1}{|B|}\sum_{(s,a,r,s',d) \in B} \left( Q_{\phi_i}(s,a) - y(r,s',d) \right)^2 && \text{for } i=1,2
                    \end{align*}
                    \IF{ $j \mod$ \texttt{policy\_delay} $ = 0$}
                        \STATE Update policy by one step of gradient ascent using
                        \begin{equation*}
                            \nabla_{\theta} \frac{1}{|B|}\sum_{s \in B}Q_{\phi_1}(s, \mu_{\theta}(s))
                        \end{equation*}
                        \STATE Update target networks with
                        \begin{align*}
                            \phi_{\text{targ},i} &\leftarrow \rho \phi_{\text{targ}, i} + (1-\rho) \phi_i && \text{for } i=1,2\\
                            \theta_{\text{targ}} &\leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta
                        \end{align*}
                    \ENDIF
                \ENDFOR
            \ENDIF
        \UNTIL{convergence}
    \end{algorithmic}
    \end{algorithm}
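
The target computation and the target-network update in the pseudocode can be written out for a single transition as follows. This is a hedged sketch in plain Python, assuming the target policy and target Q-functions are ordinary callables; the names ``td3_target``, ``polyak_update``, ``target_noise`` and ``noise_clip`` are chosen for this example and are not taken from the pseudocode.

```python
import random

def clip(x, lo, hi):
    return min(max(x, lo), hi)

def td3_target(r, s2, d, pi_targ, q1_targ, q2_targ, gamma,
               target_noise, noise_clip, act_low, act_high, rng=random):
    """Compute y(r, s', d) for one transition (illustrative helper)."""
    # Target policy smoothing: clipped Gaussian noise on the target action.
    a2 = []
    for mu, lo, hi in zip(pi_targ(s2), act_low, act_high):
        eps = clip(rng.gauss(0.0, target_noise), -noise_clip, noise_clip)
        a2.append(clip(mu + eps, lo, hi))
    # Clipped double-Q: back up the smaller of the two target Q-values.
    q_min = min(q1_targ(s2, a2), q2_targ(s2, a2))
    return r + gamma * (1.0 - d) * q_min

def polyak_update(targ_params, params, rho):
    """targ <- rho * targ + (1 - rho) * main, element-wise."""
    return [rho * pt + (1.0 - rho) * p for pt, p in zip(targ_params, params)]
```

With ``d = 1`` the bootstrap term vanishes and the target reduces to the reward alone.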


Documentation
=============

.. admonition:: You Should Know

    In what follows, we give documentation for the PyTorch and Tensorflow implementations of TD3 in Spinning Up. They have nearly identical function calls and docstrings, except for details relating to model construction. However, we include both full docstrings for completeness.



Documentation: PyTorch Version
------------------------------

.. autofunction:: spinup.td3_pytorch

Saved Model Contents: PyTorch Version
-------------------------------------

The PyTorch saved model can be loaded with ``ac = torch.load('path/to/model.pt')``, yielding an actor-critic object (``ac``) that has the properties described in the docstring for ``td3_pytorch``. 

You can get actions from this model with

.. code-block:: python

    actions = ac.act(torch.as_tensor(obs, dtype=torch.float32))


Documentation: Tensorflow Version
---------------------------------

.. autofunction:: spinup.td3_tf1

Saved Model Contents: Tensorflow Version
----------------------------------------

The computation graph saved by the logger includes:

========  ====================================================================
Key       Value
========  ====================================================================
``x``     Tensorflow placeholder for state input.
``a``     Tensorflow placeholder for action input.
``pi``    | Deterministically computes an action from the agent, conditioned 
          | on states in ``x``.
``q1``    Gives one action-value estimate for states in ``x`` and actions in ``a``.
``q2``    Gives the other action-value estimate for states in ``x`` and actions in ``a``.
========  ====================================================================

This saved model can be accessed either by

* running the trained policy with the `test_policy.py`_ tool,
* or loading the whole saved graph into a program with `restore_tf_graph`_. 

.. _`test_policy.py`: ../user/saving_and_loading.html#loading-and-running-trained-policies
.. _`restore_tf_graph`: ../utils/logger.html#spinup.utils.logx.restore_tf_graph

References
==========

Relevant Papers
---------------

- `Addressing Function Approximation Error in Actor-Critic Methods`_, Fujimoto et al. 2018

.. _`Addressing Function Approximation Error in Actor-Critic Methods`: https://arxiv.org/abs/1802.09477


Other Public Implementations
----------------------------

- `TD3 release repo`_

.. _`TD3 release repo`: https://github.com/sfujim/TD3

================================================
FILE: docs/algorithms/trpo.rst
================================================
================================
Trust Region Policy Optimization
================================

.. contents:: Table of Contents



Background
==========

(Previously: `Background for VPG`_)

.. _`Background for VPG`: ../algorithms/vpg.html#background

TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be. The constraint is expressed in terms of `KL-Divergence`_, a measure of (something like, but not exactly) distance between probability distributions. 

This is different from normal policy gradient, which keeps new and old policies close in parameter space. But even seemingly small differences in parameter space can produce very large differences in performance---so a single bad step can collapse the policy performance. This makes it dangerous to use large step sizes with vanilla policy gradients, which hurts their sample efficiency. TRPO nicely avoids this kind of collapse, and tends to quickly and monotonically improve performance.

.. _`KL-Divergence`: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

Quick Facts
-----------

* TRPO is an on-policy algorithm.
* TRPO can be used for environments with either discrete or continuous action spaces.
* The Spinning Up implementation of TRPO supports parallelization with MPI.

Key Equations
-------------

Let :math:`\pi_{\theta}` denote a policy with parameters :math:`\theta`. The theoretical TRPO update is:

.. math:: 
    
    \theta_{k+1} = \arg \max_{\theta} \; & {\mathcal L}(\theta_k, \theta) \\
    \text{s.t.} \; & \bar{D}_{KL}(\theta || \theta_k) \leq \delta

where :math:`{\mathcal L}(\theta_k, \theta)` is the *surrogate advantage*, a measure of how policy :math:`\pi_{\theta}` performs relative to the old policy :math:`\pi_{\theta_k}` using data from the old policy:

.. math::

    {\mathcal L}(\theta_k, \theta) = \underE{s,a \sim \pi_{\theta_k}}{
        \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a)
        },

and :math:`\bar{D}_{KL}(\theta || \theta_k)` is an average KL-divergence between policies across states visited by the old policy:

.. math::

    \bar{D}_{KL}(\theta || \theta_k) = \underE{s \sim \pi_{\theta_k}}{
        D_{KL}\left(\pi_{\theta}(\cdot|s) || \pi_{\theta_k} (\cdot|s) \right)
    }.
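
To make the two quantities concrete, here is a small sample-based sketch for policies over a discrete action set. It assumes each policy is represented by a list of per-state probability vectors with full support; the function names are hypothetical and the KL-divergence follows the argument order used in the formula above (new policy first).

```python
import math

def kl_categorical(p, q):
    """D_KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def surrogate_advantage(new_probs, old_probs, actions, advantages):
    """Sample estimate of L(theta_k, theta): mean of probability ratio times advantage."""
    total = 0.0
    for new_p, old_p, a, adv in zip(new_probs, old_probs, actions, advantages):
        total += new_p[a] / old_p[a] * adv
    return total / len(actions)

def mean_kl(new_probs, old_probs):
    """Sample estimate of the average KL-divergence over the visited states."""
    pairs = list(zip(new_probs, old_probs))
    return sum(kl_categorical(n, o) for n, o in pairs) / len(pairs)
```

When the new and old probabilities coincide, every ratio is one and every per-state KL-divergence is zero, so the surrogate reduces to the plain average of the advantage estimates.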

.. admonition:: You Should Know

    The objective and constraint are both zero when :math:`\theta = \theta_k`. Furthermore, the gradient of the constraint with respect to :math:`\theta` is zero when :math:`\theta = \theta_k`. Proving these facts requires some subtle command of the relevant math---it's an exercise worth doing, whenever you feel ready!


The theoretical TRPO update isn't the easiest to work with, so TRPO makes some approximations to get an answer quickly. We Taylor expand the objective and constraint to leading order around :math:`\theta_k`:

.. math:: 

    {\mathcal L}(\theta_k, \theta) &\approx g^T (\theta - \theta_k) \\
    \bar{D}_{KL}(\theta || \theta_k) & \approx \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k)

resulting in an approximate optimization problem,

.. math:: 
    
    \theta_{k+1} = \arg \max_{\theta} \; & g^T (\theta - \theta_k) \\
    \text{s.t.} \; & \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \leq \delta.

.. admonition:: You Should Know

    By happy coincidence, the gradient :math:`g` of the surrogate advantage function with respect to :math:`\theta`, evaluated at :math:`\theta = \theta_k`, is exactly equal to the policy gradient, :math:`\nabla_{\theta} J(\pi_{\theta})`! Try proving this, if you feel comfortable diving into the math.

This approximate problem can be analytically solved by the methods of Lagrangian duality [1]_, yielding the solution:

.. math::

    \theta_{k+1} = \theta_k + \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g.

If we were to stop here, and just use this final result, the algorithm would be exactly calculating the `Natural Policy Gradient`_. A problem is that, due to the approximation errors introduced by the Taylor expansion, this may not satisfy the KL constraint, or actually improve the surrogate advantage. TRPO adds a modification to this update rule: a backtracking line search,

.. math::

    \theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g,

where :math:`\alpha \in (0,1)` is the backtracking coefficient, and :math:`j` is the smallest nonnegative integer such that :math:`\pi_{\theta_{k+1}}` satisfies the KL constraint and produces a positive surrogate advantage. 
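
The backtracking rule can be sketched as a short loop. This is an illustrative version under stated assumptions: parameters are flat lists of floats, ``surrogate_fn`` and ``kl_fn`` are hypothetical callables that evaluate the sample surrogate advantage and sample KL-divergence at a candidate parameter vector, and keeping the old parameters when no candidate passes is a choice made for this sketch.

```python
def backtracking_line_search(theta_k, full_step, surrogate_fn, kl_fn,
                             delta, alpha=0.8, max_backtracks=10):
    """Return (theta_new, j) for the smallest j passing both checks, else (theta_k, None)."""
    for j in range(max_backtracks + 1):
        scale = alpha ** j
        # Candidate: shrink the proposed step by alpha^j.
        theta_new = [t + scale * s for t, s in zip(theta_k, full_step)]
        # Accept only if the KL constraint holds and the surrogate advantage is positive.
        if kl_fn(theta_new) <= delta and surrogate_fn(theta_new) > 0.0:
            return theta_new, j
    # No candidate was acceptable: fall back to the old parameters (assumption).
    return list(theta_k), None
```

Here ``full_step`` plays the role of the scaled natural gradient step from the previous equation.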

Lastly: computing and storing the matrix inverse, :math:`H^{-1}`, is painfully expensive when dealing with neural network policies with thousands or millions of parameters. TRPO sidesteps the issue by using the `conjugate gradient`_ algorithm to solve :math:`Hx = g` for :math:`x = H^{-1} g`, requiring only a function which can compute the matrix-vector product :math:`Hx` instead of computing and storing the whole matrix :math:`H` directly. This is not too hard to do: we set up a symbolic operation to calculate

.. math::

    Hx = \nabla_{\theta} \left( \left(\nabla_{\theta} \bar{D}_{KL}(\theta || \theta_k)\right)^T x \right),

which gives us the correct output without computing the whole matrix.
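
A minimal conjugate gradient routine shows why only the product :math:`Hx` is needed. The sketch below is plain Python over lists of floats and is not the Spinning Up code: ``hvp`` is an assumed callable returning :math:`Hv` for a vector :math:`v` (in practice it would come from automatic differentiation as described above), and the helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only matrix-vector products hvp(v) = H v."""
    x = [0.0] * len(g)
    r = list(g)            # residual g - H x (x starts at zero)
    p = list(g)            # current search direction
    r_dot = dot(r, r)
    for _ in range(iters):
        if r_dot < tol:    # residual is already negligible
            break
        hp = hvp(p)
        step = r_dot / dot(p, hp)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * hpi for ri, hpi in zip(r, hp)]
        new_r_dot = dot(r, r)
        p = [ri + (new_r_dot / r_dot) * pi for ri, pi in zip(r, p)]
        r_dot = new_r_dot
    return x

def natural_gradient_step(hvp, g, delta, iters=10):
    """Scale x = H^{-1} g so that the quadratic KL approximation equals delta."""
    x = conjugate_gradient(hvp, g, iters)
    return [math.sqrt(2.0 * delta / dot(x, hvp(x))) * xi for xi in x]
```

For a symmetric positive-definite :math:`H`, the scaled step satisfies the approximate constraint with equality, which is the starting point for the backtracking line search.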

.. [1] See `Convex Optimization`_ by Boyd and Vandenberghe, especially chapters 2 through 5.

.. _`Convex Optimization`: http://stanford.edu/~boyd/cvxbook/
.. _`Natural Policy Gradient`: https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf
.. _`conjugate gradient`: https://en.wikipedia.org/wiki/Conjugate_gradient_method


Exploration vs. Exploitation
----------------------------

TRPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.


Pseudocode
----------

.. math::
    :nowrap:

    \begin{algorithm}[H]
        \caption{Trust Region Policy Optimization}
        \label{alg1}
    \begin{algorithmic}[1]
        \STATE Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$
        \STATE Hyperparameters: KL-divergence limit $\delta$, backtracking coefficient $\alpha$, maximum number of backtracking steps $K$
        \FOR{$k = 0,1,2,...$} 
        \STATE Collect set of trajectories ${\mathcal D}_k = \{\tau_i\}$ by running policy $\pi_k = \pi(\theta_k)$ in the environment.
        \STATE Compute rewards-to-go $\hat{R}_t$.
        \STATE Compute advantage estimates, $\hat{A}_t$ (using any method of advantage estimation) based on the current value function $V_{\phi_k}$.
        \STATE Estimate policy gradient as
            \begin{equation*}
            \hat{g}_k = \frac{1}{|{\mathcal D}_k|} \sum_{\tau \in {\mathcal D}_k} \sum_{t=0}^T \left. \nabla_{\theta} \log\pi_{\theta}(a_t|s_t)\right|_{\theta_k} \hat{A}_t.
            \end{equation*}
        \STATE Use the conjugate gradient algorithm to compute
            \begin{equation*}
            \hat{x}_k \approx \hat{H}_k^{-1} \hat{g}_k,
            \end{equation*}
            where $\hat{H}_k$ is the Hessian of the sample average KL-divergence.
        \STATE Update the policy by backtracking line search with
            \begin{equation*}
            \theta_{k+1} = \theta_k + \alpha^j \sqrt{ \frac{2\delta}{\hat{x}_k^T \hat{H}_k \hat{x}_k}} \hat{x}_k,
            \end{equation*}
            where $j \in \{0, 1, 2, ... K\}$ is the smallest value which improves the sample loss and satisfies the sample KL-divergence constraint.
        \STATE Fit value function by regression on mean-squared error:
            \begin{equation*}
            \phi_{k+1} = \arg \min_{\phi} \frac{1}{|{\mathcal D}_k| T} \sum_{\tau \in {\mathcal D}_k} \sum_{t=0}^T\left( V_{\phi} (s_t) - \hat{R}_t \right)^2,
            \end{equation*}
            typically via some gradient descent algorithm.
        \ENDFOR
    \end{algorithmic}
    \end{algorithm}



Documentation
=============

.. admonition:: You Should Know

    Spinning Up currently only has a Tensorflow implementation of TRPO. 

.. autofunction:: spinup.trpo_tf1


Saved Model Contents
--------------------

The computation graph saved by the logger includes:

========  ====================================================================
Key       Value
========  ====================================================================
``x``     Tensorflow placeholder for state input.
``pi``    Samples an action from the agent, conditioned on states in ``x``.
``v``     Gives value estimate for states in ``x``. 
========  ====================================================================

This saved model can be accessed either by

* running the trained policy with the `test_policy.py`_ tool,
* or loading the whole saved graph into a program with `restore_tf_graph`_. 

.. _`test_policy.py`: ../user/saving_and_loading.html#loading-and-running-trained-policies
.. _`restore_tf_graph`: ../utils/logger.html#spinup.utils.logx.restore_tf_graph

References
==========

Relevant Papers
---------------

- `Trust Region Policy Optimization`_, Schulman et al. 2015
- `High Dimensional Continuous Control Using Generalized Advantage Estimation`_, Schulman et al. 2016
- `Approximately Optimal Approximate Reinforcement Learning`_, Kakade and Langford 2002

.. _`Trust Region Policy Optimization`: https://arxiv.org/abs/1502.05477
.. _`High Dimensional Continuous Control Using Generalized Advantage Estimation`: https://arxiv.org/abs/1506.02438
.. _`Approximately Optimal Approximate Reinforcement Learning`: https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf

Why These Papers?
-----------------

Schulman 2015 is included because it is the original paper describing TRPO. Schulman 2016 is included because our implementation of TRPO makes use of Generalized Advantage Estimation for computing the policy gradient. Kakade and Langford 2002 is included because it contains theoretical results which motivate and deeply connect to the theoretical foundations of TRPO. 



Other Public Implementations
----------------------------

- Baselines_
- ModularRL_
- rllab_

.. _Baselines: https://github.com/openai/baselines/tree/master/baselines/trpo_mpi
.. _ModularRL: https://github.com/joschu/modular_rl/blob/master/modular_rl/trpo.py
.. _rllab: https://github.com/rll/rllab/blob/master/rllab/algos/trpo.py

================================================
FILE: docs/algorithms/vpg.rst
================================================
=======================
Vanilla Policy Gradient
=======================

.. contents:: Table of Contents


Background
==========

(Previously: `Introduction to RL, Part 3`_)

.. _`Introduction to RL, Part 3`: ../spinningup/rl_intro3.html

The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.

Quick Facts
-----------

* VPG is an on-policy algorithm.
* VPG can be used for environments with either discrete or continuous action spaces.
* The Spinning Up implementation of VPG supports parallelization with MPI.

Key Equations
-------------

Let :math:`\pi_{\theta}` denote a policy with parameters :math:`\theta`, and :math:`J(\pi_{\theta})` denote the expected finite-horizon undiscounted return of the policy. The gradient of :math:`J(\pi_{\theta})` is

.. math:: 
    
    \nabla_{\theta} J(\pi_{\theta}) = \underE{\tau \sim \pi_{\theta}}{
        \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) A^{\pi_{\theta}}(s_t,a_t)
        },

where :math:`\tau` is a trajectory and :math:`A^{\pi_{\theta}}` is the advantage function for the current policy. 

The policy gradient algorithm works by updating policy parameters via stochastic gradient ascent on policy performance:

.. math::

    \theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta_k})

Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise using the finite-horizon undiscounted policy gradient formula. 
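
As a rough illustration of the update rule, consider the simplest possible case: a state-free softmax policy over a few discrete actions, parameterized directly by its logits. This setup is an assumption made for the sketch, as are all function names; for such a policy the gradient of :math:`\log \pi_{\theta}(a)` with respect to the logits is the one-hot vector for :math:`a` minus the action probabilities.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) with respect to the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def policy_gradient_step(logits, actions, advantages, lr):
    """One gradient-ascent step on the sample-average policy gradient estimate."""
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        for i, gi in enumerate(grad_log_prob(logits, a)):
            grad[i] += gi * adv / len(actions)
    return [z + lr * gi for z, gi in zip(logits, grad)]
```

An action sampled with a positive advantage estimate has its probability pushed up, and one with a negative estimate has it pushed down.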

Exploration vs. Exploitation
----------------------------

VPG trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.


Pseudocode
----------

.. math::
    :nowrap:

    \begin{algorithm}[H]
        \caption{Vanilla Policy Gradient Algorithm}
        \label{alg1}
    \begin{algorithmic}[1]
        \STATE Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$
        \FOR{$k = 0,1,2,...$} 
        \STATE Collect set of trajectories ${\mathcal D}_k = \{\tau_i\}$ by running policy $\pi_k = \pi(\theta_k)$ in the environment.
        \STATE Compute rewards-to-go $\hat{R}_t$.
        \STATE Compute advantage estimates, $\hat{A}_t$ (using any method of advantage estimation) based on the current value function $V_{\phi_k}$.
        \STATE Estimate policy gradient as
            \begin{equation*}
            \hat{g}_k = \frac{1}{|{\mathcal D}_k|} \sum_{\tau \in {\mathcal D}_k} \sum_{t=0}^T \left. \nabla_{\theta} \log\pi_{\theta}(a_t|s_t)\right|_{\theta_k} \hat{A}_t.
            \end{equation*}
        \STATE Compute policy update, either using standard gradient ascent,
            \begin{equation*}
            \theta_{k+1} = \theta_k + \alpha_k \hat{g}_k,
            \end{equation*}
            or via another gradient ascent algorithm like Adam.
        \STATE Fit value function by regression on mean-squared error:
            \begin{equation*}
            \phi_{k+1} = \arg \min_{\phi} \frac{1}{|{\mathcal D}_k| T} \sum_{\tau \in {\mathcal D}_k} \sum_{t=0}^T\left( V_{\phi} (s_t) - \hat{R}_t \right)^2,
            \end{equation*}
            typically via some gradient descent algorithm.
        \ENDFOR
    \end{algorithmic}
    \end{algorithm}
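
The rewards-to-go step in the pseudocode can be computed with a single backward pass over each trajectory. The sketch below is a simplified stand-in: the ``gamma`` argument (with ``gamma=1.0`` giving the undiscounted sum) and the baseline-subtraction advantage estimate are assumptions chosen for brevity, and the function names are hypothetical.

```python
def rewards_to_go(rewards, gamma=1.0):
    """R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for one trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def baseline_advantages(rewards, values, gamma=1.0):
    """A simple advantage estimate: rewards-to-go minus the value baseline."""
    return [r - v for r, v in zip(rewards_to_go(rewards, gamma), values)]
```

Any other advantage estimator can be substituted for ``baseline_advantages`` without changing the rest of the loop.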


Documentation
=============

.. admonition:: You Should Know

    In what follows, we give documentation for the PyTorch and Tensorflow implementations of VPG in Spinning Up. They have nearly identical function calls and docstrings, except for details relating to model construction. However, we include both full docstrings for completeness.


Documentation: PyTorch Version
------------------------------

.. autofunction:: spinup.vpg_pytorch

Saved Model Contents: PyTorch Version
-------------------------------------

The PyTorch saved model can be loaded with ``ac = torch.load('path/to/model.pt')``, yielding an actor-critic object (``ac``) that has the properties described in the docstring for ``vpg_pytorch``. 

You can get actions from this model with

.. code-block:: python

    actions = ac.act(torch.as_tensor(obs, dtype=torch.float32))


Documentation: Tensorflow Version
---------------------------------

.. autofunction:: spinup.vpg_tf1

Saved Model Contents: Tensorflow Version
----------------------------------------

The computation graph saved by the logger includes:

========  ====================================================================
Key       Value
========  ====================================================================
``x``     Tensorflow placeholder for state input.
``pi``    Samples an action from the agent, conditioned on states in ``x``.
``v``     Gives value estimate for states in ``x``. 
========  ====================================================================

This saved model can be accessed either by

* running the trained policy with the `test_policy.py`_ tool,
* or loading the whole saved graph into a program with `restore_tf_graph`_. 

.. _`test_policy.py`: ../user/saving_and_loading.html#loading-and-running-trained-policies
.. _`restore_tf_graph`: ../utils/logger.html#spinup.utils.logx.restore_tf_graph

References
==========

Relevant Papers
---------------

- `Policy Gradient Methods for Reinforcement Learning with Function Approximation`_, Sutton et al. 2000
- `Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs`_, Schulman 2016(a)
- `Benchmarking Deep Reinforcement Learning for Continuous Control`_, Duan et al. 2016
- `High Dimensional Continuous Control Using Generalized Advantage Estimation`_, Schulman et al. 2016(b)

.. _`Policy Gradient Methods for Reinforcement Learning with Function Approximation`: https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf
.. _`Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs`: http://joschu.net/docs/thesis.pdf
.. _`Benchmarking Deep Reinforcement Learning for Continuous Control`: https://arxiv.org/abs/1604.06778
.. _`High Dimensional Continuous Control Using Generalized Advantage Estimation`: https://arxiv.org/abs/1506.02438

Why These Papers?
-----------------

Sutton 2000 is included because it is a timeless classic of reinforcement learning theory, and contains references to the earlier work which led to modern policy gradients. Schulman 2016(a) is included because Chapter 2 contains a lucid introduction to the theory of policy gradient algorithms, including pseudocode. Duan 2016 is a clear, recent benchmark paper that shows how vanilla policy gradient in the deep RL setting (e.g., with neural network policies and Adam as the optimizer) compares with other deep RL algorithms. Schulman 2016(b) is included because our implementation of VPG makes use of Generalized Advantage Estimation for computing the policy gradient.


Other Public Implementations
----------------------------

- rllab_
- `rllib (Ray)`_

.. _rllab: https://github.com/rll/rllab/blob/master/rllab/algos/vpg.py
.. _`rllib (Ray)`: https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/pg


================================================
FILE: docs/conf.py
================================================
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Spinning Up documentation build configuration file, created by
# sphinx-quickstart on Wed Aug 15 04:21:07 2018.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys

# Make sure spinup is accessible without going through setup.py
dirname = os.path.dirname
sys.path.insert(0, dirname(dirname(__file__)))

# Mock mpi4py to get around having to install it on RTD server (which fails)
# Also to mock PyTorch, because it is too large for the RTD server to download
from unittest.mock import MagicMock

class Mock(MagicMock):
    @classmethod
    def __getattr__(cls, name):
        return MagicMock()

MOCK_MODULES = ['mpi4py', 
                'torch', 
                'torch.optim', 
                'torch.nn',
                'torch.distributions',
                'torch.distributions.normal',
                'torch.distributions.categorical',
                'torch.nn.functional',
                ]
sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)

# Finish imports
import spinup
from recommonmark.parser import CommonMarkParser


source_parsers = {
    '.md': CommonMarkParser,
}


# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.imgmath',
    'sphinx.ext.viewcode',
    'sphinx.ext.autodoc',
    'sphinx.ext.napoleon']

#'sphinx.ext.mathjax', ??

# imgmath settings
imgmath_image_format = 'svg'
imgmath_font_size = 14

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix(es) of source filenames.
# You can specify multiple suffixes as a list of strings:
#
source_suffix = ['.rst', '.md']
# source_suffix = '.rst'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = 'Spinning Up'
copyright = '2018, OpenAI'
author = 'Joshua Achiam'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = ''
# The full version, including alpha/beta/rc tags.
release = ''

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# These patterns also affect html_static_path and html_extra_path
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'default' #'sphinx'

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False


# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages.  See the documentation for
# a list of builtin themes.
#
# html_theme = 'alabaster'
html_theme = "sphinx_rtd_theme"

# Theme options are theme-specific and customize the look and feel of a theme
# further.  For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

html_logo = 'images/spinning-up-logo2.png'
html_theme_options = {
    'logo_only': True
}
#html_favicon = 'openai-favicon2_32x32.ico'
html_favicon = 'openai_icon.ico'

# -- Options for HTMLHelp output ------------------------------------------

# Output file base name for HTML help builder.
htmlhelp_basename = 'SpinningUpdoc'

# -- Options for LaTeX output ---------------------------------------------


imgmath_latex_preamble = r'''
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{amsmath}
\usepackage{cancel}

\usepackage[verbose=true,letterpaper]{geometry}
\geometry{
    textheight=12in,
    textwidth=6.5in,
    top=1in,
    headheight=12pt,
    headsep=25pt,
    footskip=30pt
    }

\newcommand{\E}{{\mathrm E}}

\newcommand{\underE}[2]{\underset{\begin{subarray}{c}#1 \end{subarray}}{\E}\left[ #2 \right]}

\newcommand{\Epi}[1]{\underset{\begin{subarray}{c}\tau \sim \pi \end{subarray}}{\E}\left[ #1 \right]}
'''

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #
    'preamble': r'''
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{amsmath}
\usepackage{cancel}


\newcommand{\E}{{\mathrm E}}

\newcommand{\underE}[2]{\underset{\begin{subarray}{c}#1 \end{subarray}}{\E}\left[ #2 \right]}

\newcommand{\Epi}[1]{\underset{\begin{subarray}{c}\tau \sim \pi \end{subarray}}{\E}\left[ #1 \right]}
''',

    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'SpinningUp.tex', 'Spinning Up Documentation',
     'Joshua Achiam', 'manual'),
]


# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    (master_doc, 'spinningup', 'Spinning Up Documentation',
     [author], 1)
]


# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    (master_doc, 'SpinningUp', 'Spinning Up Documentation',
     author, 'SpinningUp', 'One line description of project.',
     'Miscellaneous'),
]


def setup(app):
    app.add_stylesheet('css/modify.css')

================================================
FILE: docs/docs_requirements.txt
================================================
cloudpickle~=1.2.1
gym~=0.15.3
ipython
joblib
matplotlib
numpy
pandas
pytest
psutil
scipy
seaborn==0.8.1
sphinx==1.5.6
sphinx-autobuild==0.7.1       
sphinx-rtd-theme==0.4.1 
tensorflow>=1.8.0,<2.0
tqdm

================================================
FILE: docs/etc/acknowledgements.rst
================================================
================
Acknowledgements
================

We gratefully acknowledge the contributions of the many people who helped get this project off the ground, including people who beta tested the software, gave feedback on the material, improved dependencies of Spinning Up code in service of this release, or otherwise supported the project. Given the number of people who were involved at various points, this list of names may not be exhaustive. (If you think you should have been listed here, please do not hesitate to reach out.)

In no particular order, thank you Alex Ray, Amanda Askell, Ben Garfinkel, Christy Dennison, Coline Devin, Daniel Zeigler, Dylan Hadfield-Menell, Ge Yang, Greg Khan, Jack Clark, Jonas Rothfuss, Larissa Schiavo, Leandro Castelao, Lilian Weng, Maddie Hall, Matthias Plappert, Miles Brundage, Peter Zokhov, and Pieter Abbeel. 

We are also grateful to Pieter Abbeel's group at Berkeley, and the Center for Human-Compatible AI, for giving feedback on presentations about Spinning Up.

================================================
FILE: docs/etc/author.rst
================================================
================
About the Author
================

Spinning Up in Deep RL was primarily developed by Josh Achiam, a research scientist on the OpenAI Safety Team and PhD student at UC Berkeley advised by Pieter Abbeel. Josh studies topics related to safety in deep reinforcement learning, and has previously published work on `safe exploration`_. 

.. _`safe exploration`: https://arxiv.org/abs/1705.10528

================================================
FILE: docs/images/rl_algorithms.xml
================================================
<mxfile userAgent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0" version="9.1.2" editor="www.draw.io" type="device"><diagram name="Page-1" id="8ce9d11a-91a2-4d17-14d8-a56ed91bf033">7Z3bdps4FIafxpfpQuLoS+fYWaudpklXZzp3CpZtphh5sNIkffoRRtiAwCFGknGr9KJBCCH4vy1tbUlkZF8sn29StFp8JFMcj6A1fR7ZlyMIge0B9l+W8pKn+K6TJ8zTaMoz7RLuo5+YJ1o89TGa4nUlIyUkptGqmhiSJMEhraShNCVP1WwzElfvukJzLCTchygWU/+KpnRR1A66uxPvcTRf8Ft7xYkHFH6fp+Qx4fcbQXu2+clPL1FRFn/Q9QJNyVMpyb4a2RcpITT/bfl8gePs3RavLb/uuuXstt4pTminC3yMPM/HwA0hq2p4Bse8YvSleBl4yt4NPyQpXZA5SVB8tUs93zwwzooE7GhBlzH/NUYPOD7fvpMLEpOUnUpIkl22piilk0yuWtp1FGclWMUxB8RlxziZFleEMVqvo/DLIkryE/wykB+VLvoXU/rCj9EjJSxp9yAfCFnxq9Y0Jd9xUUumnbX52Z4pWMjyzkhCr9EyijPEv+J0ihLEk/mdAn7YVB5+jujf2SO+c/nRN16uqCAXdU0e05BLAvOkTJlSHi7xDSZLTNMXliHFMaLRjyrViBvHfJtve+ktidhdocUN2QXeOw42N+TAqpaRV4pfVubs1ZIArBXFhJ5jKhTFfik90i5pA3JHqG3LQK0D6oRJVKI6O/zGH7kV61z2PMk+FtYCjF259myhKF8j18BwfQpcC8L5Jwe6PQZHBJ2/rx8ofsSFY+XFlGtWsQDvv0dSnDhbb8SdsAzAWz3vTrLf5tn/G8/17DrFmGW5+1AUyuqXl5vnEoysakJPi4ji+xXadM5PzCGumtXWxbM62FgN6ak/fmhDmtlPKedshr0w7IQ6gHtYR3E0TzJ7ZTzjdB/gP3BK8fNefAt2xEbS4ew8lbxrl6ctyo611c58BbA9NMH+8LhN8GS8WJN4TtKILpZrA48SeByn3vDoZMdWxE7e8JyjNQPBtDzK4HGd8TFbHse4Z8MfSx/LOXOsoMYmdN0DnTOr3sFuG0n5zplroD6FMUegh+IO6HWl2IF2vaixMoo9Q/EpUKypLZZIMfCFZl0dxaCfh2FVMd6qBSpavXsLW51UBZri12Bc18LuGP9gxodeStlWWYb1m+7kWTVx8zIPlrpfv1uXutVnUgpB2e/yG7iwlHDRX0wHSBazX/dzEmKOh6qlbMP0f30tARiomK5kw/R1G6Yliin6foLfd7Cn57jZvy0JNT0PQcMJepLQNZji67az05fG1yVNYKR5ozSeLmnyIqQHsG9JHIVZjT+taLSMfrJak8REsZVEsYEYGPF0zp/Zx7duqQ5OINojHIaD44D6ZJfsYWTPOYmTEBOoiX32F1P2MLJnLP4kxITDiAk0iCnbMjUNPZQLpia411+wvWNF8fqg3ukW1yuI4/r9Vl1LiSEM2IFuoEzbsLNY8Wyk6SyNrmFn0G/56+8oja5hJ7+39GHn57MPGKVJlMzNWFPTWk2tY80i0CwdnC93t58Gh4wXBvhh1gWZKcLBbKjIOHXPeNxADGwgpj7HflB0ol9Y2Kzk0LLITtPOnsD2661X3Wnvuo7Dt6G+NXWwXwDdQKxrg5qaMGEDfPZ+9LpSHAgLoOEYKKPY7B0+DYrVhNTUUexZdRdDIcX2AJYASI3RFXJXEHCVIPDWIJ0H68LKnu+wB7BqQL2azkDVlD3h4cDfQE1nGHORDWrKtk3+8NIH+5sYUfZWF9m+Xv55moGN/X+NcJFbBIKOESxydIfmlbcGTc7aMCazxZG19J5aVejvhr0U0xroaA18T4i
Z6GwPCjdIOkF/wMngiPk1YseeN642KjpDx8UgSDov919vDC96eGnazq+MF1WzmpN4tUD/4JQYapRQEwTWEalRtLbHL7uptbigLwQG5fqpDUsL5PupnYcBqhzHj+cfr41FamnHHZ0W6TR91aemZybGquWptt+dRQ9FdqvlnbS+AMcVxlNdXwGo7yc/aKGFqr7swgXGZrrYzJbC142mWICqc1mFqtjc57uzy89/GkTUIAIsnYyo6ngNIPIB0QIEVATE+6s7A4SiFkNnr+I0Reg0e17CcsXujpeMVzBWZCK3Zn1rRxNxWuAaRqcKVDleE8jubWX1gNYkq4eBRQUsOpvT4jta6nb336RoGuENHoYWmbRo6W2b5jc097asEi3zxnq6W8dTZCL3E9OEthpFA0oVO3GH0986qr6Q8uXSNoCoAUTrkEVVoPTy8tbMEssGRAcRvrr5pq9XhogDiXjjfJPOdSZ+k5te07O0N4UrUdavdWd4216Q3VRul13j4uaQvZbVtH6rSOu5ZURcdCx8daPzh5iB8AEPt1aUxA94dJhSbPxUwBtVHgkLUA9T2X99WKJU5Ve/rTIEldnh7i9x5tl3f+7Uvvof</diagram></mxfile>

================================================
FILE: docs/images/rl_algorithms_9_15.xml
================================================
<mxfile userAgent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0" version="9.1.3" editor="www.draw.io" type="device"><diagram name="Page-1" id="8ce9d11a-91a2-4d17-14d8-a56ed91bf033">7Z1bd6MqFMc/TR47S/Cax/Q6Z62ZM5121lzOG1WSeMZIjrHTdj79wQiJikltBWI6tA+NiAT9/zZsNmBH9tni8SpDy/lHEuFkBK3ocWSfjyAEtgfonyLlqUzxXadMmGVxxDJtE27j35glWiz1Po7wqpYxJyTJ42U9MSRpisO8loayjDzUs01JUv/WJZphIeE2RImY+i2O8jmvHXS3J97jeDZnX+3xE3co/DnLyH3Kvm8E7en6pzy9QLwsdqOrOYrIQyXJvhjZZxkheflp8XiGk+LZ8sdWXne54+ym3hlO804X+Bh5no+BG0Ja1fAEjlnF8if+MHBEnw07JFk+JzOSouRim3q6vmFcFAno0TxfJOxjgu5wcrp5JmckIRk9lZK0uGyVoyyfFHI10i7jpCjB4scMEJce4zTiV4QJWq3i8Ms8TssT7DJQHlUu+hfn+RM7Rvc5oUnbG/lAyJJdtcoz8hPzWlLtrPXP5gxnocg7JWl+iRZxUiD+FWcRShFLZt8UsMO28vBjnH8vbvGdy45+sHJFBZmoK3KfhUwSWCYVylTyMImvMFngPHuiGTKcoDz+VacaMeOYbfJtLr0mMf1WaDFDdoH3joHNDDmw6mWUlWKXVTl7tiQAG0VRoWc4F4qiHyq3tE1ag9wRatsyUOuAOqUSVaguDn+wW96JdSl7mWQfCmsBxq5ce7ZQlK+Ra2C4PgauBeH8owPdHoMDgs6e1y+U3GPuWHlJzjSrWYD33z3hJ05Wa3EnNAPwlo/bk/TTrPi79lxPLjOMaZabD7xQWr+y3DKXYGR1E3qYxzm+XaJ15/xAHeK6WW1cPKuDjTWQjvzx3S6kqf1Uck6n2AvDTqgDuId1lMSztLBXyjPO9gH+C2c5ftyLL2dHbCQdxs5Dxbt2Wdq86lhbu5mvAbaHJtgfHrcNnoIXa5LMSBbn88XKwKMEHsdpNjw62bEVsVM2PKdoRUEwLY8yeFxnfMiWxzHu2fDH0odyzhwraLAJXfeVzpnV7GA3jaR858w1UB/DmCPQQ3EH9LpS7EC7WdRYGcWeofgYKNbUFkukGPhCs66OYtDPw7DqGG/UAjWt3r2ErU6qAk3xazBuamF3jH9Q40NPlWzLIsPqRd/kWQ1xyzJfLXW/frcp9U6fSSkEVb/Lb+HCUsJFfzEdIFnMft3PUYg5HqqWsg3Tf/taAjBQMV3JhunrNkxLFFP0/QS/79WenuMWvz1IcKSPK7oGU3zddnZ00kh3ljtLExhp9kvjHUyasgjpAexrksRhUeNPyzxexL9prUlqothKothADIx4OufP7MNbt1QHJxANFA7DwXFAc7JL9jCy55zEUYgJ1MQ++4spexjZMxZ/FGLCYcQEWsSUbZmahh7KBVMT3Osv2N6xonh90Ox0+fUK4rh+v1XXUmIIw3GgW6A63LCTr3g20uyS5mDDzqDf8tc/QJqDDTtZZaQPOz+ffMAoS+N0ZsaamtZqah1r8kCzdHC+3Fx/GhwyXhjgu2kXZCKEg+lQkXGanvG4hRjYQkxzjv1V0Yl+YWGzkkPLIjtNO3sC22+2Xk2nves6Dt+G+tbUwX4BdAOxrg1qasKELfDZ+9HrSnEgLICGY6CMYrN3+DgoVhNSU0exZzVdDIUU2wNYAiA1RsflriHgKkHgpUE6DzaFlT3fYQ9g1YB6NZ2Bqil7wsOBf4CazjDmIlvUlG2b7OalD/bXMaLiqc6Lfb3s9TQDG/u/jXCRywNBhwgWObpD88pbgzZnbRiT2eLIWnpPrSr0d0UfimkNdLQGvifETHS
2B9wNkk7QX3AyOGLeRuzY88b1RkVn6JgPgqTz8o1kScTbmuG9GOKNotO2s18ZOqomOCfJco7+wRkx1CihJgisA1KjaJmPX/VYGyFCX4gRynVZW5YdyHdZO48IVPmQH08/XhqL1NKOOzot0ml7wU9Dz0KMZfcb3byVFt3xEqy9D8BxhaFV10cAmlvLX7XmQlVfduYCYzNdbGZD4fNGw9ei6lxhoSpM9/nm5Pzz3wYRNYgASycjqjpeA4h8QLQAARUB8f7ixgChqMXQ2as4bcE6zZ6XsHKxu+Ml4xGMFZnItVnq2tFEnB0kDaNTBaocrwmk320V9YDWpKiHgUUFLDqbU/5KLXUb/a8yFMV4jYehRSYtWnrbtqkOzb0trcSOKWQ93S3fDSXdRG4npgntZhTuDpaG0d86ql6W8uXcNoCoAUTrkEVVoPT8/PrKACIZEB1E+Ormm75eGCL0zDfpXHLit7npDT0r21SYElX9dm4S37UtZO9U7qu2hVQeVNtSLp7Wc/eIuP5YeAFH53cyA+FdHm6jKInv8ugwpdj61oAXqjwS1qJ2Upk3WYNR+dnXrAxBZXq4/aecZfbtfz61L/4H</diagram></mxfile>

================================================
FILE: docs/index.rst
================================================
.. Spinning Up documentation master file, created by
   sphinx-quickstart on Wed Aug 15 04:21:07 2018.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to Spinning Up in Deep RL!
==================================

.. image:: images/spinning-up-in-rl.png

.. toctree::
   :maxdepth: 2
   :caption: User Documentation

   user/introduction
   user/installation
   user/algorithms
   user/running
   user/saving_and_loading
   user/plotting

.. toctree::
   :maxdepth: 2
   :caption: Introduction to RL

   spinningup/rl_intro
   spinningup/rl_intro2
   spinningup/rl_intro3

.. toctree::
   :maxdepth: 2
   :caption: Resources

   spinningup/spinningup
   spinningup/keypapers
   spinningup/exercises
   spinningup/bench

.. toctree::
   :maxdepth: 2
   :caption: Algorithms Docs

   algorithms/vpg
   algorithms/trpo
   algorithms/ppo
   algorithms/ddpg
   algorithms/td3
   algorithms/sac

.. toctree::
   :maxdepth: 2
   :caption: Utilities Docs

   utils/logger
   utils/plotter
   utils/mpi
   utils/run_utils

.. toctree::
   :maxdepth: 2
   :caption: Etc.

   etc/acknowledgements
   etc/author

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


================================================
FILE: docs/make.bat
================================================
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
set SPHINXPROJ=SpinningUp

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%

:end
popd


================================================
FILE: docs/spinningup/bench.rst
================================================
==========================================
Benchmarks for Spinning Up Implementations
==========================================

.. contents:: Table of Contents

We benchmarked the Spinning Up algorithm implementations in five environments from the MuJoCo_ Gym task suite: HalfCheetah, Hopper, Walker2d, Swimmer, and Ant.

.. _MuJoCo: https://gym.openai.com/envs/#mujoco

Performance in Each Environment
===============================

HalfCheetah: PyTorch Versions
-----------------------------

.. figure:: ../images/plots/pyt/pytorch_halfcheetah_performance.svg
    :align: center

    3M timestep benchmark for HalfCheetah-v3 using **PyTorch** implementations.


HalfCheetah: Tensorflow Versions
--------------------------------

.. figure:: ../images/plots/tf1/tensorflow_halfcheetah_performance.svg
    :align: center

    3M timestep benchmark for HalfCheetah-v3 using **Tensorflow** implementations.



Hopper: PyTorch Versions
------------------------

.. figure:: ../images/plots/pyt/pytorch_hopper_performance.svg
    :align: center

    3M timestep benchmark for Hopper-v3 using **PyTorch** implementations.


Hopper: Tensorflow Versions
---------------------------

.. figure:: ../images/plots/tf1/tensorflow_hopper_performance.svg
    :align: center

    3M timestep benchmark for Hopper-v3 using **Tensorflow** implementations.




Walker2d: PyTorch Versions
--------------------------

.. figure:: ../images/plots/pyt/pytorch_walker2d_performance.svg
    :align: center

    3M timestep benchmark for Walker2d-v3 using **PyTorch** implementations.


Walker2d: Tensorflow Versions
-----------------------------

.. figure:: ../images/plots/tf1/tensorflow_walker2d_performance.svg
    :align: center

    3M timestep benchmark for Walker2d-v3 using **Tensorflow** implementations.



Swimmer: PyTorch Versions
-------------------------

.. figure:: ../images/plots/pyt/pytorch_swimmer_performance.svg
    :align: center

    3M timestep benchmark for Swimmer-v3 using **PyTorch** implementations.


Swimmer: Tensorflow Versions
----------------------------

.. figure:: ../images/plots/tf1/tensorflow_swimmer_performance.svg
    :align: center

    3M timestep benchmark for Swimmer-v3 using **Tensorflow** implementations.



Ant: PyTorch Versions
------------------------

.. figure:: ../images/plots/pyt/pytorch_ant_performance.svg
    :align: center

    3M timestep benchmark for Ant-v3 using **PyTorch** implementations.


Ant: Tensorflow Versions
---------------------------

.. figure:: ../images/plots/tf1/tensorflow_ant_performance.svg
    :align: center

    3M timestep benchmark for Ant-v3 using **Tensorflow** implementations.


Experiment Details
==================

**Random seeds.** All experiments were run for 10 random seeds each. Graphs show the average (solid line) and standard deviation (shaded region) of performance across random seeds over the course of training.

**Performance metric.** Performance for the on-policy algorithms is measured as the average trajectory return across the batch collected at each epoch. Performance for the off-policy algorithms is measured once every 10,000 steps by running the deterministic policy (or, in the case of SAC, the mean policy) without action noise for ten trajectories, and reporting the average return over those test trajectories.

**Network architectures.** The on-policy algorithms use networks of size (64, 32) with tanh units for both the policy and the value function. The off-policy algorithms use networks of size (256, 256) with relu units.

**Batch size.** The on-policy algorithms collected 4000 steps of agent-environment interaction per batch update. The off-policy algorithms used minibatches of size 100 at each gradient descent step.

All other hyperparameters are left at default settings for the Spinning Up implementations. See algorithm pages for details.

Learning curves are smoothed by averaging over a window of 11 epochs.
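
As a rough illustration of this kind of smoothing, here is a minimal NumPy sketch of a centered moving average over 11 points. The function name ``smooth`` and the edge handling (averaging only over the points that fall inside the window) are assumptions made for this example, not a statement of how the benchmark plots were produced.

```python
import numpy as np

def smooth(values, window=11):
    """Centered moving average; assumes len(values) >= window."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window)
    # Sum of the values inside each window position.
    sums = np.convolve(values, kernel, mode="same")
    # Number of real data points inside each window (fewer near the edges).
    counts = np.convolve(np.ones_like(values), kernel, mode="same")
    return sums / counts

# Hypothetical per-epoch returns, used only to exercise the function.
returns = np.arange(20.0)
smoothed = smooth(returns)
```

Dividing by the per-position count rather than by the window size keeps the curve from being pulled toward zero at the first and last few epochs.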

.. admonition:: You Should Know

    By comparison to the literature, the Spinning Up implementations of DDPG, TD3, and SAC are roughly at-parity with the best reported results for these algorithms. As a result, you can use the Spinning Up implementations of these algorithms for research purposes.

    The Spinning Up implementations of VPG, TRPO, and PPO are overall a bit weaker than the best reported results for these algorithms. This is due to the absence of some standard tricks (such as observation normalization and normalized value regression targets) from our implementations. For research comparisons, you should use the implementations of TRPO or PPO from `OpenAI Baselines`_.

.. _`OpenAI Baselines`: https://github.com/openai/baselines


PyTorch vs Tensorflow
=====================


We provide graphs for head-to-head comparisons between the PyTorch and Tensorflow implementations of each algorithm at the following pages:

* `VPG Head-to-Head`_

* `PPO Head-to-Head`_

* `DDPG Head-to-Head`_

* `TD3 Head-to-Head`_

* `SAC Head-to-Head`_

.. _`VPG Head-to-Head`: ../spinningup/bench_vpg.html
.. _`PPO Head-to-Head`: ../spinningup/bench_ppo.html
.. _`DDPG Head-to-Head`: ../spinningup/bench_ddpg.html
.. _`TD3 Head-to-Head`: ../spinningup/bench_td3.html
.. _`SAC Head-to-Head`: ../spinningup/bench_sac.html

================================================
FILE: docs/spinningup/bench_ddpg.rst
================================================
DDPG Head-to-Head
=================

HalfCheetah
-----------

.. figure:: ../images/plots/ddpg/ddpg_halfcheetah_performance.svg
    :align: center


Hopper
------

.. figure:: ../images/plots/ddpg/ddpg_hopper_performance.svg
    :align: center


Walker2d
--------

.. figure:: ../images/plots/ddpg/ddpg_walker2d_performance.svg
    :align: center

Swimmer
-------

.. figure:: ../images/plots/ddpg/ddpg_swimmer_performance.svg
    :align: center


Ant
---

.. figure:: ../images/plots/ddpg/ddpg_ant_performance.svg
    :align: center

================================================
FILE: docs/spinningup/bench_ppo.rst
================================================
Proximal Policy Optimization Head-to-Head
=========================================

HalfCheetah
-----------

.. figure:: ../images/plots/ppo/ppo_halfcheetah_performance.svg
    :align: center


Hopper
------

.. figure:: ../images/plots/ppo/ppo_hopper_performance.svg
    :align: center


Walker2d
--------

.. figure:: ../images/plots/ppo/ppo_walker2d_performance.svg
    :align: center

Swimmer
-------

.. figure:: ../images/plots/ppo/ppo_swimmer_performance.svg
    :align: center


Ant
---

.. figure:: ../images/plots/ppo/ppo_ant_performance.svg
    :align: center

================================================
FILE: docs/spinningup/bench_sac.rst
================================================
SAC Head-to-Head
=================

HalfCheetah
-----------

.. figure:: ../images/plots/sac/sac_halfcheetah_performance.svg
    :align: center


Hopper
------

.. figure:: ../images/plots/sac/sac_hopper_performance.svg
    :align: center


Walker2d
--------

.. figure:: ../images/plots/sac/sac_walker2d_performance.svg
    :align: center

Swimmer
-------

.. figure:: ../images/plots/sac/sac_swimmer_performance.svg
    :align: center


Ant
---

.. figure:: ../images/plots/sac/sac_ant_performance.svg
    :align: center

================================================
FILE: docs/spinningup/bench_td3.rst
================================================
TD3 Head-to-Head
=================

HalfCheetah
-----------

.. figure:: ../images/plots/td3/td3_halfcheetah_performance.svg
    :align: center


Hopper
------

.. figure:: ../images/plots/td3/td3_hopper_performance.svg
    :align: center


Walker2d
--------

.. figure:: ../images/plots/td3/td3_walker2d_performance.svg
    :align: center

Swimmer
-------

.. figure:: ../images/plots/td3/td3_swimmer_performance.svg
    :align: center


Ant
---

.. figure:: ../images/plots/td3/td3_ant_performance.svg
    :align: center

================================================
FILE: docs/spinningup/bench_vpg.rst
================================================
Vanilla Policy Gradients Head-to-Head
=====================================

HalfCheetah
-----------

.. figure:: ../images/plots/vpg/vpg_halfcheetah_performance.svg
    :align: center


Hopper
------

.. figure:: ../images/plots/vpg/vpg_hopper_performance.svg
    :align: center


Walker2d
--------

.. figure:: ../images/plots/vpg/vpg_walker2d_performance.svg
    :align: center

Swimmer
-------

.. figure:: ../images/plots/vpg/vpg_swimmer_performance.svg
    :align: center


Ant
---

.. figure:: ../images/plots/vpg/vpg_ant_performance.svg
    :align: center

================================================
FILE: docs/spinningup/exercise2_1_soln.rst
================================================
========================
Solution to Exercise 2.1
========================

.. figure:: ../images/ex2-1_trpo_hopper.png
    :align: center

    Learning curves for TRPO in Hopper-v2 with different values of ``train_v_iters``, averaged over three random seeds.


The difference is quite substantial: with a trained value function, the agent is able to quickly make progress. With an untrained value function, the agent gets stuck early on.

================================================
FILE: docs/spinningup/exercise2_2_soln.rst
================================================
========================
Solution to Exercise 2.2
========================

.. figure:: ../images/ex2-2_ddpg_bug.svg
    :align: center

    Learning curves for DDPG in HalfCheetah-v2 for bugged and non-bugged actor-critic implementations, averaged over three random seeds.


.. admonition:: You Should Know

    This page will give the solution primarily in terms of a detailed analysis of the Tensorflow version of this exercise. However, the problem in the PyTorch version is basically the same and so is its solution.


The Bug in the Code: Tensorflow Version
=======================================

The only difference between the correct actor-critic code,

.. code-block:: python
    :emphasize-lines: 11, 13

    """
    Actor-Critic
    """
    def mlp_actor_critic(x, a, hidden_sizes=(400,300), activation=tf.nn.relu, 
                         output_activation=tf.tanh, action_space=None):
        act_dim = a.shape.as_list()[-1]
        act_limit = action_space.high[0]
        with tf.variable_scope('pi'):
            pi = act_limit * mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation)
        with tf.variable_scope('q'):
            q = tf.squeeze(mlp(tf.concat([x,a], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
        with tf.variable_scope('q', reuse=True):
            q_pi = tf.squeeze(mlp(tf.concat([x,pi], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
        return pi, q, q_pi

and the bugged actor-critic code,

.. code-block:: python
    :emphasize-lines: 11, 13

    """
    Bugged Actor-Critic
    """
    def bugged_mlp_actor_critic(x, a, hidden_sizes=(400,300), activation=tf.nn.relu, 
                                output_activation=tf.tanh, action_space=None):
        act_dim = a.shape.as_list()[-1]
        act_limit = action_space.high[0]
        with tf.variable_scope('pi'):
            pi = act_limit * mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation)
        with tf.variable_scope('q'):
            q = mlp(tf.concat([x,a], axis=-1), list(hidden_sizes)+[1], activation, None)
        with tf.variable_scope('q', reuse=True):
            q_pi = mlp(tf.concat([x,pi], axis=-1), list(hidden_sizes)+[1], activation, None)
        return pi, q, q_pi

is the tensor shape for the Q-functions. The correct version squeezes outputs so that they have shape ``[batch size]``, whereas the bugged version doesn't, resulting in Q-functions with shape ``[batch size, 1]``.


The Bug in the Code: PyTorch Version
====================================

In the PyTorch version of the exercise, the difference is virtually the same. The correct actor-critic code computes a forward pass on the Q-function that squeezes its output:


.. code-block:: python
    :emphasize-lines: 12

    """
    Correct Q-Function
    """
    class MLPQFunction(nn.Module):

        def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
            super().__init__()
            self.q = mlp([obs_dim + act_dim] + list(hidden_sizes) + [1], activation)

        def forward(self, obs, act):
            q = self.q(torch.cat([obs, act], dim=-1))
            return torch.squeeze(q, -1) # Critical to ensure q has right shape.


while the bugged version does not:

.. code-block:: python
    :emphasize-lines: 11

    """
    Bugged Q-Function
    """
    class BuggedMLPQFunction(nn.Module):

        def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
            super().__init__()
            self.q = mlp([obs_dim + act_dim] + list(hidden_sizes) + [1], activation)

        def forward(self, obs, act):
            return self.q(torch.cat([obs, act], dim=-1))

How it Gums Up the Works: Tensorflow Version
============================================

Consider this excerpt from the part of the code that builds the DDPG computation graph:

.. code-block:: python

    # Bellman backup for Q function
    backup = tf.stop_gradient(r_ph + gamma*(1-d_ph)*q_pi_targ)

    # DDPG losses
    pi_loss = -tf.reduce_mean(q_pi)
    q_loss = tf.reduce_mean((q-backup)**2)

This is where the tensor shape issue comes into play. It's important to know that ``r_ph`` and ``d_ph`` have shape ``[batch size]``.

The line that produces the Bellman backup was written with the assumption that it would add together tensors with the same shape. However, this line can **also** add together tensors with different shapes, as long as they're broadcast-compatible. 

Tensors with shapes ``[batch size]`` and ``[batch size, 1]`` are broadcast-compatible, but the behavior is not actually what you might expect! Check out this example:

>>> import tensorflow as tf
>>> import numpy as np
>>> x = tf.constant(np.arange(5))
>>> y = tf.constant(np.arange(5).reshape(-1,1))
>>> z1 = x * y
>>> z2 = x + y
>>> z3 = x + z1
>>> x.shape
TensorShape([Dimension(5)])
>>> y.shape
TensorShape([Dimension(5), Dimension(1)])
>>> z1.shape
TensorShape([Dimension(5), Dimension(5)])
>>> z2.shape
TensorShape([Dimension(5), Dimension(5)])
>>> sess = tf.InteractiveSession()
>>> sess.run(z1)
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16]])
>>> sess.run(z2)
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7],
       [4, 5, 6, 7, 8]])
>>> sess.run(z3)
array([[ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16],
       [ 0,  5, 10, 15, 20]])

Adding or multiplying a shape ``[5]`` tensor by a shape ``[5,1]`` tensor returns a shape ``[5,5]`` tensor!

When you don't squeeze the Q-functions, ``q_pi_targ`` has shape ``[batch size, 1]``, and the backup---and in turn, the whole Q-loss---gets totally messed up. 

Broadcast error 1: ``(1 - d_ph) * q_pi_targ`` becomes a ``[batch size, batch size]`` tensor containing the outer product of the mask with the target network Q-values. 

Broadcast error 2: ``r_ph`` then gets treated as a row vector and added to each row of ``(1 - d_ph) * q_pi_targ`` separately.

Broadcast error 3: ``q_loss`` depends on ``q - backup``, which involves another bad broadcast between ``q`` (shape ``[batch size, 1]``) and ``backup`` (shape ``[batch size, batch size]``). 

To put it mathematically: let :math:`q`, :math:`q'`, :math:`r`, :math:`d` denote vectors containing the q-values, target q-values, rewards, and dones for a given batch, where there are :math:`n` entries in the batch. The correct backup is

.. math::

    z_i = r_i + \gamma (1-d_i) q'_i,

and the correct loss function is 

.. math::
    
    \frac{1}{n} \sum_{i=1}^n (q_i - z_i)^2.

But with these errors, what gets computed is a backup *matrix*,

.. math::

    z_{ij} = r_j + \gamma (1-d_j) q'_i,

and a messed up loss function

.. math::

    \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n (q_i - z_{ij})^2.

If you leave this to run in HalfCheetah long enough, you'll actually see some non-trivial learning progress, because weird details specific to this environment partly cancel out the errors. But almost everywhere else, it fails completely.
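
The backup matrix described above can be reproduced outside of any deep learning framework. The following is a small NumPy sketch, assuming a batch of four random values; the variable names (``q_col``, ``bad_backup``, and so on) are invented for this illustration and do not appear in the Spinning Up code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.99

r = rng.normal(size=n)                 # rewards, shape [n]
d = np.array([0.0, 1.0, 0.0, 0.0])     # done flags, shape [n]
q = rng.normal(size=n)                 # Q-values, shape [n]
q_targ = rng.normal(size=n)            # target Q-values, shape [n]

# Correct version: every tensor has shape [n].
backup = r + gamma * (1 - d) * q_targ
loss = np.mean((q - backup) ** 2)

# Bugged version: the Q-functions keep a trailing axis of size 1.
q_col, q_targ_col = q[:, None], q_targ[:, None]
bad_backup = r + gamma * (1 - d) * q_targ_col    # broadcasts to [n, n]
bad_loss = np.mean((q_col - bad_backup) ** 2)    # mean over n * n entries

# Entry (i, j) combines sample j's reward and done flag
# with sample i's target Q-value.
z = np.array([[r[j] + gamma * (1 - d[j]) * q_targ[i] for j in range(n)]
              for i in range(n)])

print(backup.shape, bad_backup.shape)
print(loss, bad_loss)
```

Only the diagonal of ``bad_backup`` agrees with the correct backup; every off-diagonal entry pairs quantities from two different transitions.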


How it Gums Up the Works: PyTorch Version
=========================================

The PyTorch version suffers from exactly the same broadcasting shenanigans as the Tensorflow version. Check out `this note`_ in the PyTorch documentation about it.


.. figure:: ../images/ex2-2_ddpg_bug_pytorch.png
    :align: center

    Learning curves for DDPG in HalfCheetah-v2 for bugged and non-bugged actor-critic implementations using PyTorch, averaged over three random seeds.



.. _`this note`: https://pytorch.org/docs/stable/notes/broadcasting.html#backwards-compatibility

================================================
FILE: docs/spinningup/exercises.rst
================================================
=========
Exercises
=========


.. contents:: Table of Contents
    :depth: 2

Problem Set 1: Basics of Implementation
---------------------------------------

.. admonition:: Exercise 1.1: Gaussian Log-Likelihood

    **Path to Exercise:** 

    * PyTorch version: ``spinup/exercises/pytorch/problem_set_1/exercise1_1.py``
    
    * Tensorflow version: ``spinup/exercises/tf1/problem_set_1/exercise1_1.py``

    **Path to Solution:** 

    * PyTorch version: ``spinup/exercises/pytorch/problem_set_1_solutions/exercise1_1_soln.py``

    * Tensorflow version: ``spinup/exercises/tf1/problem_set_1_solutions/exercise1_1_soln.py``


    **Instructions.** Write a function that takes in the means and log stds of a batch of diagonal Gaussian distributions, along with (previously-generated) samples from those distributions, and returns the log likelihoods of those samples. (In the Tensorflow version, you will write a function that creates computation graph operations to do this; in the PyTorch version, you will directly operate on given Tensors.)

    You may find it useful to review the formula given in `this section of the RL introduction`_.

    Implement your solution in ``exercise1_1.py``, and run that file to automatically check your work.

    **Evaluation Criteria.** Your solution will be checked by comparing outputs against a known-good implementation, using a batch of random inputs.

.. _`this section of the RL introduction`: ../spinningup/rl_intro.html#stochastic-policies


.. admonition:: Exercise 1.2: Policy for PPO

    **Path to Exercise:** 

    * PyTorch version: ``spinup/exercises/pytorch/problem_set_1/exercise1_2.py``
    
    * Tensorflow version: ``spinup/exercises/tf1/problem_set_1/exercise1_2.py``

    **Path to Solution:** 

    * PyTorch version: ``spinup/exercises/pytorch/problem_set_1_solutions/exercise1_2_soln.py``

    * Tensorflow version: ``spinup/exercises/tf1/problem_set_1_solutions/exercise1_2_soln.py``

    **Instructions.** Implement an MLP diagonal Gaussian policy for PPO. 

    Implement your solution in ``exercise1_2.py``, and run that file to automatically check your work. 

    **Evaluation Criteria.** Your solution will be evaluated by running for 20 epochs in the InvertedPendulum-v2 Gym environment, and this should take in the ballpark of 3-5 minutes (depending on your machine, and other processes you are running in the background). The bar for success is reaching an average score of over 500 in the last 5 epochs, or getting to a score of 1000 (the maximum possible score) in the last 5 epochs.


.. admonition:: Exercise 1.3: Computation Graph for TD3

    **Path to Exercise.** 

    * PyTorch version: ``spinup/exercises/pytorch/problem_set_1/exercise1_3.py``
    
    * Tensorflow version: ``spinup/exercises/tf1/problem_set_1/exercise1_3.py``

    **Path to Solution.** 

    * PyTorch version: ``spinup/algos/pytorch/td3/td3.py``

    * Tensorflow version: ``spinup/algos/tf1/td3/td3.py``

    **Instructions.** Implement the main mathematical logic for the TD3 algorithm.

    As starter code, you are given the entirety of the TD3 algorithm except for the main mathematical logic (essentially, the loss functions and intermediate calculations needed for them). Find "YOUR CODE HERE" to begin. 

    You may find it useful to review the pseudocode in our `page on TD3`_.

    Implement your solution in ``exercise1_3.py``, and run that file to see the results of your work. There is no automatic checking for this exercise.

    **Evaluation Criteria.** Evaluate your code by running ``exercise1_3.py`` with HalfCheetah-v2, InvertedPendulum-v2, and one other Gym MuJoCo environment of your choosing (set via the ``--env`` flag). It is set up to use smaller neural networks (hidden sizes [128,128]) than typical for TD3, with a maximum episode length of 150, and to run for only 10 epochs. The goal is to see significant learning progress relatively quickly (in terms of wall clock time). Experiments will likely take on the order of ~10 minutes. 

    Use the ``--use_soln`` flag to run Spinning Up's TD3 instead of your implementation. Anecdotally, within 10 epochs, the score in HalfCheetah should go over 300, and the score in InvertedPendulum should max out at 150.

.. _`page on TD3`: ../algorithms/td3.html


Problem Set 2: Algorithm Failure Modes
--------------------------------------

.. admonition:: Exercise 2.1: Value Function Fitting in TRPO

    **Path to Exercise.** (Not applicable, there is no code for this one.)

    **Path to Solution.** `Solution available here. <../spinningup/exercise2_1_soln.html>`_

    Many factors can impact the performance of policy gradient algorithms, but few more drastically than the quality of the learned value function used for advantage estimation. 

    In this exercise, you will compare results between runs of TRPO where you put lots of effort into fitting the value function (``train_v_iters=80``), versus where you put very little effort into fitting the value function (``train_v_iters=0``). 

    **Instructions.** Run the following command:

    .. parsed-literal::

        python -m spinup.run trpo --env Hopper-v2 --train_v_iters[v] 0 80 --exp_name ex2-1 --epochs 250 --steps_per_epoch 4000 --seed 0 10 20 --dt

    and plot the results. (These experiments might take ~10 minutes each, and this command runs six of them.) What do you find?

.. admonition:: Exercise 2.2: Silent Bug in DDPG

    **Path to Exercise.** 

    * PyTorch version: ``spinup/exercises/pytorch/problem_set_2/exercise2_2.py``
    
    * Tensorflow version: ``spinup/exercises/tf1/problem_set_2/exercise2_2.py``

    **Path to Solution.** `Solution available here. <../spinningup/exercise2_2_soln.html>`_

    The hardest part of writing RL code is dealing with bugs, because failures are frequently silent. The code will appear to run correctly, but the agent's performance will degrade relative to a bug-free implementation---sometimes to the extent that it never learns anything.

    In this exercise, you will observe a bug in vivo and compare results against correct code. The bug is the same (conceptually, if not in exact implementation) for both the PyTorch and Tensorflow versions of this exercise. 

    **Instructions.** Run ``exercise2_2.py``, which will launch DDPG experiments with and without a bug. The non-bugged version runs the default Spinning Up implementation of DDPG, using a default method for creating the actor and critic networks. The bugged version runs the same DDPG code, except uses a bugged method for creating the networks.

    There will be six experiments in all (three random seeds for each case), and each should take in the ballpark of 10 minutes. When they're finished, plot the results. What is the difference in performance with and without the bug? 

    Without referencing the correct actor-critic code (which is to say---don't look in DDPG's ``core.py`` file), try to figure out what the bug is and explain how it breaks things.

    **Hint.** To figure out what's going wrong, think about how the DDPG code implements the DDPG computation graph. For the Tensorflow version, look at this excerpt:

    .. code-block:: python

        # Bellman backup for Q function
        backup = tf.stop_gradient(r_ph + gamma*(1-d_ph)*q_pi_targ)

        # DDPG losses
        pi_loss = -tf.reduce_mean(q_pi)
        q_loss = tf.reduce_mean((q-backup)**2)

    How could a bug in the actor-critic code have an impact here?

    **Bonus.** Are there any choices of hyperparameters which would have hidden the effects of the bug? 


Challenges
----------

.. admonition:: Write Code from Scratch

    As we suggest in `the essay <../spinningup/spinningup.html#learn-by-doing>`_, try reimplementing various deep RL algorithms from scratch. 

.. admonition:: Requests for Research

    If you feel comfortable with writing deep learning and deep RL code, consider trying to make progress on any of OpenAI's standing requests for research:

    * `Requests for Research 1 <https://openai.com/requests-for-research/>`_
    * `Requests for Research 2 <https://blog.openai.com/requests-for-research-2/>`_

================================================
FILE: docs/spinningup/extra_pg_proof1.rst
================================================
==============
Extra Material
==============

Proof for Don't Let the Past Distract You
=========================================

In this subsection, we will prove that actions should not be reinforced for rewards obtained in the past.

Expand out :math:`R(\tau)` in the expression for the `simplest policy gradient`_ to obtain:

.. math::

    \nabla_{\theta} J(\pi_{\theta}) &= \underE{\tau \sim \pi_{\theta}}{\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau)} \\
    &= \underE{\tau \sim \pi_{\theta}}{\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \sum_{t'=0}^T R(s_{t'}, a_{t'}, s_{t'+1})} \\
    &= \sum_{t=0}^{T} \sum_{t'=0}^T  \underE{\tau \sim \pi_{\theta}}{\nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(s_{t'}, a_{t'}, s_{t'+1})},

and consider the term

.. math::

    \underE{\tau \sim \pi_{\theta}}{f(t,t')} = \underE{\tau \sim \pi_{\theta}}{\nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(s_{t'}, a_{t'}, s_{t'+1})}.

We will show that for the case of :math:`t' < t` (the reward comes before the action being reinforced), this term is zero. This is a complete proof of the original claim, because after dropping terms with :math:`t' < t` from the expression, we are left with the reward-to-go form of the policy gradient, as desired:

.. math::

    \nabla_{\theta} J(\pi_{\theta}) = \underE{\tau \sim \pi_{\theta}}{\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})}

**1. Using the Marginal Distribution.** To proceed, we have to break down the expectation in :math:`\underE{\tau \sim \pi_{\theta}}{f(t,t')}`. It's an expectation over trajectories, but the expression inside the expectation only deals with a few states and actions: :math:`s_t`, :math:`a_t`, :math:`s_{t'}`, :math:`a_{t'}`, and :math:`s_{t'+1}`. So in computing the expectation, we only need to worry about the `marginal distribution`_ over these random variables. 

We derive:

.. math:: 

    \underE{\tau \sim \pi_{\theta}}{f(t,t')} &= \int_{\tau} P(\tau|\pi_{\theta}) f(t,t') \\
    &= \int_{s_t, a_t, s_{t'}, a_{t'}, s_{t'+1}} P(s_t, a_t, s_{t'}, a_{t'}, s_{t'+1} | \pi_{\theta}) f(t,t') \\
    &= \underE{s_t, a_t, s_{t'}, a_{t'}, s_{t'+1} \sim \pi_{\theta}}{f(t,t')}.

**2. Probability Chain Rule.** Joint distributions can be calculated in terms of conditional and marginal probabilities via the `chain rule of probability`_: :math:`P(A,B) = P(B|A) P(A)`. Here, we use this rule to compute

.. math::

    P(s_t, a_t, s_{t'}, a_{t'}, s_{t'+1} | \pi_{\theta}) = P(s_t, a_t | \pi_{\theta}, s_{t'}, a_{t'}, s_{t'+1}) P(s_{t'}, a_{t'}, s_{t'+1} | \pi_{\theta})


**3. Separating Expectations Over Multiple Random Variables.** If we have an expectation over two random variables :math:`A` and :math:`B`, we can split it into an inner and outer expectation, where the inner expectation treats the variable from the outer expectation as a constant. Our ability to make this split relies on the probability chain rule. Mathematically:

.. math::

    \underE{A,B}{f(A,B)} &= \int_{A,B} P(A,B) f(A,B) \\
    &= \int_{A} \int_B P(B|A) P(A) f(A,B) \\
    &= \int_A P(A) \int_B P(B|A) f(A,B) \\
    &= \int_A P(A) \underE{B}{f(A,B) \Big| A} \\
    &= \underE{A}{\underE{B}{f(A,B) \Big| A} }

An expectation over :math:`s_t, a_t, s_{t'}, a_{t'}, s_{t'+1}` can thus be expressed by

.. math:: 

    \underE{\tau \sim \pi_{\theta}}{f(t,t')} &= \underE{s_t, a_t, s_{t'}, a_{t'}, s_{t'+1} \sim \pi_{\theta}}{f(t,t')} \\
    &= \underE{s_{t'}, a_{t'}, s_{t'+1} \sim \pi_{\theta}}{\underE{s_t, a_t \sim \pi_{\theta}}{f(t,t') \Big| s_{t'}, a_{t'}, s_{t'+1}}}

**4. Constants Can Be Pulled Outside of Expectations.** If a term inside an expectation is constant with respect to the variable being expected over, it can be pulled outside of the expectation. To give an example, consider again an expectation over two random variables :math:`A` and :math:`B`, where this time, :math:`f(A,B) = h(A) g(B)`. Then, using the result from before:

.. math::

    \underE{A,B}{f(A,B)} &= \underE{A}{\underE{B}{f(A,B) \Big| A}} \\
    &= \underE{A}{\underE{B}{h(A) g(B) \Big| A}}\\
    &= \underE{A}{h(A) \underE{B}{g(B) \Big| A}}.

The function in our expectation decomposes this way, allowing us to write:

.. math:: 

    \underE{\tau \sim \pi_{\theta}}{f(t,t')} &= \underE{s_{t'}, a_{t'}, s_{t'+1} \sim \pi_{\theta}}{\underE{s_t, a_t \sim \pi_{\theta}}{f(t,t') \Big| s_{t'}, a_{t'}, s_{t'+1}}} \\
    &= \underE{s_{t'}, a_{t'}, s_{t'+1} \sim \pi_{\theta}}{\underE{s_t, a_t \sim \pi_{\theta}}{\nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(s_{t'}, a_{t'}, s_{t'+1}) \Big| s_{t'}, a_{t'}, s_{t'+1}}} \\
    &= \underE{s_{t'}, a_{t'}, s_{t'+1} \sim \pi_{\theta}}{R(s_{t'}, a_{t'}, s_{t'+1})  \underE{s_t, a_t \sim \pi_{\theta}}{\nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \Big| s_{t'}, a_{t'}, s_{t'+1}}}.
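
Steps 3 and 4 can be sanity-checked numerically. The sketch below is not part of the Spinning Up code. It assumes a small, randomly generated discrete joint distribution ``P_AB`` and tabulated functions ``h`` and ``g`` (all hypothetical names), and compares the joint expectation of :math:`h(A) g(B)` with the nested form in which :math:`h(A)` is pulled outside the inner expectation.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical joint distribution over A in {0,1,2} and B in {0,1,2,3}.
    P_AB = rng.random((3, 4))
    P_AB /= P_AB.sum()

    h = rng.normal(size=3)   # h(A), tabulated for illustration
    g = rng.normal(size=4)   # g(B), tabulated for illustration

    # Direct joint expectation of f(A,B) = h(A) g(B).
    joint = (P_AB * np.outer(h, g)).sum()

    # Chain rule: P(A,B) = P(B|A) P(A).
    P_A = P_AB.sum(axis=1)
    P_B_given_A = P_AB / P_A[:, None]

    # Inner expectation E_B[g(B) | A], one value per A; h(A) sits outside it.
    inner = P_B_given_A @ g
    nested = (P_A * h * inner).sum()

    print(joint, nested)

The two printed values should agree up to floating-point error, which is the identity used to move the reward term outside the inner expectation.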

**5. Applying the EGLP Lemma.** The last step in our proof relies on the `EGLP lemma`_. At this point, we will only worry about the innermost expectation, 

.. math::

    \underE{s_t, a_t \sim \pi_{\theta}}{\nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \Big| s_{t'}, a_{t'}, s_{t'+1}} = \int_{s_t, a_t} P(s_t, a_t | \pi_{\theta}, s_{t'}, a_{t'}, s_{t'+1}) \nabla_{\theta} \log \pi_{\theta}(a_t |s_t).

We now have to make a distinction between two cases: :math:`t' < t`, the case where the reward happened before the action, and :math:`t' \geq t`, where it didn't.

**Case One: Reward Before Action.** If :math:`t' < t`, then the conditional probability of the action :math:`a_t` comes from the policy:

.. math::

    P(s_t, a_t | \pi_{\theta}, s_{t'}, a_{t'}, s_{t'+1}) = \pi_{\theta}(a_t | s_t) P(s_t | \pi_{\theta}, s_{t'}, a_{t'}, s_{t'+1}),

so the innermost expectation can be broken down further into

.. math::

    \underE{s_t, a_t \sim \pi_{\theta}}{\nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \Big| s_{t'}, a_{t'}, s_{t'+1}} &= \int_{s_t, a_t} P(s_t, a_t | \pi_{\theta}, s_{t'}, a_{t'}, s_{t'+1}) \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \\
    &= \int_{s_t} P(s_t | \pi_{\theta}, s_{t'}, a_{t'}, s_{t'+1}) \int_{a_t} \pi_{\theta}(a_t | s_t) \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \\
    &= \underE{s_t \sim \pi_{\theta}}{ \underE{a_t \sim \pi_{\theta}}{\nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \Big| s_t } \Big| s_{t'}, a_{t'}, s_{t'+1}}.

The EGLP lemma says that 

.. math::

    \underE{a_t \sim \pi_{\theta}}{\nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \Big| s_t } = 0,

for every state :math:`s_t`, so the outer expectations are zero as well. We conclude that for :math:`t' < t`, :math:`\underE{\tau \sim \pi_{\theta}}{f(t,t')} = 0`. 
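
As a quick numerical illustration of the EGLP lemma, the sketch below assumes a softmax policy over four discrete actions whose parameters are the logits themselves (an assumption made for this example only). For that parameterization the gradient of :math:`\log \pi_{\theta}(a)` with respect to the logits is the one-hot vector for :math:`a` minus the probability vector.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    n_acts = 4
    theta = rng.normal(size=n_acts)       # logits for one fixed state s_t

    # Softmax policy pi(a | s_t), computed stably.
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()

    # Row a holds grad_theta log pi(a) = onehot(a) - pi.
    grad_log_pi = np.eye(n_acts) - pi

    # Expectation of the grad-log-prob over a ~ pi.
    expected_grad = pi @ grad_log_pi

    print(expected_grad)

Every entry of ``expected_grad`` should be zero up to floating-point error.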

**Case Two: Reward After Action.** What about the :math:`t' \geq t` case, though? Why doesn't the same logic apply? In this case, the conditional probabilities for :math:`a_t` can't be broken down the same way, because you're conditioning **on the future.** Think about it like this: let's say that every day, in the morning, you make a choice between going for a jog and going to work early, and you have a 50-50 chance of each option. If you condition on a future where you went to work early, what are the odds that you went for a jog? Clearly, you didn't. But if you're conditioning on the past---before you made the decision---what are the odds that you will later go for a jog? Now it's back to 50-50. 

So in the case where :math:`t' \geq t`, the conditional distribution over actions :math:`a_t` is **not** :math:`\pi_{\theta}(a_t|s_t)`, and the EGLP lemma does not apply. 
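
The effect of conditioning on the future can be made concrete with Bayes' rule. The sketch below is illustrative only: it assumes a single state with two equally likely actions and a made-up transition probability ``P_next1_given_a`` for reaching next state 1, and it again treats the policy as a softmax over logits.

.. code-block:: python

    import numpy as np

    pi = np.array([0.5, 0.5])                 # pi(a | s_t) for actions 0 and 1
    P_next1_given_a = np.array([0.2, 0.9])    # hypothetical P(s_{t+1} = 1 | s_t, a)

    # Bayes' rule: P(a | s_t, s_{t+1} = 1) is proportional to
    # P(s_{t+1} = 1 | s_t, a) * pi(a | s_t).
    posterior = P_next1_given_a * pi
    posterior /= posterior.sum()

    # Softmax-logit parameterization: grad log pi(a) = onehot(a) - pi.
    grad_log_pi = np.eye(2) - pi

    eglp_under_policy = pi @ grad_log_pi             # zero, as the lemma says
    eglp_under_posterior = posterior @ grad_log_pi   # nonzero once we condition on the future

    print(posterior, eglp_under_policy, eglp_under_posterior)

Under these assumed numbers the posterior over actions shifts away from 50-50 toward action 1, so the expected grad-log-prob no longer vanishes.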

.. _`simplest policy gradient`: ../spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient
.. _`marginal distribution`: https://en.wikipedia.org/wiki/Marginal_distribution
.. _`chain rule of probability`: https://en.wikipedia.org/wiki/Chain_rule_(probability)
.. _`EGLP lemma`: ../spinningup/rl_intro3.html#expected-grad-log-prob-lemma

================================================
FILE: docs/spinningup/extra_pg_proof2.rst
================================================
==============
Extra Material
==============

Proof for Using Q-Function in Policy Gradient Formula
=====================================================

In this section, we will show that

.. math::

    \nabla_{\theta} J(\pi_{\theta}) &= \underE{\tau \sim \pi_{\theta}}{\sum_{t=0}^{T} \Big( \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \Big) Q^{\pi_{\theta}}(s_t, a_t)},

for the finite-horizon undiscounted return setting. (An analogous result holds in the infinite-horizon discounted case using basically the same proof.)


The proof of this claim depends on the `law of iterated expectations`_. First, let's rewrite the expression for the policy gradient, starting from the reward-to-go form (using the notation :math:`\hat{R}_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})` to help shorten things):

.. math::

    \nabla_{\theta} J(\pi_{\theta}) &= \underE{\tau \sim \pi_{\theta}}{\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \hat{R}_t} \\
    &= \sum_{t=0}^{T} \underE{\tau \sim \pi_{\theta}}{\nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \hat{R}_t}

Define :math:`\tau_{:t} = (s_0, a_0, ..., s_t, a_t)` as the trajectory up to time :math:`t`, and :math:`\tau_{t:}` as the remainder of the trajectory after that. By the law of iterated expectations, we can break up the preceding expression into:

.. math::

    \nabla_{\theta} J(\pi_{\theta}) &= \sum_{t=0}^{T} \underE{\tau_{:t} \sim \pi_{\theta}}{ \underE{\tau_{t:} \sim \pi_{\theta}}{ \left. \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \hat{R}_t \right| \tau_{:t}}}

The grad-log-prob is constant with respect to the inner expectation (because it depends on :math:`s_t` and :math:`a_t`, which the inner expectation conditions on as fixed in :math:`\tau_{:t}`), so it can be pulled out, leaving:

.. math::

    \nabla_{\theta} J(\pi_{\theta}) &= \sum_{t=0}^{T} \underE{\tau_{:t} \sim \pi_{\theta}}{ \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \underE{\tau_{t:} \sim \pi_{\theta}}{ \left. \hat{R}_t \right| \tau_{:t}}}

In Markov Decision Processes, the future only depends on the most recent state and action. As a result, the inner expectation---which expects over the future, conditioned on the entirety of the past (everything up to time :math:`t`)---is equal to the same expectation if it only conditioned on the last timestep (just :math:`(s_t,a_t)`):

.. math::

    \underE{\tau_{t:} \sim \pi_{\theta}}{ \left. \hat{R}_t \right| \tau_{:t}} = \underE{\tau_{t:} \sim \pi_{\theta}}{ \left. \hat{R}_t \right| s_t, a_t},

which is the *definition* of :math:`Q^{\pi_{\theta}}(s_t, a_t)`: the expected return, starting from state :math:`s_t` and action :math:`a_t`, when acting on-policy for the rest of the trajectory. 

The result follows immediately.
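
The identity can be checked by exact enumeration on a tiny problem. The sketch below is an illustration, not part of Spinning Up. It assumes a random tabular MDP with two states, two actions and three timesteps, a tabular softmax policy, and rewards that depend only on :math:`(s_t, a_t)`. Because the horizon is finite, the Q-function is stored per timestep as ``Q[t]``.

.. code-block:: python

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    S, A, H = 2, 2, 3                            # states, actions, timesteps t = 0..H-1

    theta = rng.normal(size=(S, A))              # tabular softmax policy parameters
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'], hypothetical dynamics
    R = rng.normal(size=(S, A))                  # assumed reward R(s, a)
    rho0 = np.array([0.6, 0.4])                  # assumed start-state distribution

    def grad_log_pi(s, a):
        """Gradient of log pi(a|s) with respect to theta for a tabular softmax."""
        g = np.zeros((S, A))
        g[s] = -pi[s]
        g[s, a] += 1.0
        return g

    # Time-dependent Q-function by backward recursion.
    Q = np.zeros((H, S, A))
    for t in reversed(range(H)):
        Q[t] = R
        if t + 1 < H:
            V_next = (pi * Q[t + 1]).sum(axis=1)   # V_{t+1}(s')
            Q[t] = R + P @ V_next

    grad_rtg = np.zeros((S, A))                  # reward-to-go form
    grad_q = np.zeros((S, A))                    # Q-function form
    for traj in itertools.product(range(S), range(A), repeat=H):
        states, actions = traj[0::2], traj[1::2]
        prob = rho0[states[0]]
        for t in range(H):
            prob *= pi[states[t], actions[t]]
            if t + 1 < H:
                prob *= P[states[t], actions[t], states[t + 1]]
        rews = [R[s, a] for s, a in zip(states, actions)]
        for t in range(H):
            g = grad_log_pi(states[t], actions[t])
            grad_rtg += prob * g * sum(rews[t:])
            grad_q += prob * g * Q[t, states[t], actions[t]]

    print(np.abs(grad_rtg - grad_q).max())

The printed difference should be zero up to floating-point error, since both sums compute the same policy gradient.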

.. _`law of iterated expectations`: https://en.wikipedia.org/wiki/Law_of_total_expectation


================================================
FILE: docs/spinningup/extra_tf_pg_implementation.rst
================================================
==================================================================
Extra Material: Tensorflow Policy Gradient Implementation Examples
==================================================================


Implementing the Simplest Policy Gradient
=========================================

We give a short Tensorflow implementation of this simple version of the policy gradient algorithm in ``spinup/examples/tf1/pg_math/1_simple_pg.py``. (It can also be viewed `on github <https://github.com/openai/spinningup/blob/master/spinup/examples/tf1/pg_math/1_simple_pg.py>`_.) It is only 122 lines long, so we highly recommend reading through it in depth. While we won't go through the entirety of the code here, we'll highlight and explain a few important pieces.

**1. Making the Policy Network.** 

.. code-block:: python
    :linenos:
    :lineno-start: 25

    # make core of policy network
    obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
    logits = mlp(obs_ph, sizes=hidden_sizes+[n_acts])

    # make action selection op (outputs int actions, sampled from policy)
    actions = tf.squeeze(tf.multinomial(logits=logits,num_samples=1), axis=1)

This block builds a feedforward neural network categorical policy. (See the `Stochastic Policies`_ section in Part 1 for a refresher.) The ``logits`` tensor can be used to construct log-probabilities and probabilities for actions, and the ``actions`` tensor samples actions based on the probabilities implied by ``logits``.
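
For readers who want to see what sampling from ``logits`` amounts to without a Tensorflow session, here is a NumPy sketch. It uses the Gumbel-max trick, which is a swapped-in technique chosen for brevity and is not claimed to match how ``tf.multinomial`` works internally. The helper names ``sample_actions`` and ``softmax`` are hypothetical.

.. code-block:: python

    import numpy as np

    def sample_actions(logits, rng):
        """Sample one int action per row, with probabilities softmax(logits)."""
        gumbel_noise = rng.gumbel(size=logits.shape)
        return np.argmax(logits + gumbel_noise, axis=1)

    def softmax(logits):
        """Row-wise softmax, shifted by the row max for numerical stability."""
        z = logits - logits.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    rng = np.random.default_rng(0)
    logits = np.array([[2.0, 0.0, -1.0]])

    # Draw many samples from the same logits and compare with the softmax.
    samples = sample_actions(np.repeat(logits, 50_000, axis=0), rng)
    empirical = np.bincount(samples, minlength=3) / samples.size

    print(empirical, softmax(logits)[0])

The empirical action frequencies should be close to the softmax probabilities implied by the logits.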

.. _`Stochastic Policies`: ../spinningup/rl_intro.html#stochastic-policies

**2. Making the Loss Function.**

.. code-block:: python
    :linenos:
    :lineno-start: 32

    # make loss function whose gradient, for the right data, is policy gradient
    weights_ph = tf.placeholder(shape=(None,), dtype=tf.float32)
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32)
    action_masks = tf.one_hot(act_ph, n_acts)
    log_probs = tf.reduce_sum(action_masks * tf.nn.log_softmax(logits), axis=1)
    loss = -tf.reduce_mean(weights_ph * log_probs)


In this block, we build a "loss" function for the policy gradient algorithm. When the right data is plugged in, the gradient of this loss is equal to the policy gradient. The right data means a set of (state, action, weight) tuples collected while acting according to the current policy, where the weight for a state-action pair is the return from the episode to which it belongs. (Although as we will show in later subsections, there are other values you can plug in for the weight which also work correctly.)
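
The same computation can be written in plain NumPy to see what the mask does. This is a sketch under assumed toy inputs, and ``log_softmax`` and ``pseudo_loss`` are hypothetical helper names. It computes the value of the "loss" only; the gradient is what the Tensorflow graph provides.

.. code-block:: python

    import numpy as np

    def log_softmax(logits):
        """Row-wise log-softmax, shifted by the row max for stability."""
        z = logits - logits.max(axis=1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    def pseudo_loss(logits, acts, weights):
        """Negative mean of weight * log pi(a|s), using a one-hot action mask."""
        n_acts = logits.shape[1]
        action_masks = np.eye(n_acts)[acts]      # one-hot rows, like tf.one_hot
        log_probs = (action_masks * log_softmax(logits)).sum(axis=1)
        return -(weights * log_probs).mean()

    # Toy batch: two states, three actions, both steps from an episode with return 10.
    logits = np.array([[1.0, 2.0, 0.5], [0.0, 0.0, 0.0]])
    acts = np.array([1, 2])
    weights = np.array([10.0, 10.0])

    loss = pseudo_loss(logits, acts, weights)
    print(loss)

Multiplying by the one-hot mask and summing over the action axis is equivalent to picking out, for each row, the log-probability of the action that was actually taken.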


.. admonition:: You Should Know
    
    Even though we describe this as a loss function, it is **not** a loss function in the typical sense from supervised learning. There are two main differences from standard loss functions.

    **1. The data distribution depends on the parameters.** A loss function is usually defined on a fixed data distribution which is independent of the parameters we aim to optimize. Not so here, where the data must be sampled using the most recent policy. 

    **2. It doesn't measure performance.** A loss function usually evaluates the performance metric that we care about. Here, we care about expected return, :math:`J(\pi_{\theta})`, but our "loss" function does not approximate this at all, even in expectation. This "loss" function is only useful to us because, when evaluated at the current parameters, with data generated by the current parameters, it has the negative gradient of performance. 

    But after that first step of gradient descent, there is no more connection to performance. This means that minimizing this "loss" function, for a given batch of data, has *no* guarantee whatsoever of improving expected return. You can send this loss to :math:`-\infty` and policy performance could crater; in fact, it usually will. Sometimes a deep RL researcher might describe this outcome as the policy "overfitting" to a batch of data. This is descriptive, but should not be taken literally because it does not refer to generalization error.

    We raise this point because it is common for ML practitioners to interpret a loss function as a useful signal during training---"if the loss goes down, all is well." In policy gradients, this intuition is wrong, and you should only care about average return. The loss function means nothing.




.. admonition:: You Should Know
    
    The approach used here to make the ``log_probs`` tensor---creating an action mask, and using it to select out particular log probabilities---*only* works for categorical policies. It does not work in general. 
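
For contrast, a continuous-action policy needs a different log-probability computation. The sketch below assumes a diagonal Gaussian policy with mean ``mu`` and log standard deviation ``log_std`` (hypothetical names for this example) and evaluates the standard Gaussian log-likelihood in NumPy. It is not taken from ``1_simple_pg.py``.

.. code-block:: python

    import numpy as np

    def gaussian_log_prob(act, mu, log_std):
        """Log-likelihood of actions under a diagonal Gaussian, summed over action dims."""
        std = np.exp(log_std)
        per_dim = ((act - mu) / std) ** 2 + 2.0 * log_std + np.log(2.0 * np.pi)
        return -0.5 * per_dim.sum(axis=-1)

    # Toy batch: two 2-dimensional actions under a standard normal policy.
    act = np.array([[0.0, 0.0], [1.0, -1.0]])
    mu = np.zeros((2, 2))
    log_std = np.zeros(2)

    log_probs = gaussian_log_prob(act, mu, log_std)
    print(log_probs)

No action mask appears here: the log-probability is a closed-form function of the sampled action and the distribution parameters.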



**3. Running One Epoch of Training.**

.. code-block:: python
    :linenos:
    :lineno-start: 45

        # for training policy
        def train_one_epoch():
            # make some empty lists for logging.
            batch_obs = []          # for observations
            batch_acts = []         # for actions
            batch_weights = []      # for R(tau) weighting in policy gradient
            batch_rets = []         # for measuring episode returns
            batch_lens = []         # for measuring episode lengths

            # reset episode-specific variables
            obs = env.reset()       # first obs comes from starting distribution
            done = False            # signal from environment that episode is over
            ep_rews = []            # list for rewards accrued throughout ep

            # render first episode of each epoch
            finished_rendering_this_epoch = False

            # collect experience by acting in the environment with current policy
            while True:

                # rendering
                if not(finished_rendering_this_epoch):
                    env.render()

                # save obs
                batch_obs.append(obs.copy())

                # act in the environment
                act = sess.run(actions, {obs_ph: obs.reshape(1,-1)})[0]
                obs, rew, done, _ = env.step(act)

                # save action, reward
                batch_acts.append(act)
                ep_rews.append(rew)

                if done:
                    # if episode is over, record info about episode
                    ep_ret, ep_len = sum(ep_rews), len(ep_rews)
                    batch_rets.append(ep_ret)
                    batch_lens.append(ep_len)

                    # the weight for each logprob(a|s) is R(tau)
                    batch_weights += [ep_ret] * ep_len

                    # reset episode-specific variables
                    obs, done, ep_rews = env.reset(), False, []

                    # won't render again this epoch
                    finished_rendering_this_epoch = True

                    # end experience loop if we have enough of it
                    if len(batch_obs) > batch_size:
                        break

            # take a single policy gradient update step
            batch_loss, _ = sess.run([loss, train_op],
                                     feed_dict={
                                        obs_ph: np.array(batch_obs),
                                        act_ph: np.array(batch_acts),
                                        weights_ph: np.array(batch_weights)
                                     })
            return batch_loss, batch_rets, batch_lens

The ``train_one_epoch()`` function runs one "epoch" of policy gradient, which we define to be 

1) the experience collection step (L62-97), where the agent acts for some number of episodes in the environment using the most recent policy, followed by 

2) a single policy gradient update step (L99-105). 

The main loop of the algorithm just repeatedly calls ``train_one_epoch()``. 




Implementing Reward-to-Go Policy Gradient
=========================================

We give a short Tensorflow implementation of the reward-to-go policy gradient in ``spinup/examples/tf1/pg_math/2_rtg_pg.py``. (It can also be viewed `on github <https://github.com/openai/spinningup/blob/master/spinup/examples/tf1/pg_math/2_rtg_pg.py>`_.) 

The only thing that has changed from ``1_simple_pg.py`` is that we now use different weights in the loss function. The code modification is very slight: we add a new function, and change two other lines. The new function is:

.. code-block:: python
    :linenos:
    :lineno-start: 12

    def reward_to_go(rews):
        n = len(rews)
        rtgs = np.zeros_like(rews)
        for i in reversed(range(n)):
            rtgs[i] = rews[i] + (rtgs[i+1] if i+1 < n else 0)
        return rtgs


And then we tweak the old L86-87 from:

.. code-block:: python
    :linenos:
    :lineno-start: 86

                    # the weight for each logprob(a|s) is R(tau)
                    batch_weights += [ep_ret] * ep_len

to:

.. code-block:: python
    :linenos:
    :lineno-start: 93

                    # the weight for each logprob(a_t|s_t) is reward-to-go from t
                    batch_weights += list(reward_to_go(ep_rews))
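
As an aside, the same quantity can be computed without an explicit Python loop. The sketch below is an alternative formulation, not the code in ``2_rtg_pg.py``. It reproduces the loop version under the hypothetical name ``reward_to_go_loop`` and compares it with a reversed cumulative sum, ``reward_to_go_cumsum``.

.. code-block:: python

    import numpy as np

    def reward_to_go_loop(rews):
        """Loop version, as in the excerpt above."""
        n = len(rews)
        rtgs = np.zeros_like(rews)
        for i in reversed(range(n)):
            rtgs[i] = rews[i] + (rtgs[i+1] if i+1 < n else 0)
        return rtgs

    def reward_to_go_cumsum(rews):
        """Vectorized version: cumulative sum taken from the end of the episode."""
        rews = np.asarray(rews, dtype=np.float64)
        return np.cumsum(rews[::-1])[::-1]

    ep_rews = [1.0, 0.0, 2.0, 3.0]
    print(reward_to_go_loop(ep_rews), reward_to_go_cumsum(ep_rews))

Both functions return, for each timestep, the sum of rewards from that timestep to the end of the episode.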


================================================
FILE: docs/spinningup/keypapers.rst
================================================
=====================
Key Papers in Deep RL
=====================

What follows is a list of papers in deep RL that are worth reading. This is *far* from comprehensive, but should provide a useful starting point for someone looking to do research in the field.

.. contents:: Table of Contents
    :depth: 2


1. Model-Free RL
================

a. Deep Q-Learning
------------------


.. [#] `Playing Atari with Deep Reinforcement Learning <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`_, Mnih et al, 2013. **Algorithm: DQN.**

.. [#] `Deep Recurrent Q-Learning for Partially Observable MDPs <https://arxiv.org/abs/1507.06527>`_, Hausknecht and Stone, 2015. **Algorithm: Deep Recurrent Q-Learning.**

.. [#] `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_, Wang et al, 2015. **Algorithm: Dueling DQN.**

.. [#] `Deep Reinforcement Learning with Double Q-learning <https://arxiv.org/abs/1509.06461>`_, van Hasselt et al, 2015. **Algorithm: Double DQN.**

.. [#] `Prioritized Experience Replay <https://arxiv.org/abs/1511.05952>`_, Schaul et al, 2015. **Algorithm: Prioritized Experience Replay (PER).**

.. [#] `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_, Hessel et al, 2017. **Algorithm: Rainbow DQN.**


b. Policy Gradients
-------------------


.. [#] `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_, Mnih et al, 2016. **Algorithm: A3C.**

.. [#] `Trust Region Policy Optimization <https://arxiv.org/abs/1502.05477>`_, Schulman et al, 2015. **Algorithm: TRPO.**

.. [#] `High-Dimensional Continuous Control Using Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_, Schulman et al, 2015. **Algorithm: GAE.**

.. [#] `Proximal Policy Optimization Algorithms <https://arxiv.org/abs/1707.06347>`_, Schulman et al, 2017. **Algorithm: PPO-Clip, PPO-Penalty.**

.. [#] `Emergence of Locomotion Behaviours in Rich Environments <https://arxiv.org/abs/1707.02286>`_, Heess et al, 2017. **Algorithm: PPO-Penalty.**

.. [#] `Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation <https://arxiv.org/abs/1708.05144>`_, Wu et al, 2017. **Algorithm: ACKTR.**

.. [#] `Sample Efficient Actor-Critic with Experience Replay <https://arxiv.org/abs/1611.01224>`_, Wang et al, 2016. **Algorithm: ACER.**

.. [#] `Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor <https://arxiv.org/abs/1801.01290>`_, Haarnoja et al, 2018. **Algorithm: SAC.**

c. Deterministic Policy Gradients
---------------------------------


.. [#] `Deterministic Policy Gradient Algorithms <http://proceedings.mlr.press/v32/silver14.pdf>`_, Silver et al, 2014. **Algorithm: DPG.**

.. [#] `Continuous Control With Deep Reinforcement Learning <https://arxiv.org/abs/1509.02971>`_, Lillicrap et al, 2015. **Algorithm: DDPG.**

.. [#] `Addressing Function Approximation Error in Actor-Critic Methods <https://arxiv.org/abs/1802.09477>`_, Fujimoto et al, 2018. **Algorithm: TD3.**


d. Distributional RL
--------------------

.. [#] `A Distributional Perspective on Reinforcement Learning <https://arxiv.org/abs/1707.06887>`_, Bellemare et al, 2017. **Algorithm: C51.** 

.. [#] `Distributional Reinforcement Learning with Quantile Regression <https://arxiv.org/abs/1710.10044>`_, Dabney et al, 2017. **Algorithm: QR-DQN.**

.. [#] `Implicit Quantile Networks for Distributional Reinforcement Learning <https://arxiv.org/abs/1806.06923>`_, Dabney et al, 2018. **Algorithm: IQN.**

.. [#] `Dopamine: A Research Framework for Deep Reinforcement Learning <https://openreview.net/forum?id=ByG_3s09KX>`_, Anonymous, 2018. **Contribution:** Introduces Dopamine, a code repository containing implementations of DQN, C51, IQN, and Rainbow. `Code link. <https://github.com/google/dopamine>`_

e. Policy Gradients with Action-Dependent Baselines
---------------------------------------------------

.. [#] `Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic <https://arxiv.org/abs/1611.02247>`_, Gu et al, 2016. **Algorithm: Q-Prop.**

.. [#] `Action-depedent Control Variates for Policy Optimization via Stein's Identity <https://arxiv.org/abs/1710.11198>`_, Liu et al, 2017. **Algorithm: Stein Control Variates.**

.. [#] `The Mirage of Action-Dependent Baselines in Reinforcement Learning <https://arxiv.org/abs/1802.10031>`_, Tucker et al, 2018. **Contribution:** interestingly, critiques and reevaluates claims from earlier papers (including Q-Prop and Stein control variates) and finds important methodological errors in them.


f. Path-Consistency Learning
----------------------------

.. [#] `Bridging the Gap Between Value and Policy Based Reinforcement Learning <https://arxiv.org/abs/1702.08892>`_, Nachum et al, 2017. **Algorithm: PCL.**

.. [#] `Trust-PCL: An Off-Policy Trust Region Method for Continuous Control <https://arxiv.org/abs/1707.01891>`_, Nachum et al, 2017. **Algorithm: Trust-PCL.**

g. Other Directions for Combining Policy-Learning and Q-Learning
----------------------------------------------------------------

.. [#] `Combining Policy Gradient and Q-learning <https://arxiv.org/abs/1611.01626>`_, O'Donoghue et al, 2016. **Algorithm: PGQL.**

.. [#] `The Reactor: A Fast and Sample-Efficient Actor-Critic Agent for Reinforcement Learning <https://arxiv.org/abs/1704.04651>`_, Gruslys et al, 2017. **Algorithm: Reactor.**

.. [#] `Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning <http://papers.nips.cc/paper/6974-interpolated-policy-gradient-merging-on-policy-and-off-policy-gradient-estimation-for-deep-reinforcement-learning>`_, Gu et al, 2017. **Algorithm: IPG.**

.. [#] `Equivalence Between Policy Gradients and Soft Q-Learning <https://arxiv.org/abs/1704.06440>`_, Schulman et al, 2017. **Contribution:** Reveals a theoretical link between these two families of RL algorithms.


h. Evolutionary Algorithms
--------------------------

.. [#] `Evolution Strategies as a Scalable Alternative to Reinforcement Learning <https://arxiv.org/abs/1703.03864>`_, Salimans et al, 2017. **Algorithm: ES.**



2. Exploration
==============

a. Intrinsic Motivation
-----------------------

.. [#] `VIME: Variational Information Maximizing Exploration <https://arxiv.org/abs/1605.09674>`_, Houthooft et al, 2016. **Algorithm: VIME.**

.. [#] `Unifying Count-Based Exploration and Intrinsic Motivation <https://arxiv.org/abs/1606.01868>`_, Bellemare et al, 2016. **Algorithm: CTS-based Pseudocounts.**

.. [#] `Count-Based Exploration with Neural Density Models <https://arxiv.org/abs/1703.01310>`_, Ostrovski et al, 2017. **Algorithm: PixelCNN-based Pseudocounts.**

.. [#] `#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning <https://arxiv.org/abs/1611.04717>`_, Tang et al, 2016. **Algorithm: Hash-based Counts.**

.. [#] `EX2: Exploration with Exemplar Models for Deep Reinforcement Learning <https://arxiv.org/abs/1703.01260>`_, Fu et al, 2017. **Algorithm: EX2.**

.. [#] `Curiosity-driven Exploration by Self-supervised Prediction <https://arxiv.org/abs/1705.05363>`_, Pathak et al, 2017. **Algorithm: Intrinsic Curiosity Module (ICM).**

.. [#] `Large-Scale Study of Curiosity-Driven Learning <https://arxiv.org/abs/1808.04355>`_, Burda et al, 2018. **Contribution:** Systematic analysis of how surprisal-based intrinsic motivation performs in a wide variety of environments.

.. [#] `Exploration by Random Network Distillation <https://arxiv.org/abs/1810.12894>`_, Burda et al, 2018. **Algorithm: RND.**


b. Unsupervised RL
------------------

.. [#] `Variational Intrinsic Control <https://arxiv.org/abs/1611.07507>`_, Gregor et al, 2016. **Algorithm: VIC.**

.. [#] `Diversity is All You Need: Learning Skills without a Reward Function <https://arxiv.org/abs/1802.06070>`_, Eysenbach et al, 2018. **Algorithm: DIAYN.**

.. [#] `Variational Option Discovery Algorithms <https://arxiv.org/abs/1807.10299>`_, Achiam et al, 2018. **Algorithm: VALOR.**


3. Transfer and Multitask RL
============================

.. [#] `Progressive Neural Networks <https://arxiv.org/abs/1606.04671>`_, Rusu et al, 2016. **Algorithm: Progressive Networks.**

.. [#] `Universal Value Function Approximators <http://proceedings.mlr.press/v37/schaul15.pdf>`_, Schaul et al, 2015. **Algorithm: UVFA.**

.. [#] `Reinforcement Learning with Unsupervised Auxiliary Tasks <https://arxiv.org/abs/1611.05397>`_, Jaderberg et al, 2016. **Algorithm: UNREAL.**

.. [#] `The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously <https://arxiv.org/abs/1707.03300>`_, Cabi et al, 2017. **Algorithm: IU Agent.**

.. [#] `PathNet: Evolution Channels Gradient Descent in Super Neural Networks <https://arxiv.org/abs/1701.08734>`_, Fernando et al, 2017. **Algorithm: PathNet.**

.. [#] `Mutual Alignment Transfer Learning <https://arxiv.org/abs/1707.07907>`_, Wulfmeier et al, 2017. **Algorithm: MATL.**

.. [#] `Learning an Embedding Space for Transferable Robot Skills <https://openreview.net/forum?id=rk07ZXZRb&noteId=rk07ZXZRb>`_, Hausman et al, 2018. 

.. [#] `Hindsight Experience Replay <https://arxiv.org/abs/1707.01495>`_, Andrychowicz et al, 2017. **Algorithm: Hindsight Experience Replay (HER).**

4. Hierarchy
============

.. [#] `Strategic Attentive Writer for Learning Macro-Actions <https://arxiv.org/abs/1606.04695>`_, Vezhnevets et al, 2016. **Algorithm: STRAW.**

.. [#] `FeUdal Networks for Hierarchical Reinforcement Learning <https://arxiv.org/abs/1703.01161>`_, Vezhnevets et al, 2017. **Algorithm: Feudal Networks.**

.. [#] `Data-Efficient Hierarchical Reinforcement Learning <https://arxiv.org/abs/1805.08296>`_, Nachum et al, 2018. **Algorithm: HIRO.**

5. Memory
=========

.. [#] `Model-Free Episodic Control <https://arxiv.org/abs/1606.04460>`_, Blundell et al, 2016. **Algorithm: MFEC.**


.. [#] `Neural Episodic Control <https://arxiv.org/abs/1703.01988>`_, Pritzel et al, 2017. **Algorithm: NEC.**

.. [#] `Neural Map: Structured Memory for Deep Reinforcement Learning <https://arxiv.org/abs/1702.08360>`_, Parisotto and Salakhutdinov, 2017. **Algorithm: Neural Map.**

.. [#] `Unsupervised Predictive Memory in a Goal-Directed Agent <https://arxiv.org/abs/1803.10760>`_, Wayne et al, 2018. **Algorithm: MERLIN.**

.. [#] `Relational Recurrent Neural Networks <https://arxiv.org/abs/1806.01822>`_, Santoro et al, 2018. **Algorithm: RMC.**

6. Model-Based RL
=================

a. Model is Learned
-------------------

.. [#] `Imagination-Augmented Agents for Deep Reinforcement Learning <https://arxiv.org/abs/1707.06203>`_, Weber et al, 2017. **Algorithm: I2A.**

.. [#] `Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning <https://arxiv.org/abs/1708.02596>`_, Nagabandi et al, 2017. **Algorithm: MBMF.**

.. [#] `Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning <https://arxiv.org/abs/1803.00101>`_, Feinberg et al, 2018. **Algorithm: MVE.**

.. [#] `Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion <https://arxiv.org/abs/1807.01675>`_, Buckman et al, 2018. **Algorithm: STEVE.**

.. [#] `Model-Ensemble Trust-Region Policy Optimization <https://openreview.net/forum?id=SJJinbWRZ&noteId=SJJinbWRZ>`_, Kurutach et al, 2018. **Algorithm: ME-TRPO.**

.. [#] `Model-Based Reinforcement Learning via Meta-Policy Optimization <https://arxiv.org/abs/1809.05214>`_, Clavera et al, 2018. **Algorithm: MB-MPO.**

.. [#] `Recurrent World Models Facilitate Policy Evolution <https://arxiv.org/abs/1809.01999>`_, Ha and Schmidhuber, 2018. 

b. Model is Given
-----------------

.. [#] `Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm <https://arxiv.org/abs/1712.01815>`_, Silver et al, 2017. **Algorithm: AlphaZero.**

.. [#] `Thinking Fast and Slow with Deep Learning and Tree Search <https://arxiv.org/abs/1705.08439>`_, Anthony et al, 2017. **Algorithm: ExIt.**

7. Meta-RL
==========

.. [#] `RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning <https://arxiv.org/abs/1611.02779>`_, Duan et al, 2016. **Algorithm: RL^2.**

.. [#] `Learning to Reinforcement Learn <https://arxiv.org/abs/1611.05763>`_, Wang et al, 2016. 

.. [#] `Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks <https://arxiv.org/abs/1703.03400>`_, Finn et al, 2017. **Algorithm: MAML.**

.. [#] `A Simple Neural Attentive Meta-Learner <https://openreview.net/forum?id=B1DmUzWAW&noteId=B1DmUzWAW>`_, Mishra et al, 2018. **Algorithm: SNAIL.**

8. Scaling RL
=============

.. [#] `Accelerated Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1803.02811>`_, Stooke and Abbeel, 2018. **Contribution:** Systematic analysis of parallelization in deep RL across algorithms. 

.. [#] `IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures <https://arxiv.org/abs/1802.01561>`_, Espeholt et al, 2018. **Algorithm: IMPALA.**

.. [#] `Distributed Prioritized Experience Replay <https://openreview.net/forum?id=H1Dy---0Z>`_, Horgan et al, 2018. **Algorithm: Ape-X.**

.. [#] `Recurrent Experience Replay in Distributed Reinforcement Learning <https://openreview.net/forum?id=r1lyTjAqYX>`_, Anonymous, 2018. **Algorithm: R2D2.**

.. [#] `RLlib: Abstractions for Distributed Reinforcement Learning <https://arxiv.org/abs/1712.09381>`_, Liang et al, 2017. **Contribution:** A scalable library of RL algorithm implementations. `Documentation link. <https://ray.readthedocs.io/en/latest/rllib.html>`_


9. RL in the Real World
=======================

.. [#] `Benchmarking Reinforcement Learning Algorithms on Real-World Robots <https://arxiv.org/abs/1809.07731>`_, Mahmood et al, 2018. 

.. [#] `Learning Dexterous In-Hand Manipulation <https://arxiv.org/abs/1808.00177>`_, OpenAI, 2018. 

.. [#] `QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation <https://arxiv.org/abs/1806.10293>`_, Kalashnikov et al, 2018. **Algorithm: QT-Opt.**

.. [#] `Horizon: Facebook's Open Source Applied Reinforcement Learning Platform <https://arxiv.org/abs/1811.00260>`_, Gauci et al, 2018. 


10. Safety
==========

.. [#] `Concrete Problems in AI Safety <https://arxiv.org/abs/1606.06565>`_, Amodei et al, 2016. **Contribution:** establishes a taxonomy of safety problems, serving as an important jumping-off point for future research. We need to solve these!

.. [#] `Deep Reinforcement Learning From Human Preferences <https://arxiv.org/abs/1706.03741>`_, Christiano et al, 2017. **Algorithm: LFP.**

.. [#] `Constrained Policy Optimization <https://arxiv.org/abs/1705.10528>`_, Achiam et al, 2017. **Algorithm: CPO.**

.. [#] `Safe Exploration in Continuous Action Spaces <https://arxiv.org/abs/1801.08757>`_, Dalal et al, 2018. **Algorithm: DDPG+Safety Layer.**

.. [#] `Trial without Error: Towards Safe Reinforcement Learning via Human Intervention <https://arxiv.org/abs/1707.05173>`_, Saunders et al, 2017. **Algorithm: HIRL.**

.. [#] `Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning <https://arxiv.org/abs/1711.06782>`_, Eysenbach et al, 2017. **Algorithm: Leave No Trace.**


11. Imitation Learning and Inverse Reinforcement Learning
=========================================================

.. [#] `Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy <http://www.cs.cmu.edu/~bziebart/publications/thesis-bziebart.pdf>`_, Ziebart, 2010. **Contributions:** Crisp formulation of maximum entropy IRL.

.. [#] `Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization <https://arxiv.org/abs/1603.00448>`_, Finn et al, 2016. **Algorithm: GCL.**

.. [#] `Generative Adversarial Imitation Learning <https://arxiv.org/abs/1606.03476>`_, Ho and Ermon, 2016. **Algorithm: GAIL.**

.. [#] `DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills <https://xbpeng.github.io/projects/DeepMimic/2018_TOG_DeepMimic.pdf>`_, Peng et al, 2018. **Algorithm: DeepMimic.**

.. [#] `Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow <https://arxiv.org/abs/1810.00821>`_, Peng et al, 2018. **Algorithm: VAIL.**

.. [#] `One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL <https://arxiv.org/abs/1810.05017>`_, Le Paine et al, 2018. **Algorithm: MetaMimic.**


12. Reproducibility, Analysis, and Critique
===========================================

.. [#] `Benchmarking Deep Reinforcement Learning for Continuous Control <https://arxiv.org/abs/1604.06778>`_, Duan et al, 2016. **Contribution: rllab.**

.. [#] `Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control <https://arxiv.org/abs/1708.04133>`_, Islam et al, 2017.

.. [#] `Deep Reinforcement Learning that Matters <https://arxiv.org/abs/1709.06560>`_, Henderson et al, 2017. 

.. [#] `Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods <https://arxiv.org/abs/1810.02525>`_, Henderson et al, 2018. 

.. [#] `Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? <https://arxiv.org/abs/1811.02553>`_, Ilyas et al, 2018.

.. [#] `Simple Random Search Provides a Competitive Approach to Reinforcement Learning <https://arxiv.org/abs/1803.07055>`_, Mania et al, 2018.

.. [#] `Benchmarking Model-Based Reinforcement Learning <https://arxiv.org/abs/1907.02057>`_, Wang et al, 2019.

13. Bonus: Classic Papers in RL Theory or Review
================================================

.. [#] `Policy Gradient Methods for Reinforcement Learning with Function Approximation <https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf>`_, Sutton et al, 2000. **Contributions:** Established policy gradient theorem and showed convergence of policy gradient algorithm for arbitrary policy classes. 

.. [#] `An Analysis of Temporal-Difference Learning with Function Approximation <http://web.mit.edu/jnt/www/Papers/J063-97-bvr-td.pdf>`_, Tsitsiklis and Van Roy, 1997. **Contributions:** Variety of convergence results and counter-examples for value-learning methods in RL.

.. [#] `Reinforcement Learning of Motor Skills with Policy Gradients <http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/Neural-Netw-2008-21-682_4867%5b0%5d.pdf>`_, Peters and Schaal, 2008. **Contributions:** Thorough review of policy gradient methods at the time, many of which are still serviceable descriptions of deep RL methods. 

.. [#] `Approximately Optimal Approximate Reinforcement Learning <https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf>`_, Kakade and Langford, 2002. **Contributions:** Early roots for monotonic improvement theory, later leading to theoretical justification for TRPO and other algorithms.

.. [#] `A Natural Policy Gradient <https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf>`_, Kakade, 2002. **Contributions:** Brought natural gradients into RL, later leading to TRPO, ACKTR, and several other methods in deep RL.

.. [#] `Algorithms for Reinforcement Learning <https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf>`_, Szepesvari, 2009. **Contributions:** Unbeatable reference on RL before deep RL, containing foundations and theoretical background.


================================================
FILE: docs/spinningup/rl_intro.rst
================================================
==========================
Part 1: Key Concepts in RL
==========================


.. contents:: Table of Contents
    :depth: 2

Welcome to our introduction to reinforcement learning! Here, we aim to acquaint you with

* the language and notation used to discuss the subject,
* a high-level explanation of what RL algorithms do (although we mostly avoid the question of *how* they do it),
* and a little bit of the core math that underlies the algorithms.

In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forgo that behavior in the future.


What Can RL Do?
===============

RL methods have recently enjoyed a wide variety of successes. For example, they've been used to teach computers to control robots in simulation...

.. raw:: html

    <video autoplay="" src="https://d4mucfpksywv.cloudfront.net/openai-baselines-ppo/knocked-over-stand-up.mp4" loop="" controls="" style="display: block; margin-left: auto; margin-right: auto; margin-bottom:1.5em; width: 100%; max-width: 720px; max-height: 80vh;">
    </video>

...and in the real world...

.. raw:: html

    <div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; max-width: 100%; height: auto;">
        <iframe src="https://www.youtube.com/embed/jwSbzNHGflM?ecver=1" frameborder="0" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe>
    </div>
    <br />


RL has also famously been used to create breakthrough AIs for sophisticated strategy games, most notably `Go`_ and `Dota`_, to teach computers to `play Atari games`_ from raw pixels, and to train simulated robots `to follow human instructions`_.

.. _`Go`: https://deepmind.com/research/alphago/
.. _`Dota`: https://blog.openai.com/openai-five/
.. _`play Atari games`: https://deepmind.com/research/dqn/
.. _`to follow human instructions`: https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/


Key Concepts and Terminology
============================

.. figure:: ../images/rl_diagram_transparent_bg.png
    :align: center
    
    Agent-environment interaction loop.

The main characters of RL are the **agent** and the **environment**. The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.

The agent also perceives a **reward** signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called **return**. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal.

To talk more specifically about what RL does, we need to introduce additional terminology. We need to talk about

* states and observations,
* action spaces,
* policies,
* trajectories,
* different formulations of return,
* the RL optimization problem,
* and value functions.


States and Observations
-----------------------

A **state** :math:`s` is a complete description of the state of the world. There is no information about the world which is hidden from the state. An **observation** :math:`o` is a partial description of a state, which may omit information. 

In deep RL, we almost always represent states and observations by a `real-valued vector, matrix, or higher-order tensor`_. For instance, a visual observation could be represented by the RGB matrix of its pixel values; the state of a robot might be represented by its joint angles and velocities. 

When the agent is able to observe the complete state of the environment, we say that the environment is **fully observed**. When the agent can only see a partial observation, we say that the environment is **partially observed**.

.. admonition:: You Should Know

    Reinforcement learning notation sometimes puts the symbol for state, :math:`s`, in places where it would be technically more appropriate to write the symbol for observation, :math:`o`. Specifically, this happens when talking about how the agent decides an action: we often signal in notation that the action is conditioned on the state, when in practice, the action is conditioned on the observation because the agent does not have access to the state.

    In our guide, we'll follow standard conventions for notation, but it should be clear from context which is meant. If something is unclear, though, please raise an issue! Our goal is to teach, not to confuse.

.. _`real-valued vector, matrix, or higher-order tensor`: https://en.wikipedia.org/wiki/Real_coordinate_space


Action Spaces
-------------

Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the **action space**. Some environments, like Atari and Go, have **discrete action spaces**, where only a finite number of moves are available to the agent. Other environments, such as those where the agent controls a robot in a physical world, have **continuous action spaces**. In continuous spaces, actions are real-valued vectors.

This distinction has some quite-profound consequences for methods in deep RL. Some families of algorithms can only be directly applied in one case, and would have to be substantially reworked for the other. 


Policies
--------

A **policy** is a rule used by an agent to decide what actions to take. It can be deterministic, in which case it is usually denoted by :math:`\mu`:

.. math::

    a_t = \mu(s_t),

or it may be stochastic, in which case it is usually denoted by :math:`\pi`:

.. math::

    a_t \sim \pi(\cdot | s_t).

Because the policy is essentially the agent's brain, it's not uncommon to substitute the word "policy" for "agent", eg saying "The policy is trying to maximize reward."

In deep RL, we deal with **parameterized policies**: policies whose outputs are computable functions that depend on a set of parameters (eg the weights and biases of a neural network) which we can adjust to change the behavior via some optimization algorithm. 

We often denote the parameters of such a policy by :math:`\theta` or :math:`\phi`, and then write this as a subscript on the policy symbol to highlight the connection:

.. math::

    a_t &= \mu_{\theta}(s_t) \\
    a_t &\sim \pi_{\theta}(\cdot | s_t).


Deterministic Policies
^^^^^^^^^^^^^^^^^^^^^^

**Example: Deterministic Policies.** Here is a code snippet for building a simple deterministic policy for a continuous action space in PyTorch, using the ``torch.nn`` package:

.. code-block:: python

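    import torch
    import torch.nn as nn

    # Placeholder sizes for illustration; use your environment's dimensions.
    obs_dim, act_dim = 8, 2

    pi_net = nn.Sequential(
        nn.Linear(obs_dim, 64),
        nn.Tanh(),
        nn.Linear(64, 64),
        nn.Tanh(),
        nn.Linear(64, act_dim),
    )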

This builds a multi-layer perceptron (MLP) network with two hidden layers of size 64 and :math:`\tanh` activation functions. If ``obs`` is a Numpy array containing a batch of observations, ``pi_net`` can be used to obtain a batch of actions as follows:

.. code-block:: python

    obs_tensor = torch.as_tensor(obs, dtype=torch.float32)
    actions = pi_net(obs_tensor)

.. admonition:: You Should Know

    Don't worry about it if this neural network stuff is unfamiliar to you---this tutorial will focus on RL, and not on the neural network side of things. So you can skip this example and come back to it later. But we figured that if you already knew, it could be helpful.


Stochastic Policies
^^^^^^^^^^^^^^^^^^^

The two most common kinds of stochastic policies in deep RL are **categorical policies** and **diagonal Gaussian policies**. 

`Categorical`_ policies can be used in discrete action spaces, while diagonal `Gaussian`_ policies are used in continuous action spaces. 

Two key computations are centrally important for using and training stochastic policies:

* sampling actions from the policy,
* and computing log likelihoods of particular actions, :math:`\log \pi_{\theta}(a|s)`.

In what follows, we'll describe how to do these for both categorical and diagonal Gaussian policies. 

.. admonition:: Categorical Policies

    A categorical policy is like a classifier over discrete actions. You build the neural network for a categorical policy the same way you would for a classifier: the input is the observation, followed by some number of layers (possibly convolutional or densely-connected, depending on the kind of input), and then you have one final linear layer that gives you logits for each action, followed by a `softmax`_ to convert the logits into probabilities. 

    **Sampling.** Given the probabilities for each action, frameworks like PyTorch and Tensorflow have built-in tools for sampling. For example, see the documentation for `Categorical distributions in PyTorch`_, `torch.multinomial`_, `tf.distributions.Categorical`_, or `tf.multinomial`_.

    **Log-Likelihood.** Denote the last layer of probabilities as :math:`P_{\theta}(s)`. It is a vector with as many entries as there are actions, so we can treat the actions as indices for the vector. The log likelihood for an action :math:`a` can then be obtained by indexing into the vector:

    .. math::

        \log \pi_{\theta}(a|s) = \log \left[P_{\theta}(s)\right]_a.
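
For categorical policies, both computations can be sketched in a few lines of PyTorch. This is only an illustration: the ``logits`` tensor below is random data standing in for the output of a policy network, and the variable names are our own rather than part of any Spinning Up API.

.. code-block:: python

    import torch
    from torch.distributions import Categorical

    torch.manual_seed(0)

    # Stand-in for policy network output: 3 observations, 4 discrete actions.
    logits = torch.randn(3, 4)

    # Build the distribution from logits (the softmax is applied internally).
    dist = Categorical(logits=logits)

    # Sampling: one action index per observation.
    a = dist.sample()

    # Log-likelihood from the distribution object...
    logp = dist.log_prob(a)

    # ...and by indexing into the vector of log-probabilities directly.
    log_probs = torch.log_softmax(logits, dim=-1)
    logp_manual = log_probs.gather(1, a.unsqueeze(1)).squeeze(1)

The two log-likelihoods should agree up to floating-point error, which is a handy sanity check when writing your own policy code.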


.. admonition:: Diagonal Gaussian Policies

    A multivariate Gaussian distribution (or multivariate normal distribution, if you prefer) is described by a mean vector, :math:`\mu`, and a covariance matrix, :math:`\Sigma`. A diagonal Gaussian distribution is a special case where the covariance matrix only has entries on the diagonal. As a result, we can represent it by a vector.

    A diagonal Gaussian policy always has a neural network that maps from observations to mean actions, :math:`\mu_{\theta}(s)`. There are two different ways that the covariance matrix is typically represented.

    **The first way:** There is a single vector of log standard deviations, :math:`\log \sigma`, which is **not** a function of state: the :math:`\log \sigma` are standalone parameters. (You Should Know: our implementations of VPG, TRPO, and PPO do it this way.)

    **The second way:** There is a neural network that maps from states to log standard deviations, :math:`\log \sigma_{\theta}(s)`. It may optionally share some layers with the mean network.

    Note that in both cases we output log standard deviations instead of standard deviations directly. This is because log stds are free to take on any values in :math:`(-\infty, \infty)`, while stds must be nonnegative. It's easier to train parameters if you don't have to enforce those kinds of constraints. The standard deviations can be obtained immediately from the log standard deviations by exponentiating them, so we do not lose anything by representing them this way.

    **Sampling.** Given the mean action :math:`\mu_{\theta}(s)` and standard deviation :math:`\sigma_{\theta}(s)`, and a vector :math:`z` of noise from a spherical Gaussian (:math:`z \sim \mathcal{N}(0, I)`), an action sample can be computed with

    .. math::

        a = \mu_{\theta}(s) + \sigma_{\theta}(s) \odot z,

    where :math:`\odot` denotes the elementwise product of two vectors. Standard frameworks have built-in ways to generate the noise vectors, such as `torch.normal`_ or `tf.random_normal`_. Alternatively, you can build distribution objects, eg through `torch.distributions.Normal`_ or `tf.distributions.Normal`_, and use them to generate samples. (The advantage of the latter approach is that those objects can also calculate log-likelihoods for you.)

    **Log-Likelihood.** The log-likelihood of a :math:`k` -dimensional action :math:`a`, for a diagonal Gaussian with mean :math:`\mu = \mu_{\theta}(s)` and standard deviation :math:`\sigma = \sigma_{\theta}(s)`, is given by

    .. math::

        \log \pi_{\theta}(a|s) = -\frac{1}{2}\left(\sum_{i=1}^k \left(\frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i \right) + k \log 2\pi \right).
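
For diagonal Gaussian policies, the sampling rule and the log-likelihood formula above can be checked against each other in PyTorch. The sketch below assumes state-independent log standard deviations (the first way described above) and uses a tensor of zeros as a stand-in for the mean network output; all names and sizes are illustrative.

.. code-block:: python

    import math
    import torch
    from torch.distributions import Normal

    torch.manual_seed(0)

    k = 3                             # action dimension (illustrative)
    mu = torch.zeros(2, k)            # stand-in for mu_theta(s), batch of 2
    log_std = -0.5 * torch.ones(k)    # standalone log std parameters
    std = torch.exp(log_std)

    # Sampling: a = mu + sigma * z, with z ~ N(0, I).
    z = torch.randn_like(mu)
    a = mu + std * z

    # Log-likelihood from the closed-form expression.
    logp_manual = -0.5 * (
        (((a - mu) / std) ** 2 + 2 * log_std).sum(dim=-1)
        + k * math.log(2 * math.pi)
    )

    # Same quantity from a distribution object: sum the per-dimension log-probs.
    logp = Normal(mu, std).log_prob(a).sum(dim=-1)

Summing over the last dimension is what turns the per-coordinate log-probabilities of independent normals into the log-likelihood of the full action vector.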



.. _`Categorical`: https://en.wikipedia.org/wiki/Categorical_distribution
.. _`Gaussian`: https://en.wikipedia.org/wiki/Multivariate_normal_distribution
.. _`softmax`: https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax
.. _`Categorical distributions in PyTorch`: https://pytorch.org/docs/stable/distributions.html#categorical
.. _`torch.multinomial`: https://pytorch.org/docs/stable/torch.html#torch.multinomial
.. _`tf.distributions.Categorical`: https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/distributions/Categorical
.. _`tf.multinomial`: https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/random/multinomial
.. _`torch.normal`: https://pytorch.org/docs/stable/torch.html#torch.normal
.. _`tf.random_normal`: https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/random/normal
.. _`torch.distributions.Normal`: https://pytorch.org/docs/stable/distributions.html#normal
.. _`tf.distributions.Normal`: https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/distributions/Normal

Trajectories
------------

A trajectory :math:`\tau` is a sequence of states and actions in the world,

.. math::

    \tau = (s_0, a_0, s_1, a_1, ...).

The very first state of the world, :math:`s_0`, is randomly sampled from the **start-state distribution**, sometimes denoted by :math:`\rho_0`:

.. math::

    s_0 \sim \rho_0(\cdot).

State transitions (what happens to the world between the state at time :math:`t`, :math:`s_t`, and the state at :math:`t+1`, :math:`s_{t+1}`) are governed by the natural laws of the environment, and depend only on the current state, :math:`s_t`, and the most recent action, :math:`a_t`. They can be either deterministic,

.. math::

    s_{t+1} = f(s_t, a_t)

or stochastic,

.. math::

    s_{t+1} \sim P(\cdot|s_t, a_t).

Actions come from an agent according to its policy.

.. admonition:: You Should Know

    Trajectories are also frequently called **episodes** or **rollouts**.
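
The pieces above (a start-state distribution, a policy, and a transition rule) can be combined into a rollout loop. Here is a minimal sketch on a made-up five-state chain environment; the functions ``rho_0``, ``policy``, ``step``, and ``rollout`` are hypothetical names chosen for this example, not part of any library.

.. code-block:: python

    import random

    random.seed(0)

    def rho_0():
        """Start-state distribution: state 0 or 1 with equal probability."""
        return random.choice([0, 1])

    def policy(s):
        """Stochastic policy pi(.|s): choose action 0 or 1 uniformly."""
        return random.choice([0, 1])

    def step(s, a):
        """Stochastic transition P(.|s, a): move left or right on a chain of
        states 0..4, with a 10% chance of staying put."""
        if random.random() < 0.1:
            return s
        move = 1 if a == 1 else -1
        return min(4, max(0, s + move))

    def rollout(T):
        """Collect tau = (s_0, a_0, s_1, ..., a_{T-1}, s_T)."""
        s = rho_0()
        tau = [s]
        for _ in range(T):
            a = policy(s)
            s = step(s, a)
            tau += [a, s]
        return tau

    tau = rollout(T=5)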


Reward and Return
-----------------

The reward function :math:`R` is critically important in reinforcement learning. It depends on the current state of the world, the action just taken, and the next state of the world:

.. math::

    r_t = R(s_t, a_t, s_{t+1})

although frequently this is simplified to just a dependence on the current state, :math:`r_t = R(s_t)`, or state-action pair :math:`r_t = R(s_t,a_t)`. 

The goal of the agent is to maximize some notion of cumulative reward over a trajectory, but this actually can mean a few things. We'll notate all of these cases with :math:`R(\tau)`, and it will either be clear from context which case we mean, or it won't matter (because the same equations will apply to all cases).

One kind of return is the **finite-horizon undiscounted return**, which is just the sum of rewards obtained in a fixed window of steps:

.. math::

    R(\tau) = \sum_{t=0}^T r_t.

Another kind of return is the **infinite-horizon discounted return**, which is the sum of all rewards *ever* obtained by the agent, but discounted by how far off in the future they're obtained. This formulation of return includes a discount factor :math:`\gamma \in (0,1)`:

.. math::

    R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t.


Why would we ever want a discount factor, though? Don't we just want to get *all* rewards? We do, but the discount factor is both intuitively appealing and mathematically convenient. On an intuitive level: cash now is better than cash later. Mathematically: an infinite-horizon sum of rewards `may not converge`_ to a finite value, and is hard to deal with in equations. But with a discount factor and under reasonable conditions, the infinite sum converges.
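
To make the two formulations concrete, here is a small sketch that computes both from a list of rewards. The function names are our own, and since a program can only sum finitely many terms, the discounted version here is a truncation of the infinite sum.

.. code-block:: python

    def undiscounted_return(rewards):
        """Finite-horizon undiscounted return: plain sum of rewards."""
        return sum(rewards)

    def discounted_return(rewards, gamma):
        """Discounted return, truncated to the rewards provided."""
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    rewards = [1.0, 1.0, 1.0, 1.0]

    R_finite = undiscounted_return(rewards)         # 1 + 1 + 1 + 1
    R_disc = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25 + 0.125

    # With a constant reward of 1 forever, the discounted sum is a geometric
    # series that converges to 1 / (1 - gamma); the undiscounted sum diverges.
    limit = 1.0 / (1.0 - 0.5)

With :math:`\gamma = 0.5` the truncated discounted return already sits close to its limit of 2 after only four steps, which illustrates why discounting keeps infinite-horizon sums finite.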

.. admonition:: You Should Know

    While the line between these two formulations of return is quite stark in RL formalism, deep RL practice tends to blur the line a fair bit: for instance, we frequently set up algorithms to optimize the undiscounted return, but use discount factors in estimating **value functions**.

.. _`may not converge`: https://en.wikipedia.org/wiki/Convergent_series

The RL Problem
--------------

Whatever the choice of return measure (whether infinite-horizon discounted, or finite-horizon undiscounted), and whatever the choice of policy, the goal in RL is to select a policy which maximizes **expected return** when the agent acts according to it.

To talk about expected return, we first have to talk about probability distributions over trajectories. 

Let's suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a :math:`T` -step trajectory is:

.. math::

    P(\tau|\pi) = \rho_0 (s_0) \prod_{t=0}^{T-1} P(s_{t+1} | s_t, a_t) \pi(a_t | s_t).


The expected return (for whichever measure), denoted by :math:`J(\pi)`, is then:

.. math::

    J(\pi) = \int_{\tau} P(\tau|\pi) R(\tau) = \underE{\tau\sim \pi}{R(\tau)}.


The central optimization problem in RL can then be expressed by

.. math::

    \pi^* = \arg \max_{\pi} J(\pi),

with :math:`\pi^*` being the **optimal policy**. 
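
For a tiny problem, the expected return can be computed exactly by enumerating every trajectory and weighting its return by its probability. The two-state, two-action environment below is invented purely for illustration (its numbers carry no special meaning), and the return used is the undiscounted sum of the :math:`T` rewards along the trajectory.

.. code-block:: python

    import itertools

    states, actions = [0, 1], [0, 1]
    T = 2  # number of steps in each trajectory

    rho_0 = {0: 0.5, 1: 0.5}                  # start-state distribution
    P = {                                     # P[(s, a)][s'] = P(s' | s, a)
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.2, 1: 0.8},
        (1, 0): {0: 0.7, 1: 0.3},
        (1, 1): {0: 0.1, 1: 0.9},
    }
    pi = {                                    # pi[s][a] = pi(a | s)
        0: {0: 0.5, 1: 0.5},
        1: {0: 0.25, 1: 0.75},
    }

    def reward(s, a, s_next):
        """Illustrative reward: 1 for landing in state 1, else 0."""
        return 1.0 if s_next == 1 else 0.0

    def traj_prob(ss, aa):
        """P(tau | pi) = rho_0(s_0) * prod_t P(s_{t+1}|s_t,a_t) * pi(a_t|s_t)."""
        p = rho_0[ss[0]]
        for t in range(T):
            p *= P[(ss[t], aa[t])][ss[t + 1]] * pi[ss[t]][aa[t]]
        return p

    def traj_return(ss, aa):
        return sum(reward(ss[t], aa[t], ss[t + 1]) for t in range(T))

    total_prob, J = 0.0, 0.0
    for ss in itertools.product(states, repeat=T + 1):
        for aa in itertools.product(actions, repeat=T):
            p = traj_prob(ss, aa)
            total_prob += p
            J += p * traj_return(ss, aa)

Enumeration like this is only feasible for toy problems, since the number of trajectories grows exponentially with :math:`T`. In practice, :math:`J(\pi)` is estimated by averaging returns over sampled trajectories.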


Value Functions
---------------

It's often useful to know the **value** of a state, or state-action pair. By value, we mean the expected return if you start in that state or state-action pair, and then act according to a particular policy forever after. **Value functions** are used, one way or another, in almost every RL algorithm.


There are four main functions of note here.

1. The **On-Policy Value Function**, :math:`V^{\pi}(s)`, which gives the expected return if you start in state :math:`s` and always act according to policy :math:`\pi`:

    .. math::
        
        V^{\pi}(s) = \underE{\tau \sim \pi}{R(\tau)\left| s_0 = s\right.}

2. The **On-Policy Action-Value Function**, :math:`Q^{\pi}(s,a)`, which gives the expected return if you start in state :math:`s`, take an arbitrary action :math:`a` (which may not have come from the policy), and then forever after act according to policy :math:`\pi`:

    .. math::
        
        Q^{\pi}(s,a) = \underE{\tau \sim \pi}{R(\tau)\left| s_0 = s, a_0 = a\right.}


3. The **Optimal Value Function**, :math:`V^*(s)`, which gives the expected return if you start in state :math:`s` and always act according to the *optimal* policy in the environment:

    .. math::

        V^*(s) = \max_{\pi} \underE{\tau \sim \pi}{R(\tau)\left| s_0 = s\right.}

4. The **Optimal Action-Value Function**, :math:`Q^*(s,a)`, which gives the expected return if you start in state :math:`s`, take an arbitrary action :math:`a`, and then forever after act according to the *optimal* policy in the environment:

    .. math::

        Q^*(s,a) = \max_{\pi} \underE{\tau \sim \pi}{R(\tau)\left| s_0 = s, a_0 = a\right.}


.. admonition:: You Should Know

    When we talk about value functions, if we do not make reference to time-dependence, we only mean expected **infinite-horizon discounted return**. Value functions for finite-horizon undiscounted return would need to accept time as an argument. Can you think about why? Hint: what happens when time's up?

.. admonition:: You Should Know

    There are two key connections between the value function and the action-value function that come up pretty often:

    .. math::

        V^{\pi}(s) = \underE{a\sim \pi}{Q^{\pi}(s,a)},

    and

    .. math::

        V^*(s) = \max_a Q^* (s,a).

    These relations follow pretty directly from the definitions just given: can you prove them?

The Optimal Q-Function and the Optimal Action
---------------------------------------------

There is an important connection between the optimal action-value function :math:`Q^*(s,a)` and the action selected by the optimal policy. By definition, :math:`Q^*(s,a)` gives the expected return for starting in state :math:`s`, taking (arbitrary) action :math:`a`, and then acting according to the optimal policy forever after. 

The optimal policy in :math:`s` will select whichever action maximizes the expected return from starting in :math:`s`. As a result, if we have :math:`Q^*`, we can directly obtain the optimal action, :math:`a^*(s)`, via

.. math::

    a^*(s) = \arg \max_a Q^* (s,a).

Note: there may be multiple actions which maximize :math:`Q^*(s,a)`, in which case, all of them are optimal, and the optimal policy may randomly select any of them. But there is always an optimal policy which deterministically selects an action.
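
For a discrete action space, reading the optimal action off a table of :math:`Q^*` values is a one-liner. In the sketch below, the table ``Q_star`` contains made-up numbers and the helper names are hypothetical; the second helper returns every maximizing action to show the tie case mentioned in the note above.

.. code-block:: python

    # Hypothetical optimal action-values for two states and two actions.
    Q_star = {
        "s0": {"left": 1.0, "right": 2.5},
        "s1": {"left": 0.3, "right": 0.3},
    }

    def optimal_action(Q, s):
        """Deterministically pick one maximizing action (first one on ties)."""
        return max(Q[s], key=Q[s].get)

    def optimal_actions(Q, s, tol=1e-9):
        """Return all actions whose value is within tol of the maximum."""
        best = max(Q[s].values())
        return [a for a, q in Q[s].items() if abs(q - best) <= tol]

    a0 = optimal_action(Q_star, "s0")
    ties = optimal_actions(Q_star, "s1")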


Bellman Equations
-----------------

All four of the value functions obey special self-consistency equations called **Bellman equations**. The basic idea behind the Bellman equations is this:

    The value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.


The Bellman equations for the on-policy value functions are

.. math::
    :nowrap:

    \begin{align*}
    V^{\pi}(s) &= \underE{a \sim \pi \\ s'\sim P}{r(s,a) + \gamma V^{\pi}(s')}, \\
    Q^{\pi}(s,a) &= \underE{s'\sim P}{r(s,a) + \gamma \underE{a'\sim \pi}{Q^{\pi}(s',a')}},
    \end{align*}

where :math:`s' \sim P` is shorthand for :math:`s' \sim P(\cdot |s,a)`, indicating that the next state :math:`s'` is sampled from the environment's transition rules; :math:`a \sim \pi` is shorthand for :math:`a \sim \pi(\cdot|s)`; and :math:`a' \sim \pi` is shorthand for :math:`a' \sim \pi(\cdot|s')`. 

The Bellman equations for the optimal value functions are

.. math::
    :nowrap:

    \begin{align*}
    V^*(s) &= \max_a \underE{s'\sim P}{r(s,a) + \gamma V^*(s')}, \\
    Q^*(s,a) &= \underE{s'\sim P}{r(s,a) + \gamma \max_{a'} Q^*(s',a')}.
    \end{align*}

The crucial difference between the Bellman equations for the on-policy value functions and those for the optimal value functions is the absence or presence of the :math:`\max` over actions. Its inclusion reflects the fact that whenever the agent gets to choose its action, in order to act optimally, it has to pick whichever action leads to the highest value.

.. admonition:: You Should Know

    The term "Bellman backup" comes up quite frequently in the RL literature. The Bellman backup for a state, or state-action pair, is the right-hand side of the Bellman equation: the reward-plus-next-value. 
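
One way to see the Bellman equations at work is to apply the backup repeatedly to a table of values until it stops changing. The sketch below does this on an invented two-state, two-action environment, once with an expectation over a fixed policy (policy evaluation) and once with a max over actions (value iteration). The environment numbers and function names are assumptions made for this example.

.. code-block:: python

    gamma = 0.9
    states, actions = [0, 1], [0, 1]

    P = {                                   # P[(s, a)][s'] = P(s' | s, a)
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.2, 1: 0.8},
        (1, 0): {0: 0.7, 1: 0.3},
        (1, 1): {0: 0.1, 1: 0.9},
    }
    r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}   # r(s, a)
    pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}            # pi[s][a]

    def q_backup(V, s, a):
        """Bellman backup for (s, a): reward plus discounted next value."""
        return r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

    def evaluate_policy(n_iters=500):
        """Iterate the on-policy backup: expectation over a ~ pi."""
        V = {s: 0.0 for s in states}
        for _ in range(n_iters):
            V = {s: sum(pi[s][a] * q_backup(V, s, a) for a in actions)
                 for s in states}
        return V

    def value_iteration(n_iters=500):
        """Iterate the optimal backup: max over actions."""
        V = {s: 0.0 for s in states}
        for _ in range(n_iters):
            V = {s: max(q_backup(V, s, a) for a in actions) for s in states}
        return V

    V_pi = evaluate_policy()
    V_star = value_iteration()
    Q_star = {(s, a): q_backup(V_star, s, a) for s in states for a in actions}

After convergence, ``V_pi`` and ``V_star`` each satisfy their own Bellman equation, and ``V_star`` is at least as large as ``V_pi`` in every state.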


Advantage Functions
-------------------

Sometimes in RL, we don't need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative **advantage** of that action. We make this concept precise with the **advantage function.**

The advantage function :math:`A^{\pi}(s,a)` corresponding to a policy :math:`\pi` describes how much better it is to take a specific action :math:`a` in state :math:`s`, over randomly selecting an action according to :math:`\pi(\cdot|s)`, assuming you act according to :math:`\pi` forever after. Mathematically, the advantage function is defined by

.. math::

    A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).

.. admonition:: You Should Know

    We'll discuss this more later, but the advantage function is crucially important to policy gradient methods.
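
As a small worked example of the definition, take a single state with two actions. The numbers below are made up; the point is that the advantage is positive for actions that beat the policy's average and negative for those that do not, and that its expectation under the policy is zero.

.. code-block:: python

    # Hypothetical policy probabilities and action-values at one state s.
    pi_s = {"left": 0.2, "right": 0.8}
    Q_s = {"left": 1.0, "right": 3.0}

    # V^pi(s) is the expectation of Q^pi(s, a) over a ~ pi(.|s).
    V_s = sum(pi_s[a] * Q_s[a] for a in pi_s)

    # A^pi(s, a) = Q^pi(s, a) - V^pi(s).
    A_s = {a: Q_s[a] - V_s for a in Q_s}

    # The policy-weighted average advantage is zero by construction.
    mean_adv = sum(pi_s[a] * A_s[a] for a in pi_s)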



(Optional) Formalism
====================

So far, we've discussed the agent's environment in an informal way, but if you try to go digging through the literature, you're likely to run into the standard mathematical formalism for this setting: **Markov Decision Processes** (MDPs). An MDP is a 5-tuple, :math:`\langle S, A, R, P, \rho_0 \rangle`, where

* :math:`S` is the set of all valid states,
* :math:`A` is the set of all valid actions,
* :math:`R : S \times A \times S \to \mathbb{R}` is the reward function, with :math:`r_t = R(s_t, a_t, s_{t+1})`,
* :math:`P : S \times A \to \mathcal{P}(S)` is the transition probability function, with :math:`P(s'|s,a)` being the probability of transitioning into state :math:`s'` if you start in state :math:`s` and take action :math:`a`,
* and :math:`\rho_0` is the starting state distribution.

The name Markov Decision Process refers to the fact that the system obeys the `Markov property`_: transitions only depend on the most recent state and action, and no prior history.  
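
For small finite problems, the 5-tuple maps naturally onto a plain data structure. The following is a minimal sketch, assuming integer-labeled states and actions and dictionary-based probability tables; the ``MDP`` class and its ``validate`` method are hypothetical and only meant to mirror the definition above.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    @dataclass(frozen=True)
    class MDP:
        S: List[int]                                  # valid states
        A: List[int]                                  # valid actions
        R: Callable[[int, int, int], float]           # R(s, a, s')
        P: Dict[Tuple[int, int], Dict[int, float]]    # P[(s, a)][s']
        rho_0: Dict[int, float]                       # start-state distribution

        def validate(self, tol: float = 1e-9) -> bool:
            """Check that rho_0 and every P(.|s, a) sum to one."""
            ok_start = abs(sum(self.rho_0.values()) - 1.0) <= tol
            ok_trans = all(
                abs(sum(self.P[(s, a)].values()) - 1.0) <= tol
                for s in self.S for a in self.A
            )
            return ok_start and ok_trans

    mdp = MDP(
        S=[0, 1],
        A=[0, 1],
        R=lambda s, a, s_next: 1.0 if s_next == 1 else 0.0,
        P={
            (0, 0): {0: 0.9, 1: 0.1},
            (0, 1): {0: 0.2, 1: 0.8},
            (1, 0): {0: 0.7, 1: 0.3},
            (1, 1): {0: 0.1, 1: 0.9},
        },
        rho_0={0: 1.0},
    )

Because ``P`` is indexed only by the current state and action, the Markov property is built into the representation: there is nowhere to store a dependence on earlier history.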




.. _`Markov property`: https://en.wikipedia.org/wiki/Markov_property

================================================
FILE: docs/spinningup/rl_intro2.rst
================================================
==============================
Part 2: Kinds of RL Algorithms
==============================

.. contents:: Table of Contents
    :depth: 2

Now that we've gone through the basics of RL terminology and notation, we can cover a little bit of the richer material: the landscape of algorithms in modern RL, and a description of the kinds of trade-offs that go into algorithm design.

A Taxonomy of RL Algorithms
===========================

.. figure:: ../images/rl_algorithms_9_15.svg
    :align: center

    A non-exhaustive, but useful taxonomy of algorithms in modern RL. `Citations below.`_

We'll start this section with a disclaimer: it's really quite hard to draw an accurate, all-encompassing taxonomy of algorithms in the modern RL space, because the modularity of algorithms is not well-represented by a tree structure. Also, to make something that fits on a page and is reasonably digestible in an introductory essay, we have to omit quite a bit of more advanced material (exploration, transfer learning, meta learning, etc). That said, our goals here are

* to highlight the most foundational design choices in deep RL algorithms about what to learn and how to learn it,
* to expose the trade-offs in those choices,
* and to place a few prominent modern algorithms into context with respect to those choices.

Model-Free vs Model-Based RL
----------------------------

One of the most important branching points in an RL algorithm is the question of **whether the agent has access to (or learns) a model of the environment**. By a model of the environment, we mean a function which predicts state transitions and rewards. 

The main upside to having a model is that **it allows the agent to plan** by thinking ahead, seeing what would happen for a range of possible choices, and explicitly deciding between its options. Agents can then distill the results from planning ahead into a learned policy. A particularly famous example of this approach is `AlphaZero`_. When this works, it can result in a substantial improvement in sample efficiency over methods that don't have a model.

The main downside is that **a ground-truth model of the environment is usually not available to the agent.** If an agent wants to use a model in this case, it has to learn the model purely from experience, which creates several challenges. The biggest challenge is that bias in the model can be exploited by the agent, resulting in an agent which performs well with respect to the learned model, but behaves sub-optimally (or super terribly) in the real environment. Model-learning is fundamentally hard, so even intense effort---being willing to throw lots of time and compute at it---can fail to pay off. 

Algorithms which use a model are called **model-based** methods, and those that don't are called **model-free**. While model-free methods forgo the potential gains in sample efficiency from using a model, they tend to be easier to implement and tune. As of the time of writing this introduction (September 2018), model-free methods are more popular and have been more extensively developed and tested than model-based methods.


What to Learn
-------------

Another critical branching point in an RL algorithm is the question of **what to learn.** The list of usual suspects includes

* policies, either stochastic or deterministic,
* action-value functions (Q-functions),
* value functions,
* and/or environment models.



What to Learn in Model-Free RL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are two main approaches to representing and training agents with model-free RL:

**Policy Optimization.** Methods in this family represent a policy explicitly as :math:`\pi_{\theta}(a|s)`. They optimize the parameters :math:`\theta` either directly by gradient ascent on the performance objective :math:`J(\pi_{\theta})`,  or indirectly, by maximizing local approximations of :math:`J(\pi_{\theta})`. This optimization is almost always performed **on-policy**, which means that each update only uses data collected while acting according to the most recent version of the policy. Policy optimization also usually involves learning an approximator :math:`V_{\phi}(s)` for the on-policy value function :math:`V^{\pi}(s)`, which gets used in figuring out how to update the policy.

A couple of examples of policy optimization methods are:

* `A2C / A3C`_, which performs gradient ascent to directly maximize performance,
* and `PPO`_, whose updates indirectly maximize performance, by instead maximizing a *surrogate objective* function which gives a conservative estimate for how much :math:`J(\pi_{\theta})` will change as a result of the update. 
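As an illustration of what a surrogate objective can look like, here is a minimal sketch of the clipped objective from the PPO paper, written in plain Python over a batch of samples. The function name, the argument names, and the default ``clip_ratio=0.2`` are assumptions made for this sketch, and a real implementation would use an autodiff framework so the objective can be differentiated with respect to :math:`\theta`.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_ratio=0.2):
    """Average clipped surrogate over a batch of (state, action) samples.

    logp_new / logp_old are log-probabilities of the sampled actions under
    the candidate policy and the data-collecting policy. clip_ratio is an
    illustrative default, not a value taken from this document.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
        # Take the more pessimistic of the clipped and unclipped estimates.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# With identical policies the ratio is 1 and the objective is the mean advantage.
same_policy = ppo_clip_objective([0.0, 0.0], [0.0, 0.0], [2.0, -1.0])
# A large ratio on a positive advantage is capped at (1 + clip_ratio) * advantage.
capped = ppo_clip_objective([1.0], [0.0], [1.0])
```

Taking the minimum is what makes the estimate conservative: the objective never credits the update with more improvement than the clipped ratio allows, but it still registers the full penalty when the update makes things worse.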

**Q-Learning.** Methods in this family learn an approximator :math:`Q_{\theta}(s,a)` for the optimal action-value function, :math:`Q^*(s,a)`. Typically they use an objective function based on the `Bellman equation`_. This optimization is almost always performed **off-policy**, which means that each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was obtained. The corresponding policy is obtained via the connection between :math:`Q^*` and :math:`\pi^*`: the actions taken by the Q-learning agent are given by 

.. math::
    
    a(s) = \arg \max_a Q_{\theta}(s,a).
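
For intuition, here is a minimal tabular sketch of both pieces: the greedy action rule above, and an update that moves :math:`Q(s,a)` toward a Bellman target. It uses a dictionary in place of a neural network, and the transitions, learning rate, and discount are made up for illustration; it is not the DQN algorithm.

```python
def greedy_action(Q, s, actions):
    """a(s) = argmax_a Q(s, a) for a tabular Q stored as a dict."""
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, done, actions, gamma=0.99, lr=0.5):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    bootstrap = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * bootstrap
    Q[(s, a)] += lr * (target - Q[(s, a)])

actions = [0, 1]
Q = {(s, a): 0.0 for s in range(2) for a in actions}

# Made-up stored transitions (s, a, r, s', done). Off-policy means the update
# can reuse them regardless of which behaviour policy produced them.
replay = [(0, 1, 1.0, 1, True), (0, 0, 0.0, 0, False)]
for _ in range(100):
    for transition in replay:
        q_learning_update(Q, *transition, actions=actions)

best = greedy_action(Q, 0, actions)
```

In this toy problem action 1 ends the episode with reward 1, while action 0 only earns a discounted chance to take action 1 later, so the greedy action in state 0 ends up being action 1.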

Examples of Q-learning methods include

* `DQN`_, a classic which did much to launch the field of deep RL,
* and `C51`_, a variant that learns a distribution over return whose expectation is :math:`Q^*`.

**Trade-offs Between Policy Optimization and Q-Learning.** The primary strength of policy optimization methods is that they are principled, in the sense that *you directly optimize for the thing you want.* This tends to make them stable and reliable. By contrast, Q-learning methods only *indirectly* optimize for agent performance, by training :math:`Q_{\theta}` to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1]_ However, Q-learning methods have the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.

**Interpolating Between Policy Optimization and Q-Learning.** Serendipitously, policy optimization and Q-learning are not incompatible (and under some circumstances, it turns out, `equivalent`_), and there exists a range of algorithms that live in between the two extremes. Algorithms that live on this spectrum are able to carefully trade off between the strengths and weaknesses of either side. Examples include

* `DDPG`_, an algorithm which concurrently learns a deterministic policy and a Q-function by using each to improve the other,
* and `SAC`_, a variant which uses stochastic policies, entropy regularization, and a few other tricks to stabilize learning and score higher than DDPG on standard benchmarks.



.. [1] For more information about how and why Q-learning methods can fail, see 1) this classic paper by `Tsitsiklis and van Roy`_, 2) the (much more recent) `review by Szepesvari`_ (in section 4.3.2), and 3) chapter 11 of `Sutton and Barto`_, especially section 11.3 (on "the deadly triad" of function approximation, bootstrapping, and off-policy data, together causing instability in value-learning algorithms).


.. _`Bellman equation`: ../spinningup/rl_intro.html#bellman-equations
.. _`Tsitsiklis and van Roy`: http://web.mit.edu/jnt/www/Papers/J063-97-bvr-td.pdf
.. _`review by Szepesvari`: https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf
.. _`Sutton and Barto`: https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
.. _`equivalent`: https://arxiv.org/abs/1704.06440

What to Learn in Model-Based RL
-------------------------------

Unlike in model-free RL, methods for model-based RL do not fall into a small number of easy-to-define clusters: there are many orthogonal ways of using models. We'll give a few examples, but the list is far from exhaustive. In each case, the model may either be given or learned.

**Background: Pure Planning.** The most basic approach *never* explicitly represents the policy, and instead, uses pure planning techniques like `model-predictive control`_ (MPC) to select actions. In MPC, each time the agent observes the environment, it computes a plan which is optimal with respect to the model, where the plan describes all actions to take over some fixed window of time after the present. (Future rewards beyond the horizon may be considered by the planning algorithm through the use of a learned value function.) The agent then executes the first action of the plan, and immediately discards the rest of it. It computes a new plan each time it prepares to interact with the environment, to avoid using an action from a plan with a shorter-than-desired planning horizon.

* The `MBMF`_ work explores MPC with learned environment models on some standard benchmark tasks for deep RL.
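The replanning loop described above can be sketched in a few lines. This is a toy example under strong assumptions: the model is a hand-written 1-D system that also serves as the real environment, and the planner enumerates every action sequence, which is only feasible for tiny discrete action sets and short horizons (practical MPC uses an approximate optimizer instead). All names here are hypothetical.

```python
import itertools

ACTIONS = [-1.0, 0.0, 1.0]

def model(s, a):
    """Hypothetical known model: s is a position, a is a displacement.
    Returns (next_state, reward); reward penalises distance from the origin."""
    s_next = s + a
    return s_next, -abs(s_next)

def plan(s, horizon):
    """Return the action sequence with the highest model-predicted return."""
    best_return, best_plan = float("-inf"), None
    for candidate in itertools.product(ACTIONS, repeat=horizon):
        sim_s, total = s, 0.0
        for a in candidate:
            sim_s, r = model(sim_s, a)
            total += r
        if total > best_return:
            best_return, best_plan = total, candidate
    return best_plan

def mpc_episode(s0, steps, horizon=3):
    """Replan at every step, execute the first action, discard the rest."""
    s, visited = s0, [s0]
    for _ in range(steps):
        a = plan(s, horizon)[0]
        s, _ = model(s, a)  # here the real environment equals the model
        visited.append(s)
    return visited

trajectory = mpc_episode(3.0, steps=5)
```

Because a fresh plan is computed from every new state, each executed action always comes from a plan that looks a full ``horizon`` steps ahead.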

**Expert Iteration.** A straightforward follow-on to pure planning involves using and learning an explicit representation of the policy, :math:`\pi_{\theta}(a|s)`. The agent uses a planning algorithm (like Monte Carlo Tree Search) in the model, generating candidate actions for the plan by sampling from its current policy. The planning algorithm produces an action which is better than what the policy alone would have produced, hence it is an "expert" relative to the policy. The policy is afterwards updated to produce an action more like the planning algorithm's output.

* The `ExIt`_ algorithm uses this approach to train deep neural networks to play Hex.
* `AlphaZero`_ is another example of this approach.

**Data Augmentation for Model-Free Methods.** Use a model-free RL algorithm to train a policy or Q-function, but either 1) augment real experiences with fictitious ones in updating the agent, or 2) use *only* fictitious experience for updating the agent.

* See `MBVE`_ for an example of augmenting real experiences with fictitious ones.
* See `World Models`_ for an example of using purely fictitious experience to train the agent, which they call "training in the dream."
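Both options can be sketched with one small helper that builds the batch handed to the model-free learner. This is a generic illustration under our own assumptions, not the specific procedure of either work cited above: ``learned_model`` is a hand-written stand-in for a learned dynamics and reward model, and transitions are ``(s, a, r, s_next)`` tuples.

```python
import random

def learned_model(s, a):
    """Stand-in for a learned model (hypothetical, hand-written here):
    a 5-cell corridor with reward 1 for reaching the last cell."""
    s_next = max(0, min(4, s + a))
    return s_next, (1.0 if s_next == 4 else 0.0)

def imagined_transitions(real_batch, actions, rng):
    """Branch from real states: pick an action, ask the model for the outcome."""
    fictitious = []
    for (s, _, _, _) in real_batch:
        a = rng.choice(actions)
        s_next, r = learned_model(s, a)
        fictitious.append((s, a, r, s_next))
    return fictitious

def training_batch(real_batch, actions, rng, only_fictitious=False):
    """Option 1: real plus fictitious experience. Option 2: fictitious only."""
    fictitious = imagined_transitions(real_batch, actions, rng)
    return fictitious if only_fictitious else real_batch + fictitious

real = [(0, 1, 0.0, 1), (3, 1, 1.0, 4)]
mixed = training_batch(real, [-1, 1], random.Random(0))
dream = training_batch(real, [-1, 1], random.Random(0), only_fictitious=True)
```

Either batch can then be passed to an ordinary model-free update. The learner does not need to know which transitions were imagined, which is also why errors in the model can leak into the policy.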

**Embedding Planning Loops into Policies.** Another approach embeds the planning procedure directly into a policy as a subroutine---so that complete plans become side information for the policy---while training the output of the policy with any standard model-free algorithm. The key concept is that in this framework, the policy can learn to choose how and when to use the plans. This makes model bias less of a problem, because if the model is bad for planning in some states, the policy can simply learn to ignore it.

* See `I2A`_ for an example of agents being endowed with this style of imagination.

.. _`model-predictive control`: https://en.wikipedia.org/wiki/Model_predictive_control
.. _`ExIt`: https://arxiv.org/abs/1705.08439
.. _`World Models`: https://worldmodels.github.io/



Links to Algorithms in Taxonomy
===============================

.. _`Citations below.`: 

.. [#] `A2C / A3C <https://arxiv.org/abs/1602.01783>`_ (Asynchronous Advantage Actor-Critic): Mnih et al, 2016
.. [#] `PPO <https://arxiv.org/abs/1707.06347>`_ (Proximal Policy Opt

SYMBOL INDEX (356 symbols across 54 files)

FILE: docs/conf.py
  class Mock (line 31) | class Mock(MagicMock):
    method __getattr__ (line 33) | def __getattr__(cls, name):
  function setup (line 240) | def setup(app):

FILE: spinup/algos/pytorch/ddpg/core.py
  function combined_shape (line 8) | def combined_shape(length, shape=None):
  function mlp (line 13) | def mlp(sizes, activation, output_activation=nn.Identity):
  function count_vars (line 20) | def count_vars(module):
  class MLPActor (line 23) | class MLPActor(nn.Module):
    method __init__ (line 25) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation, act_lim...
    method forward (line 31) | def forward(self, obs):
  class MLPQFunction (line 35) | class MLPQFunction(nn.Module):
    method __init__ (line 37) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
    method forward (line 41) | def forward(self, obs, act):
  class MLPActorCritic (line 45) | class MLPActorCritic(nn.Module):
    method __init__ (line 47) | def __init__(self, observation_space, action_space, hidden_sizes=(256,...
    method act (line 59) | def act(self, obs):

FILE: spinup/algos/pytorch/ddpg/ddpg.py
  class ReplayBuffer (line 11) | class ReplayBuffer:
    method __init__ (line 16) | def __init__(self, obs_dim, act_dim, size):
    method store (line 24) | def store(self, obs, act, rew, next_obs, done):
    method sample_batch (line 33) | def sample_batch(self, batch_size=32):
  function ddpg (line 44) | def ddpg(env_fn, actor_critic=core.MLPActorCritic, ac_kwargs=dict(), see...

FILE: spinup/algos/pytorch/ppo/core.py
  function combined_shape (line 11) | def combined_shape(length, shape=None):
  function mlp (line 17) | def mlp(sizes, activation, output_activation=nn.Identity):
  function count_vars (line 25) | def count_vars(module):
  function discount_cumsum (line 29) | def discount_cumsum(x, discount):
  class Actor (line 47) | class Actor(nn.Module):
    method _distribution (line 49) | def _distribution(self, obs):
    method _log_prob_from_distribution (line 52) | def _log_prob_from_distribution(self, pi, act):
    method forward (line 55) | def forward(self, obs, act=None):
  class MLPCategoricalActor (line 66) | class MLPCategoricalActor(Actor):
    method __init__ (line 68) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
    method _distribution (line 72) | def _distribution(self, obs):
    method _log_prob_from_distribution (line 76) | def _log_prob_from_distribution(self, pi, act):
  class MLPGaussianActor (line 80) | class MLPGaussianActor(Actor):
    method __init__ (line 82) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
    method _distribution (line 88) | def _distribution(self, obs):
    method _log_prob_from_distribution (line 93) | def _log_prob_from_distribution(self, pi, act):
  class MLPCritic (line 97) | class MLPCritic(nn.Module):
    method __init__ (line 99) | def __init__(self, obs_dim, hidden_sizes, activation):
    method forward (line 103) | def forward(self, obs):
  class MLPActorCritic (line 108) | class MLPActorCritic(nn.Module):
    method __init__ (line 111) | def __init__(self, observation_space, action_space,
    method step (line 126) | def step(self, obs):
    method act (line 134) | def act(self, obs):

FILE: spinup/algos/pytorch/ppo/ppo.py
  class PPOBuffer (line 12) | class PPOBuffer:
    method __init__ (line 19) | def __init__(self, obs_dim, act_dim, size, gamma=0.99, lam=0.95):
    method store (line 30) | def store(self, obs, act, rew, val, logp):
    method finish_path (line 42) | def finish_path(self, last_val=0):
    method get (line 71) | def get(self):
  function ppo (line 88) | def ppo(env_fn, actor_critic=core.MLPActorCritic, ac_kwargs=dict(), seed=0,

FILE: spinup/algos/pytorch/sac/core.py
  function combined_shape (line 10) | def combined_shape(length, shape=None):
  function mlp (line 15) | def mlp(sizes, activation, output_activation=nn.Identity):
  function count_vars (line 22) | def count_vars(module):
  class SquashedGaussianMLPActor (line 29) | class SquashedGaussianMLPActor(nn.Module):
    method __init__ (line 31) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation, act_lim...
    method forward (line 38) | def forward(self, obs, deterministic=False, with_logprob=True):
  class MLPQFunction (line 70) | class MLPQFunction(nn.Module):
    method __init__ (line 72) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
    method forward (line 76) | def forward(self, obs, act):
  class MLPActorCritic (line 80) | class MLPActorCritic(nn.Module):
    method __init__ (line 82) | def __init__(self, observation_space, action_space, hidden_sizes=(256,...
    method act (line 95) | def act(self, obs, deterministic=False):

FILE: spinup/algos/pytorch/sac/sac.py
  class ReplayBuffer (line 12) | class ReplayBuffer:
    method __init__ (line 17) | def __init__(self, obs_dim, act_dim, size):
    method store (line 25) | def store(self, obs, act, rew, next_obs, done):
    method sample_batch (line 34) | def sample_batch(self, batch_size=32):
  function sac (line 45) | def sac(env_fn, actor_critic=core.MLPActorCritic, ac_kwargs=dict(), seed=0,

FILE: spinup/algos/pytorch/td3/core.py
  function combined_shape (line 8) | def combined_shape(length, shape=None):
  function mlp (line 13) | def mlp(sizes, activation, output_activation=nn.Identity):
  function count_vars (line 20) | def count_vars(module):
  class MLPActor (line 23) | class MLPActor(nn.Module):
    method __init__ (line 25) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation, act_lim...
    method forward (line 31) | def forward(self, obs):
  class MLPQFunction (line 35) | class MLPQFunction(nn.Module):
    method __init__ (line 37) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
    method forward (line 41) | def forward(self, obs, act):
  class MLPActorCritic (line 45) | class MLPActorCritic(nn.Module):
    method __init__ (line 47) | def __init__(self, observation_space, action_space, hidden_sizes=(256,...
    method act (line 60) | def act(self, obs):

FILE: spinup/algos/pytorch/td3/td3.py
  class ReplayBuffer (line 12) | class ReplayBuffer:
    method __init__ (line 17) | def __init__(self, obs_dim, act_dim, size):
    method store (line 25) | def store(self, obs, act, rew, next_obs, done):
    method sample_batch (line 34) | def sample_batch(self, batch_size=32):
  function td3 (line 45) | def td3(env_fn, actor_critic=core.MLPActorCritic, ac_kwargs=dict(), seed=0,

FILE: spinup/algos/pytorch/trpo/trpo.py
  function trpo (line 1) | def trpo(*args, **kwargs):

FILE: spinup/algos/pytorch/vpg/core.py
  function combined_shape (line 11) | def combined_shape(length, shape=None):
  function mlp (line 17) | def mlp(sizes, activation, output_activation=nn.Identity):
  function count_vars (line 25) | def count_vars(module):
  function discount_cumsum (line 29) | def discount_cumsum(x, discount):
  class Actor (line 47) | class Actor(nn.Module):
    method _distribution (line 49) | def _distribution(self, obs):
    method _log_prob_from_distribution (line 52) | def _log_prob_from_distribution(self, pi, act):
    method forward (line 55) | def forward(self, obs, act=None):
  class MLPCategoricalActor (line 66) | class MLPCategoricalActor(Actor):
    method __init__ (line 68) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
    method _distribution (line 72) | def _distribution(self, obs):
    method _log_prob_from_distribution (line 76) | def _log_prob_from_distribution(self, pi, act):
  class MLPGaussianActor (line 80) | class MLPGaussianActor(Actor):
    method __init__ (line 82) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
    method _distribution (line 88) | def _distribution(self, obs):
    method _log_prob_from_distribution (line 93) | def _log_prob_from_distribution(self, pi, act):
  class MLPCritic (line 97) | class MLPCritic(nn.Module):
    method __init__ (line 99) | def __init__(self, obs_dim, hidden_sizes, activation):
    method forward (line 103) | def forward(self, obs):
  class MLPActorCritic (line 108) | class MLPActorCritic(nn.Module):
    method __init__ (line 111) | def __init__(self, observation_space, action_space,
    method step (line 126) | def step(self, obs):
    method act (line 134) | def act(self, obs):

FILE: spinup/algos/pytorch/vpg/vpg.py
  class VPGBuffer (line 12) | class VPGBuffer:
    method __init__ (line 19) | def __init__(self, obs_dim, act_dim, size, gamma=0.99, lam=0.95):
    method store (line 30) | def store(self, obs, act, rew, val, logp):
    method finish_path (line 42) | def finish_path(self, last_val=0):
    method get (line 71) | def get(self):
  function vpg (line 88) | def vpg(env_fn, actor_critic=core.MLPActorCritic, ac_kwargs=dict(),  see...

FILE: spinup/algos/tf1/ddpg/core.py
  function placeholder (line 5) | def placeholder(dim=None):
  function placeholders (line 8) | def placeholders(*args):
  function mlp (line 11) | def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
  function get_vars (line 16) | def get_vars(scope):
  function count_vars (line 19) | def count_vars(scope):
  function mlp_actor_critic (line 26) | def mlp_actor_critic(x, a, hidden_sizes=(256,256), activation=tf.nn.relu,

FILE: spinup/algos/tf1/ddpg/ddpg.py
  class ReplayBuffer (line 10) | class ReplayBuffer:
    method __init__ (line 15) | def __init__(self, obs_dim, act_dim, size):
    method store (line 23) | def store(self, obs, act, rew, next_obs, done):
    method sample_batch (line 32) | def sample_batch(self, batch_size=32):
  function ddpg (line 42) | def ddpg(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), s...

FILE: spinup/algos/tf1/ppo/core.py
  function combined_shape (line 8) | def combined_shape(length, shape=None):
  function placeholder (line 13) | def placeholder(dim=None):
  function placeholders (line 16) | def placeholders(*args):
  function placeholder_from_space (line 19) | def placeholder_from_space(space):
  function placeholders_from_spaces (line 26) | def placeholders_from_spaces(*args):
  function mlp (line 29) | def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
  function get_vars (line 34) | def get_vars(scope=''):
  function count_vars (line 37) | def count_vars(scope=''):
  function gaussian_likelihood (line 41) | def gaussian_likelihood(x, mu, log_std):
  function discount_cumsum (line 45) | def discount_cumsum(x, discount):
  function mlp_categorical_policy (line 67) | def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activa...
  function mlp_gaussian_policy (line 77) | def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activatio...
  function mlp_actor_critic (line 91) | def mlp_actor_critic(x, a, hidden_sizes=(64,64), activation=tf.tanh,

FILE: spinup/algos/tf1/ppo/ppo.py
  class PPOBuffer (line 11) | class PPOBuffer:
    method __init__ (line 18) | def __init__(self, obs_dim, act_dim, size, gamma=0.99, lam=0.95):
    method store (line 29) | def store(self, obs, act, rew, val, logp):
    method finish_path (line 41) | def finish_path(self, last_val=0):
    method get (line 70) | def get(self):
  function ppo (line 86) | def ppo(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), se...

FILE: spinup/algos/tf1/sac/core.py
  function placeholder (line 6) | def placeholder(dim=None):
  function placeholders (line 9) | def placeholders(*args):
  function mlp (line 12) | def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
  function get_vars (line 17) | def get_vars(scope):
  function count_vars (line 20) | def count_vars(scope):
  function gaussian_likelihood (line 24) | def gaussian_likelihood(x, mu, log_std):
  function mlp_gaussian_policy (line 36) | def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activation):
  function apply_squashing_func (line 48) | def apply_squashing_func(mu, pi, logp_pi):
  function mlp_actor_critic (line 64) | def mlp_actor_critic(x, a, hidden_sizes=(256,256), activation=tf.nn.relu,

FILE: spinup/algos/tf1/sac/sac.py
  class ReplayBuffer (line 10) | class ReplayBuffer:
    method __init__ (line 15) | def __init__(self, obs_dim, act_dim, size):
    method store (line 23) | def store(self, obs, act, rew, next_obs, done):
    method sample_batch (line 32) | def sample_batch(self, batch_size=32):
  function sac (line 42) | def sac(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), se...

FILE: spinup/algos/tf1/td3/core.py
  function placeholder (line 5) | def placeholder(dim=None):
  function placeholders (line 8) | def placeholders(*args):
  function mlp (line 11) | def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
  function get_vars (line 16) | def get_vars(scope):
  function count_vars (line 19) | def count_vars(scope):
  function mlp_actor_critic (line 26) | def mlp_actor_critic(x, a, hidden_sizes=(256,256), activation=tf.nn.relu,

FILE: spinup/algos/tf1/td3/td3.py
  class ReplayBuffer (line 10) | class ReplayBuffer:
    method __init__ (line 15) | def __init__(self, obs_dim, act_dim, size):
    method store (line 23) | def store(self, obs, act, rew, next_obs, done):
    method sample_batch (line 32) | def sample_batch(self, batch_size=32):
  function td3 (line 42) | def td3(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), se...

FILE: spinup/algos/tf1/trpo/core.py
  function combined_shape (line 8) | def combined_shape(length, shape=None):
  function keys_as_sorted_list (line 13) | def keys_as_sorted_list(dict):
  function values_as_sorted_list (line 16) | def values_as_sorted_list(dict):
  function placeholder (line 19) | def placeholder(dim=None):
  function placeholders (line 22) | def placeholders(*args):
  function placeholder_from_space (line 25) | def placeholder_from_space(space):
  function placeholders_from_spaces (line 32) | def placeholders_from_spaces(*args):
  function mlp (line 35) | def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
  function get_vars (line 40) | def get_vars(scope=''):
  function count_vars (line 43) | def count_vars(scope=''):
  function gaussian_likelihood (line 47) | def gaussian_likelihood(x, mu, log_std):
  function diagonal_gaussian_kl (line 51) | def diagonal_gaussian_kl(mu0, log_std0, mu1, log_std1):
  function categorical_kl (line 62) | def categorical_kl(logp0, logp1):
  function flat_concat (line 70) | def flat_concat(xs):
  function flat_grad (line 73) | def flat_grad(f, params):
  function hessian_vector_product (line 76) | def hessian_vector_product(f, params):
  function assign_params_from_flat (line 82) | def assign_params_from_flat(x, params):
  function discount_cumsum (line 88) | def discount_cumsum(x, discount):
  function mlp_categorical_policy (line 109) | def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activa...
  function mlp_gaussian_policy (line 126) | def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activatio...
  function mlp_actor_critic (line 147) | def mlp_actor_critic(x, a, hidden_sizes=(64,64), activation=tf.tanh,

FILE: spinup/algos/tf1/trpo/trpo.py
  class GAEBuffer (line 13) | class GAEBuffer:
    method __init__ (line 20) | def __init__(self, obs_dim, act_dim, size, info_shapes, gamma=0.99, la...
    method store (line 33) | def store(self, obs, act, rew, val, logp, info):
    method finish_path (line 47) | def finish_path(self, last_val=0):
    method get (line 76) | def get(self):
  function trpo (line 92) | def trpo(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), s...

FILE: spinup/algos/tf1/vpg/core.py
  function combined_shape (line 8) | def combined_shape(length, shape=None):
  function placeholder (line 13) | def placeholder(dim=None):
  function placeholders (line 16) | def placeholders(*args):
  function placeholder_from_space (line 19) | def placeholder_from_space(space):
  function placeholders_from_spaces (line 26) | def placeholders_from_spaces(*args):
  function mlp (line 29) | def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
  function get_vars (line 34) | def get_vars(scope=''):
  function count_vars (line 37) | def count_vars(scope=''):
  function gaussian_likelihood (line 41) | def gaussian_likelihood(x, mu, log_std):
  function discount_cumsum (line 45) | def discount_cumsum(x, discount):
  function mlp_categorical_policy (line 67) | def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activa...
  function mlp_gaussian_policy (line 77) | def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activatio...
  function mlp_actor_critic (line 91) | def mlp_actor_critic(x, a, hidden_sizes=(64,64), activation=tf.tanh,

FILE: spinup/algos/tf1/vpg/vpg.py
  class VPGBuffer (line 11) | class VPGBuffer:
    method __init__ (line 18) | def __init__(self, obs_dim, act_dim, size, gamma=0.99, lam=0.95):
    method store (line 29) | def store(self, obs, act, rew, val, logp):
    method finish_path (line 41) | def finish_path(self, last_val=0):
    method get (line 70) | def get(self):
  function vpg (line 86) | def vpg(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), se...

FILE: spinup/examples/pytorch/pg_math/1_simple_pg.py
  function mlp (line 9) | def mlp(sizes, activation=nn.Tanh, output_activation=nn.Identity):
  function train (line 17) | def train(env_name='CartPole-v0', hidden_sizes=[32], lr=1e-2,

FILE: spinup/examples/pytorch/pg_math/2_rtg_pg.py
  function mlp (line 9) | def mlp(sizes, activation=nn.Tanh, output_activation=nn.Identity):
  function reward_to_go (line 17) | def reward_to_go(rews):
  function train (line 24) | def train(env_name='CartPole-v0', hidden_sizes=[32], lr=1e-2,

FILE: spinup/examples/tf1/pg_math/1_simple_pg.py
  function mlp (line 6) | def mlp(x, sizes, activation=tf.tanh, output_activation=None):
  function train (line 12) | def train(env_name='CartPole-v0', hidden_sizes=[32], lr=1e-2,

FILE: spinup/examples/tf1/pg_math/2_rtg_pg.py
  function mlp (line 6) | def mlp(x, sizes, activation=tf.tanh, output_activation=None):
  function reward_to_go (line 12) | def reward_to_go(rews):
  function train (line 19) | def train(env_name='CartPole-v0', hidden_sizes=[32], lr=1e-2,

FILE: spinup/examples/tf1/train_mnist.py
  function mlp (line 7) | def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
  function train_mnist (line 14) | def train_mnist(steps_per_epoch=100, epochs=5,

FILE: spinup/exercises/common.py
  function print_result (line 1) | def print_result(correct=False):

FILE: spinup/exercises/pytorch/problem_set_1/exercise1_1.py
  function gaussian_likelihood (line 16) | def gaussian_likelihood(x, mu, log_std):

FILE: spinup/exercises/pytorch/problem_set_1/exercise1_2.py
  function mlp (line 19) | def mlp(sizes, activation, output_activation=nn.Identity):
  class DiagonalGaussianDistribution (line 43) | class DiagonalGaussianDistribution:
    method __init__ (line 45) | def __init__(self, mu, log_std):
    method sample (line 49) | def sample(self):
    method log_prob (line 63) | def log_prob(self, value):
    method entropy (line 66) | def entropy(self):
  class MLPGaussianActor (line 71) | class MLPGaussianActor(nn.Module):
    method __init__ (line 73) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
    method forward (line 93) | def forward(self, obs, act=None):

FILE: spinup/exercises/pytorch/problem_set_1/exercise1_2_auxiliary.py
  function mlp (line 17) | def mlp(sizes, activation, output_activation=nn.Identity):
  class MLPCritic (line 25) | class MLPCritic(nn.Module):
    method __init__ (line 27) | def __init__(self, obs_dim, hidden_sizes, activation):
    method forward (line 31) | def forward(self, obs):
  class ExerciseActorCritic (line 35) | class ExerciseActorCritic(nn.Module):
    method __init__ (line 37) | def __init__(self, observation_space, action_space,
    method step (line 45) | def step(self, obs):
    method act (line 53) | def act(self, obs):

FILE: spinup/exercises/pytorch/problem_set_1/exercise1_3.py
  class ReplayBuffer (line 27) | class ReplayBuffer:
    method __init__ (line 32) | def __init__(self, obs_dim, act_dim, size):
    method store (line 40) | def store(self, obs, act, rew, next_obs, done):
    method sample_batch (line 49) | def sample_batch(self, batch_size=32):
  function td3 (line 60) | def td3(env_fn, actor_critic=core.MLPActorCritic, ac_kwargs=dict(), seed=0,

FILE: spinup/exercises/pytorch/problem_set_1_solutions/exercise1_1_soln.py
  function gaussian_likelihood (line 6) | def gaussian_likelihood(x, mu, log_std):

FILE: spinup/exercises/pytorch/problem_set_1_solutions/exercise1_2_soln.py
  function mlp (line 7) | def mlp(sizes, activation, output_activation=nn.Identity):
  function gaussian_likelihood (line 14) | def gaussian_likelihood(x, mu, log_std):
  class DiagonalGaussianDistribution (line 19) | class DiagonalGaussianDistribution:
    method __init__ (line 21) | def __init__(self, mu, log_std):
    method sample (line 25) | def sample(self):
    method log_prob (line 28) | def log_prob(self, value):
    method entropy (line 31) | def entropy(self):
  class MLPGaussianActor (line 35) | class MLPGaussianActor(nn.Module):
    method __init__ (line 37) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
    method forward (line 43) | def forward(self, obs, act=None):

FILE: spinup/exercises/pytorch/problem_set_2/exercise2_2.py
  class BuggedMLPActor (line 24) | class BuggedMLPActor(nn.Module):
    method __init__ (line 26) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation, act_lim...
    method forward (line 32) | def forward(self, obs):
  class BuggedMLPQFunction (line 36) | class BuggedMLPQFunction(nn.Module):
    method __init__ (line 38) | def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
    method forward (line 42) | def forward(self, obs, act):
  class BuggedMLPActorCritic (line 45) | class BuggedMLPActorCritic(nn.Module):
    method __init__ (line 47) | def __init__(self, observation_space, action_space, hidden_sizes=(256,...
    method act (line 59) | def act(self, obs):
  function ddpg_with_actor_critic (line 75) | def ddpg_with_actor_critic(bugged, **kwargs):

FILE: spinup/exercises/tf1/problem_set_1/exercise1_1.py
  function gaussian_likelihood (line 16) | def gaussian_likelihood(x, mu, log_std):

FILE: spinup/exercises/tf1/problem_set_1/exercise1_2.py
  function mlp (line 18) | def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
  function mlp_gaussian_policy (line 43) | def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activatio...

FILE: spinup/exercises/tf1/problem_set_1/exercise1_3.py
  class ReplayBuffer (line 25) | class ReplayBuffer:
    method __init__ (line 30) | def __init__(self, obs_dim, act_dim, size):
    method store (line 38) | def store(self, obs, act, rew, next_obs, done):
    method sample_batch (line 47) | def sample_batch(self, batch_size=32):
  function td3 (line 58) | def td3(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), se...

FILE: spinup/exercises/tf1/problem_set_1_solutions/exercise1_1_soln.py
  function gaussian_likelihood (line 6) | def gaussian_likelihood(x, mu, log_std):

FILE: spinup/exercises/tf1/problem_set_1_solutions/exercise1_2_soln.py
  function mlp (line 7) | def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
  function gaussian_likelihood (line 12) | def gaussian_likelihood(x, mu, log_std):
  function mlp_gaussian_policy (line 16) | def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activatio...

FILE: spinup/exercises/tf1/problem_set_2/exercise2_2.py
  function bugged_mlp_actor_critic (line 22) | def bugged_mlp_actor_critic(x, a, hidden_sizes=(400,300), activation=tf....
  function ddpg_with_actor_critic (line 46) | def ddpg_with_actor_critic(bugged, **kwargs):

FILE: spinup/run.py
  function add_with_backends (line 35) | def add_with_backends(algo_list):
  function friendly_err (line 43) | def friendly_err(err_msg):
  function parse_and_execute_grid_search (line 48) | def parse_and_execute_grid_search(cmd, args):

FILE: spinup/utils/logx.py
  function colorize (line 31) | def colorize(string, color, bold=False, highlight=False):
  function restore_tf_graph (line 44) | def restore_tf_graph(sess, fpath):
  class Logger (line 71) | class Logger:
    method __init__ (line 79) | def __init__(self, output_dir=None, output_fname='progress.txt', exp_n...
    method log (line 115) | def log(self, msg, color='green'):
    method log_tabular (line 120) | def log_tabular(self, key, val):
    method save_config (line 136) | def save_config(self, config):
    method save_state (line 162) | def save_state(self, state_dict, itr=None):
    method setup_tf_saver (line 194) | def setup_tf_saver(self, sess, inputs, outputs):
    method _tf_simple_save (line 216) | def _tf_simple_save(self, itr=None):
    method setup_pytorch_saver (line 234) | def setup_pytorch_saver(self, what_to_save):
    method _pytorch_simple_save (line 250) | def _pytorch_simple_save(self, itr=None):
    method dump_tabular (line 275) | def dump_tabular(self):
  class EpochLogger (line 303) | class EpochLogger(Logger):
    method __init__ (line 328) | def __init__(self, *args, **kwargs):
    method store (line 332) | def store(self, **kwargs):
    method log_tabular (line 344) | def log_tabular(self, key, val=None, with_min_and_max=False, average_o...
    method get_stats (line 377) | def get_stats(self, key):

FILE: spinup/utils/mpi_pytorch.py
  function setup_pytorch_for_mpi (line 8) | def setup_pytorch_for_mpi():
  function mpi_avg_grads (line 20) | def mpi_avg_grads(module):
  function sync_params (line 29) | def sync_params(module):

FILE: spinup/utils/mpi_tf.py
  function flat_concat (line 7) | def flat_concat(xs):
  function assign_params_from_flat (line 10) | def assign_params_from_flat(x, params):
  function sync_params (line 16) | def sync_params(params):
  function sync_all_params (line 24) | def sync_all_params():
  class MpiAdamOptimizer (line 29) | class MpiAdamOptimizer(tf.train.AdamOptimizer):
    method __init__ (line 41) | def __init__(self, **kwargs):
    method compute_gradients (line 45) | def compute_gradients(self, loss, var_list, **kwargs):
    method apply_gradients (line 71) | def apply_gradients(self, grads_and_vars, global_step=None, name=None):

FILE: spinup/utils/mpi_tools.py
  function mpi_fork (line 6) | def mpi_fork(n, bind_to_core=False):
  function msg (line 39) | def msg(m, string=''):
  function proc_id (line 42) | def proc_id():
  function allreduce (line 46) | def allreduce(*args, **kwargs):
  function num_procs (line 49) | def num_procs():
  function broadcast (line 53) | def broadcast(x, root=0):
  function mpi_op (line 56) | def mpi_op(x, op):
  function mpi_sum (line 63) | def mpi_sum(x):
  function mpi_avg (line 66) | def mpi_avg(x):
  function mpi_statistics_scalar (line 70) | def mpi_statistics_scalar(x, with_min_and_max=False):

FILE: spinup/utils/plot.py
  function plot_data (line 15) | def plot_data(data, xaxis='Epoch', value="AverageEpRet", condition="Cond...
  function get_datasets (line 61) | def get_datasets(logdir, condition=None):
  function get_all_datasets (line 103) | def get_all_datasets(all_logdirs, legend=None, select=None, exclude=None):
  function make_plots (line 154) | def make_plots(all_logdirs, legend=None, xaxis=None, values=None, count=...
  function main (line 166) | def main():

FILE: spinup/utils/run_utils.py
  function setup_logger_kwargs (line 25) | def setup_logger_kwargs(exp_name, seed=None, data_dir=None, datestamp=Fa...
  function call_experiment (line 89) | def call_experiment(exp_name, thunk, seed=0, num_cpu=1, data_dir=None,
  function all_bools (line 214) | def all_bools(vals):
  function valid_str (line 217) | def valid_str(v):
  class ExperimentGrid (line 240) | class ExperimentGrid:
    method __init__ (line 245) | def __init__(self, name=''):
    method name (line 252) | def name(self, _name):
    method print (line 256) | def print(self):
    method _default_shorthand (line 295) | def _default_shorthand(self, key):
    method add (line 306) | def add(self, key, vals, shorthand=None, in_name=False):
    method variant_name (line 339) | def variant_name(self, variant):
    method _variants (line 394) | def _variants(self, keys, vals):
    method variants (line 412) | def variants(self):
    method run (line 480) | def run(self, thunk, num_cpu=1, data_dir=None, datestamp=False):
  function test_eg (line 549) | def test_eg():

FILE: spinup/utils/serialization_utils.py
  function convert_json (line 3) | def convert_json(obj):
  function is_json_serializable (line 28) | def is_json_serializable(v):

FILE: spinup/utils/test_policy.py
  function load_policy_and_env (line 11) | def load_policy_and_env(fpath, itr='last', deterministic=False):
  function load_tf_policy (line 67) | def load_tf_policy(fpath, itr, deterministic=False):
  function load_pytorch_policy (line 92) | def load_pytorch_policy(fpath, itr, deterministic=False):
  function run_policy (line 110) | def run_policy(env, get_action, max_ep_len=None, num_episodes=100, rende...

FILE: spinup/version.py
  function get_version (line 5) | def get_version():

FILE: test/test_ppo.py
  class TestPPO (line 12) | class TestPPO(unittest.TestCase):
    method test_cartpole (line 13) | def test_cartpole(self):

Condensed preview: 117 files, each showing path, character count, and a content snippet (full structured content: 684K chars).

[
  {
    "path": ".gitignore",
    "chars": 94,
    "preview": "*.*~\n__pycache__/\n*.pkl\ndata/\n**/*.egg-info\n.python-version\n.idea/\n.vscode/\n.DS_Store\n_build/\n"
  },
  {
    "path": ".travis.yml",
    "chars": 363,
    "preview": "env:\n global:\n - LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/travis/.mujoco/mujoco200/bin\n\nmatrix:\n    include:\n        - os:"
  },
  {
    "path": "LICENSE",
    "chars": 1087,
    "preview": "The MIT License\n\nCopyright (c) 2018 OpenAI (http://openai.com)\n\nPermission is hereby granted, free of charge, to any per"
  },
  {
    "path": "docs/Makefile",
    "chars": 607,
    "preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS    =\nSPHI"
  },
  {
    "path": "docs/_static/css/modify.css",
    "chars": 4501,
    "preview": ":root {\n    /* Colors */\n    --color--white: #fff;\n    --color--lightwash: #f7fbfb;\n    --color--mediumwash: #eff7f8;\n  "
  },
  {
    "path": "docs/algorithms/ddpg.rst",
    "chars": 15388,
    "preview": "==================================\nDeep Deterministic Policy Gradient\n==================================\n\n.. contents:: "
  },
  {
    "path": "docs/algorithms/ppo.rst",
    "chars": 11439,
    "preview": "============================\nProximal Policy Optimization\n============================\n\n.. contents:: Table of Contents\n"
  },
  {
    "path": "docs/algorithms/sac.rst",
    "chars": 18528,
    "preview": "=================\nSoft Actor-Critic\n=================\n\n.. contents:: Table of Contents\n\nBackground\n==========\n\n(Previous"
  },
  {
    "path": "docs/algorithms/td3.rst",
    "chars": 10486,
    "preview": "=================\nTwin Delayed DDPG\n=================\n\n.. contents:: Table of Contents\n\nBackground\n==========\n\n(Previous"
  },
  {
    "path": "docs/algorithms/trpo.rst",
    "chars": 10733,
    "preview": "================================\nTrust Region Policy Optimization\n================================\n\n.. contents:: Table "
  },
  {
    "path": "docs/algorithms/vpg.rst",
    "chars": 7635,
    "preview": "=======================\nVanilla Policy Gradient\n=======================\n\n.. contents:: Table of Contents\n\n\nBackground\n=="
  },
  {
    "path": "docs/conf.py",
    "chars": 6951,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n#\n# Spinning Up documentation build configuration file, created by\n# sphi"
  },
  {
    "path": "docs/docs_requirements.txt",
    "chars": 202,
    "preview": "cloudpickle~=1.2.1\ngym~=0.15.3\nipython\njoblib\nmatplotlib\nnumpy\npandas\npytest\npsutil\nscipy\nseaborn==0.8.1\nsphinx==1.5.6\ns"
  },
  {
    "path": "docs/etc/acknowledgements.rst",
    "chars": 1017,
    "preview": "================\nAcknowledgements\n================\n\nWe gratefully acknowledge the contributions of the many people who h"
  },
  {
    "path": "docs/etc/author.rst",
    "chars": 405,
    "preview": "================\nAbout the Author\n================\n\nSpinning Up in Deep RL was primarily developed by Josh Achiam, a res"
  },
  {
    "path": "docs/images/rl_algorithms.xml",
    "chars": 2624,
    "preview": "<mxfile userAgent=\"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0\" version=\"9.1.2\" editor="
  },
  {
    "path": "docs/images/rl_algorithms_9_15.xml",
    "chars": 2616,
    "preview": "<mxfile userAgent=\"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0\" version=\"9.1.3\" editor="
  },
  {
    "path": "docs/index.rst",
    "chars": 1271,
    "preview": ".. Spinning Up documentation master file, created by\n   sphinx-quickstart on Wed Aug 15 04:21:07 2018.\n   You can adapt "
  },
  {
    "path": "docs/make.bat",
    "chars": 814,
    "preview": "@ECHO OFF\r\n\r\npushd %~dp0\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sp"
  },
  {
    "path": "docs/spinningup/bench.rst",
    "chars": 5240,
    "preview": "==========================================\nBenchmarks for Spinning Up Implementations\n=================================="
  },
  {
    "path": "docs/spinningup/bench_ddpg.rst",
    "chars": 533,
    "preview": "DDPG Head-to-Head\n=================\n\nHalfCheetah\n-----------\n\n.. figure:: ../images/plots/ddpg/ddpg_halfcheetah_performa"
  },
  {
    "path": "docs/spinningup/bench_ppo.rst",
    "chars": 571,
    "preview": "Proximal Policy Optimization Head-to-Head\n=========================================\n\nHalfCheetah\n-----------\n\n.. figure:"
  },
  {
    "path": "docs/spinningup/bench_sac.rst",
    "chars": 522,
    "preview": "SAC Head-to-Head\n=================\n\nHalfCheetah\n-----------\n\n.. figure:: ../images/plots/sac/sac_halfcheetah_performance"
  },
  {
    "path": "docs/spinningup/bench_td3.rst",
    "chars": 522,
    "preview": "TD3 Head-to-Head\n=================\n\nHalfCheetah\n-----------\n\n.. figure:: ../images/plots/td3/td3_halfcheetah_performance"
  },
  {
    "path": "docs/spinningup/bench_vpg.rst",
    "chars": 563,
    "preview": "Vanilla Policy Gradients Head-to-Head\n=====================================\n\nHalfCheetah\n-----------\n\n.. figure:: ../ima"
  },
  {
    "path": "docs/spinningup/exercise2_1_soln.rst",
    "chars": 438,
    "preview": "========================\nSolution to Exercise 2.1\n========================\n\n.. figure:: ../images/ex2-1_trpo_hopper.png\n"
  },
  {
    "path": "docs/spinningup/exercise2_2_soln.rst",
    "chars": 7637,
    "preview": "========================\nSolution to Exercise 2.2\n========================\n\n.. figure:: ../images/ex2-2_ddpg_bug.svg\n   "
  },
  {
    "path": "docs/spinningup/exercises.rst",
    "chars": 8150,
    "preview": "=========\nExercises\n=========\n\n\n.. contents:: Table of Contents\n    :depth: 2\n\nProblem Set 1: Basics of Implementation\n-"
  },
  {
    "path": "docs/spinningup/extra_pg_proof1.rst",
    "chars": 7756,
    "preview": "==============\nExtra Material\n==============\n\nProof for Don't Let the Past Distract You\n================================"
  },
  {
    "path": "docs/spinningup/extra_pg_proof2.rst",
    "chars": 2815,
    "preview": "==============\nExtra Material\n==============\n\nProof for Using Q-Function in Policy Gradient Formula\n===================="
  },
  {
    "path": "docs/spinningup/extra_tf_pg_implementation.rst",
    "chars": 8959,
    "preview": "==================================================================\nExtra Material: Tensorflow Policy Gradient Implementa"
  },
  {
    "path": "docs/spinningup/keypapers.rst",
    "chars": 19518,
    "preview": "=====================\nKey Papers in Deep RL\n=====================\n\nWhat follows is a list of papers in deep RL that are "
  },
  {
    "path": "docs/spinningup/rl_intro.rst",
    "chars": 24293,
    "preview": "==========================\nPart 1: Key Concepts in RL\n==========================\n\n\n.. contents:: Table of Contents\n    :"
  },
  {
    "path": "docs/spinningup/rl_intro2.rst",
    "chars": 12509,
    "preview": "==============================\nPart 2: Kinds of RL Algorithms\n==============================\n\n.. contents:: Table of Con"
  },
  {
    "path": "docs/spinningup/rl_intro3.rst",
    "chars": 26187,
    "preview": "====================================\nPart 3: Intro to Policy Optimization\n====================================\n\n.. conte"
  },
  {
    "path": "docs/spinningup/rl_intro4.rst",
    "chars": 185,
    "preview": "=========================\nLimitations and Frontiers\n=========================\n\n\nReward Design\n=============\n\n\nSample Com"
  },
  {
    "path": "docs/spinningup/spinningup.rst",
    "chars": 26941,
    "preview": "===================================\nSpinning Up as a Deep RL Researcher\n===================================\nBy Joshua Ac"
  },
  {
    "path": "docs/user/algorithms.rst",
    "chars": 7770,
    "preview": "==========\nAlgorithms\n==========\n\n.. contents:: Table of Contents\n\nWhat's Included\n===============\n\nThe following algori"
  },
  {
    "path": "docs/user/installation.rst",
    "chars": 6092,
    "preview": "============\nInstallation\n============\n\n\n.. contents:: Table of Contents\n\nSpinning Up requires Python3, OpenAI Gym, and "
  },
  {
    "path": "docs/user/introduction.rst",
    "chars": 9990,
    "preview": "============\nIntroduction\n============\n\n.. contents:: Table of Contents\n\nWhat This Is\n============\n\nWelcome to Spinning "
  },
  {
    "path": "docs/user/plotting.rst",
    "chars": 3617,
    "preview": "================\nPlotting Results\n================\n\nSpinning Up ships with a simple plotting utility for interpreting re"
  },
  {
    "path": "docs/user/running.rst",
    "chars": 14345,
    "preview": "===================\nRunning Experiments\n===================\n\n\n.. contents:: Table of Contents\n\nOne of the best ways to g"
  },
  {
    "path": "docs/user/saving_and_loading.rst",
    "chars": 11477,
    "preview": "==================\nExperiment Outputs\n==================\n\n.. contents:: Table of Contents\n\nIn this section we'll cover\n\n"
  },
  {
    "path": "docs/utils/logger.rst",
    "chars": 8627,
    "preview": "======\nLogger\n======\n\n.. contents:: Table of Contents\n\nUsing a Logger\n==============\n\nSpinning Up ships with basic loggi"
  },
  {
    "path": "docs/utils/mpi.rst",
    "chars": 1604,
    "preview": "=========\nMPI Tools\n=========\n\n.. contents:: Table of Contents\n\nCore MPI Utilities\n==================\n\n.. automodule:: s"
  },
  {
    "path": "docs/utils/plotter.rst",
    "chars": 141,
    "preview": "=======\nPlotter\n=======\n\nSee the page on `plotting results`_ for documentation of the plotter.\n\n.. _`plotting results`: "
  },
  {
    "path": "docs/utils/run_utils.rst",
    "chars": 594,
    "preview": "=========\nRun Utils\n=========\n\n.. contents:: Table of Contents\n\nExperimentGrid\n==============\n\nSpinning Up ships with a "
  },
  {
    "path": "readme.md",
    "chars": 1583,
    "preview": "**Status:** Maintenance (expect bug fixes and minor updates)\n\nWelcome to Spinning Up in Deep RL! \n======================"
  },
  {
    "path": "readthedocs.yml",
    "chars": 50,
    "preview": "build:\n    image: latest\n\npython:\n    version: 3.6"
  },
  {
    "path": "setup.py",
    "chars": 934,
    "preview": "from os.path import join, dirname, realpath\nfrom setuptools import setup\nimport sys\n\nassert sys.version_info.major == 3 "
  },
  {
    "path": "spinup/__init__.py",
    "chars": 996,
    "preview": "# Disable TF deprecation warnings.\n# Syntax from tf1 is not expected to be compatible with tf2.\nimport tensorflow as tf\n"
  },
  {
    "path": "spinup/algos/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "spinup/algos/pytorch/ddpg/core.py",
    "chars": 1993,
    "preview": "import numpy as np\nimport scipy.signal\n\nimport torch\nimport torch.nn as nn\n\n\ndef combined_shape(length, shape=None):\n   "
  },
  {
    "path": "spinup/algos/pytorch/ddpg/ddpg.py",
    "chars": 13016,
    "preview": "from copy import deepcopy\nimport numpy as np\nimport torch\nfrom torch.optim import Adam\nimport gym\nimport time\nimport spi"
  },
  {
    "path": "spinup/algos/pytorch/ppo/core.py",
    "chars": 4032,
    "preview": "import numpy as np\nimport scipy.signal\nfrom gym.spaces import Box, Discrete\n\nimport torch\nimport torch.nn as nn\nfrom tor"
  },
  {
    "path": "spinup/algos/pytorch/ppo/ppo.py",
    "chars": 16392,
    "preview": "import numpy as np\nimport torch\nfrom torch.optim import Adam\nimport gym\nimport time\nimport spinup.algos.pytorch.ppo.core"
  },
  {
    "path": "spinup/algos/pytorch/sac/core.py",
    "chars": 3568,
    "preview": "import numpy as np\nimport scipy.signal\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch.di"
  },
  {
    "path": "spinup/algos/pytorch/sac/sac.py",
    "chars": 14887,
    "preview": "from copy import deepcopy\nimport itertools\nimport numpy as np\nimport torch\nfrom torch.optim import Adam\nimport gym\nimpor"
  },
  {
    "path": "spinup/algos/pytorch/td3/core.py",
    "chars": 2069,
    "preview": "import numpy as np\nimport scipy.signal\n\nimport torch\nimport torch.nn as nn\n\n\ndef combined_shape(length, shape=None):\n   "
  },
  {
    "path": "spinup/algos/pytorch/td3/td3.py",
    "chars": 14823,
    "preview": "from copy import deepcopy\nimport itertools\nimport numpy as np\nimport torch\nfrom torch.optim import Adam\nimport gym\nimpor"
  },
  {
    "path": "spinup/algos/pytorch/trpo/trpo.py",
    "chars": 211,
    "preview": "def trpo(*args, **kwargs):\n    print('\\n\\nUnfortunately, TRPO has not yet been implemented in PyTorch '\\\n        + 'for "
  },
  {
    "path": "spinup/algos/pytorch/vpg/core.py",
    "chars": 4032,
    "preview": "import numpy as np\nimport scipy.signal\nfrom gym.spaces import Box, Discrete\n\nimport torch\nimport torch.nn as nn\nfrom tor"
  },
  {
    "path": "spinup/algos/pytorch/vpg/vpg.py",
    "chars": 14799,
    "preview": "import numpy as np\nimport torch\nfrom torch.optim import Adam\nimport gym\nimport time\nimport spinup.algos.pytorch.vpg.core"
  },
  {
    "path": "spinup/algos/tf1/ddpg/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "spinup/algos/tf1/ddpg/core.py",
    "chars": 1358,
    "preview": "import numpy as np\nimport tensorflow as tf\n\n\ndef placeholder(dim=None):\n    return tf.placeholder(dtype=tf.float32, shap"
  },
  {
    "path": "spinup/algos/tf1/ddpg/ddpg.py",
    "chars": 12489,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport gym\nimport time\nfrom spinup.algos.tf1.ddpg import core\nfrom spinup.alg"
  },
  {
    "path": "spinup/algos/tf1/ppo/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "spinup/algos/tf1/ppo/core.py",
    "chars": 3497,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport scipy.signal\nfrom gym.spaces import Box, Discrete\n\nEPS = 1e-8\n\ndef com"
  },
  {
    "path": "spinup/algos/tf1/ppo/ppo.py",
    "chars": 14296,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport gym\nimport time\nimport spinup.algos.tf1.ppo.core as core\nfrom spinup.u"
  },
  {
    "path": "spinup/algos/tf1/sac/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "spinup/algos/tf1/sac/core.py",
    "chars": 2825,
    "preview": "import numpy as np\nimport tensorflow as tf\n\nEPS = 1e-8\n\ndef placeholder(dim=None):\n    return tf.placeholder(dtype=tf.fl"
  },
  {
    "path": "spinup/algos/tf1/sac/sac.py",
    "chars": 14124,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport gym\nimport time\nfrom spinup.algos.tf1.sac import core\nfrom spinup.algo"
  },
  {
    "path": "spinup/algos/tf1/td3/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "spinup/algos/tf1/td3/core.py",
    "chars": 1508,
    "preview": "import numpy as np\nimport tensorflow as tf\n\n\ndef placeholder(dim=None):\n    return tf.placeholder(dtype=tf.float32, shap"
  },
  {
    "path": "spinup/algos/tf1/td3/td3.py",
    "chars": 13785,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport gym\nimport time\nfrom spinup.algos.tf1.td3 import core\nfrom spinup.algo"
  },
  {
    "path": "spinup/algos/tf1/trpo/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "spinup/algos/tf1/trpo/core.py",
    "chars": 5773,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport scipy.signal\nfrom gym.spaces import Box, Discrete\n\nEPS = 1e-8\n\ndef com"
  },
  {
    "path": "spinup/algos/tf1/trpo/trpo.py",
    "chars": 17943,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport gym\nimport time\nimport spinup.algos.tf1.trpo.core as core\nfrom spinup."
  },
  {
    "path": "spinup/algos/tf1/vpg/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "spinup/algos/tf1/vpg/core.py",
    "chars": 3497,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport scipy.signal\nfrom gym.spaces import Box, Discrete\n\nEPS = 1e-8\n\ndef com"
  },
  {
    "path": "spinup/algos/tf1/vpg/vpg.py",
    "chars": 12685,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport gym\nimport time\nimport spinup.algos.tf1.vpg.core as core\nfrom spinup.u"
  },
  {
    "path": "spinup/examples/pytorch/bench_ppo_cartpole.py",
    "chars": 712,
    "preview": "from spinup.utils.run_utils import ExperimentGrid\nfrom spinup import ppo_pytorch\nimport torch\n\nif __name__ == '__main__'"
  },
  {
    "path": "spinup/examples/pytorch/pg_math/1_simple_pg.py",
    "chars": 4913,
    "preview": "import torch\nimport torch.nn as nn\nfrom torch.distributions.categorical import Categorical\nfrom torch.optim import Adam\n"
  },
  {
    "path": "spinup/examples/pytorch/pg_math/2_rtg_pg.py",
    "chars": 5131,
    "preview": "import torch\nimport torch.nn as nn\nfrom torch.distributions.categorical import Categorical\nfrom torch.optim import Adam\n"
  },
  {
    "path": "spinup/examples/tf1/bench_ppo_cartpole.py",
    "chars": 706,
    "preview": "from spinup.utils.run_utils import ExperimentGrid\nfrom spinup import ppo_tf1\nimport tensorflow as tf\n\nif __name__ == '__"
  },
  {
    "path": "spinup/examples/tf1/pg_math/1_simple_pg.py",
    "chars": 4903,
    "preview": "import tensorflow as tf\nimport numpy as np\nimport gym\nfrom gym.spaces import Discrete, Box\n\ndef mlp(x, sizes, activation"
  },
  {
    "path": "spinup/examples/tf1/pg_math/2_rtg_pg.py",
    "chars": 5121,
    "preview": "import tensorflow as tf\nimport numpy as np\nimport gym\nfrom gym.spaces import Discrete, Box\n\ndef mlp(x, sizes, activation"
  },
  {
    "path": "spinup/examples/tf1/train_mnist.py",
    "chars": 2562,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport time\nfrom spinup.utils.logx import EpochLogger\n\n\ndef mlp(x, hidden_siz"
  },
  {
    "path": "spinup/exercises/common.py",
    "chars": 244,
    "preview": "def print_result(correct=False):\n    print('\\n'*5 + '='*50 + '\\n'*3)\n    if correct:\n        print(\"Congratulations! You"
  },
  {
    "path": "spinup/exercises/pytorch/problem_set_1/exercise1_1.py",
    "chars": 1507,
    "preview": "import torch\nimport numpy as np\n\n\"\"\"\n\nExercise 1.1: Diagonal Gaussian Likelihood\n\nWrite a function that takes in PyTorch"
  },
  {
    "path": "spinup/exercises/pytorch/problem_set_1/exercise1_2.py",
    "chars": 4341,
    "preview": "import torch\nimport torch.nn as nn\nimport numpy as np\nfrom spinup.exercises.pytorch.problem_set_1 import exercise1_1\nfro"
  },
  {
    "path": "spinup/exercises/pytorch/problem_set_1/exercise1_2_auxiliary.py",
    "chars": 1684,
    "preview": "import torch\nimport torch.nn as nn\nimport numpy as np\n\n\"\"\"\n\nAuxiliary code for Exercise 1.2. No part of the exercise req"
  },
  {
    "path": "spinup/exercises/pytorch/problem_set_1/exercise1_3.py",
    "chars": 16234,
    "preview": "from copy import deepcopy\nimport itertools\nimport numpy as np\nimport torch\nfrom torch.optim import Adam\nimport gym\nimpor"
  },
  {
    "path": "spinup/exercises/pytorch/problem_set_1_solutions/exercise1_1_soln.py",
    "chars": 205,
    "preview": "import torch\nimport numpy as np\n\nEPS=1e-8\n\ndef gaussian_likelihood(x, mu, log_std):\n    pre_sum = -0.5 * (((x-mu)/(torch"
  },
  {
    "path": "spinup/exercises/pytorch/problem_set_1_solutions/exercise1_2_soln.py",
    "chars": 1506,
    "preview": "import torch\nimport torch.nn as nn\nimport numpy as np\n\nEPS=1e-8\n\ndef mlp(sizes, activation, output_activation=nn.Identit"
  },
  {
    "path": "spinup/exercises/pytorch/problem_set_2/exercise2_2.py",
    "chars": 3299,
    "preview": "from spinup.algos.pytorch.ddpg.core import mlp, MLPActorCritic\nfrom spinup.utils.run_utils import ExperimentGrid\nfrom sp"
  },
  {
    "path": "spinup/exercises/tf1/problem_set_1/exercise1_1.py",
    "chars": 1815,
    "preview": "import tensorflow as tf\nimport numpy as np\n\n\"\"\"\n\nExercise 1.1: Diagonal Gaussian Likelihood\n\nWrite a function which take"
  },
  {
    "path": "spinup/exercises/tf1/problem_set_1/exercise1_2.py",
    "chars": 3459,
    "preview": "import tensorflow as tf\nimport numpy as np\nfrom spinup.exercises.tf1.problem_set_1 import exercise1_1\n\n\"\"\"\n\nExercise 1.2"
  },
  {
    "path": "spinup/exercises/tf1/problem_set_1/exercise1_3.py",
    "chars": 15412,
    "preview": "import numpy as np\nimport tensorflow as tf\nimport gym\nimport time\nfrom spinup.algos.tf1.td3 import core\nfrom spinup.algo"
  },
  {
    "path": "spinup/exercises/tf1/problem_set_1_solutions/exercise1_1_soln.py",
    "chars": 223,
    "preview": "import tensorflow as tf\nimport numpy as np\n\nEPS=1e-8\n\ndef gaussian_likelihood(x, mu, log_std):\n    pre_sum = -0.5 * (((x"
  },
  {
    "path": "spinup/exercises/tf1/problem_set_1_solutions/exercise1_2_soln.py",
    "chars": 989,
    "preview": "import tensorflow as tf\nimport numpy as np\n\n\nEPS = 1e-8\n\ndef mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activ"
  },
  {
    "path": "spinup/exercises/tf1/problem_set_2/exercise2_2.py",
    "chars": 2454,
    "preview": "from spinup.algos.tf1.ddpg.core import mlp, mlp_actor_critic\nfrom spinup.utils.run_utils import ExperimentGrid\nfrom spin"
  },
  {
    "path": "spinup/run.py",
    "chars": 9313,
    "preview": "import spinup\nfrom spinup.user_config import DEFAULT_BACKEND\nfrom spinup.utils.run_utils import ExperimentGrid\nfrom spin"
  },
  {
    "path": "spinup/user_config.py",
    "chars": 731,
    "preview": "import os\nimport os.path as osp\n\n# Default neural network backend for each algo\n# (Must be either 'tf1' or 'pytorch')\nDE"
  },
  {
    "path": "spinup/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "spinup/utils/logx.py",
    "chars": 14919,
    "preview": "\"\"\"\n\nSome simple logging functionality, inspired by rllab's logging.\n\nLogs to a tab-separated-values file (path/to/outpu"
  },
  {
    "path": "spinup/utils/mpi_pytorch.py",
    "chars": 1271,
    "preview": "import multiprocessing\nimport numpy as np\nimport os\nimport torch\nfrom mpi4py import MPI\nfrom spinup.utils.mpi_tools impo"
  },
  {
    "path": "spinup/utils/mpi_tf.py",
    "chars": 3088,
    "preview": "import numpy as np\nimport tensorflow as tf\nfrom mpi4py import MPI\nfrom spinup.utils.mpi_tools import broadcast\n\n\ndef fla"
  },
  {
    "path": "spinup/utils/mpi_tools.py",
    "chars": 2686,
    "preview": "from mpi4py import MPI\nimport os, subprocess, sys\nimport numpy as np\n\n\ndef mpi_fork(n, bind_to_core=False):\n    \"\"\"\n    "
  },
  {
    "path": "spinup/utils/plot.py",
    "chars": 9436,
    "preview": "import seaborn as sns\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport json\nimport os\nimport os.path as osp\nim"
  },
  {
    "path": "spinup/utils/run_entrypoint.py",
    "chars": 290,
    "preview": "import zlib\nimport pickle\nimport base64\n\nif __name__ == '__main__':\n    import argparse\n    parser = argparse.ArgumentPa"
  },
  {
    "path": "spinup/utils/run_utils.py",
    "chars": 19706,
    "preview": "from spinup.user_config import DEFAULT_DATA_DIR, FORCE_DATESTAMP, \\\n                               DEFAULT_SHORTHAND, WA"
  },
  {
    "path": "spinup/utils/serialization_utils.py",
    "chars": 946,
    "preview": "import json\n\ndef convert_json(obj):\n    \"\"\" Convert obj to a version which can be serialized with JSON. \"\"\"\n    if is_js"
  },
  {
    "path": "spinup/utils/test_policy.py",
    "chars": 5258,
    "preview": "import time\nimport joblib\nimport os\nimport os.path as osp\nimport tensorflow as tf\nimport torch\nfrom spinup import EpochL"
  },
  {
    "path": "spinup/version.py",
    "chars": 223,
    "preview": "version_info = (0, 2, 0)\n# format:\n# ('spinup_major', 'spinup_minor', 'spinup_patch')\n\ndef get_version():\n    \"Returns t"
  },
  {
    "path": "test/test_ppo.py",
    "chars": 589,
    "preview": "#!/usr/bin/env python\n\nimport unittest\nfrom functools import partial\n\nimport gym\nimport tensorflow as tf\n\nfrom spinup im"
  },
  {
    "path": "travis_setup.sh",
    "chars": 1155,
    "preview": "#!/usr/bin/env bash\n\nset -e\n\nmkdir -p $HOME/.mujoco\n\n# Avoid using pyenv in travis, since it adds ~7 minutes to turnarou"
  }
]

About this extraction

This page contains the full source code of the openai/spinningup GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 117 files (643.1 KB), approximately 165.9k tokens, and a symbol index with 356 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract, built by Nikandr Surkov.
