Full Code of openai/baselines for AI

Repository: openai/baselines
Branch: master
Commit: ea25b9e8b234
Files: 171
Total size: 1.7 MB

Directory structure:
gitextract_zotywiye/

├── .benchmark_pattern
├── .gitignore
├── .travis.yml
├── Dockerfile
├── LICENSE
├── README.md
├── baselines/
│   ├── __init__.py
│   ├── a2c/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── a2c.py
│   │   ├── runner.py
│   │   └── utils.py
│   ├── acer/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── acer.py
│   │   ├── buffer.py
│   │   ├── defaults.py
│   │   ├── policies.py
│   │   └── runner.py
│   ├── acktr/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── acktr.py
│   │   ├── defaults.py
│   │   ├── kfac.py
│   │   ├── kfac_utils.py
│   │   └── utils.py
│   ├── bench/
│   │   ├── __init__.py
│   │   ├── benchmarks.py
│   │   ├── monitor.py
│   │   └── test_monitor.py
│   ├── common/
│   │   ├── __init__.py
│   │   ├── atari_wrappers.py
│   │   ├── cg.py
│   │   ├── cmd_util.py
│   │   ├── console_util.py
│   │   ├── dataset.py
│   │   ├── distributions.py
│   │   ├── input.py
│   │   ├── math_util.py
│   │   ├── misc_util.py
│   │   ├── models.py
│   │   ├── mpi_adam.py
│   │   ├── mpi_adam_optimizer.py
│   │   ├── mpi_fork.py
│   │   ├── mpi_moments.py
│   │   ├── mpi_running_mean_std.py
│   │   ├── mpi_util.py
│   │   ├── plot_util.py
│   │   ├── policies.py
│   │   ├── retro_wrappers.py
│   │   ├── runners.py
│   │   ├── running_mean_std.py
│   │   ├── schedules.py
│   │   ├── segment_tree.py
│   │   ├── test_mpi_util.py
│   │   ├── tests/
│   │   │   ├── __init__.py
│   │   │   ├── envs/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── fixed_sequence_env.py
│   │   │   │   ├── identity_env.py
│   │   │   │   ├── identity_env_test.py
│   │   │   │   └── mnist_env.py
│   │   │   ├── test_cartpole.py
│   │   │   ├── test_doc_examples.py
│   │   │   ├── test_env_after_learn.py
│   │   │   ├── test_fetchreach.py
│   │   │   ├── test_fixed_sequence.py
│   │   │   ├── test_identity.py
│   │   │   ├── test_mnist.py
│   │   │   ├── test_plot_util.py
│   │   │   ├── test_schedules.py
│   │   │   ├── test_segment_tree.py
│   │   │   ├── test_serialization.py
│   │   │   ├── test_tf_util.py
│   │   │   ├── test_with_mpi.py
│   │   │   └── util.py
│   │   ├── tf_util.py
│   │   ├── tile_images.py
│   │   ├── vec_env/
│   │   │   ├── __init__.py
│   │   │   ├── dummy_vec_env.py
│   │   │   ├── shmem_vec_env.py
│   │   │   ├── subproc_vec_env.py
│   │   │   ├── test_vec_env.py
│   │   │   ├── test_video_recorder.py
│   │   │   ├── util.py
│   │   │   ├── vec_env.py
│   │   │   ├── vec_frame_stack.py
│   │   │   ├── vec_monitor.py
│   │   │   ├── vec_normalize.py
│   │   │   ├── vec_remove_dict_obs.py
│   │   │   └── vec_video_recorder.py
│   │   └── wrappers.py
│   ├── ddpg/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── ddpg.py
│   │   ├── ddpg_learner.py
│   │   ├── memory.py
│   │   ├── models.py
│   │   ├── noise.py
│   │   └── test_smoke.py
│   ├── deepq/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── build_graph.py
│   │   ├── deepq.py
│   │   ├── defaults.py
│   │   ├── experiments/
│   │   │   ├── __init__.py
│   │   │   ├── custom_cartpole.py
│   │   │   ├── enjoy_cartpole.py
│   │   │   ├── enjoy_mountaincar.py
│   │   │   ├── enjoy_pong.py
│   │   │   ├── train_cartpole.py
│   │   │   ├── train_mountaincar.py
│   │   │   └── train_pong.py
│   │   ├── models.py
│   │   ├── replay_buffer.py
│   │   └── utils.py
│   ├── gail/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── adversary.py
│   │   ├── behavior_clone.py
│   │   ├── dataset/
│   │   │   ├── __init__.py
│   │   │   └── mujoco_dset.py
│   │   ├── gail-eval.py
│   │   ├── mlp_policy.py
│   │   ├── result/
│   │   │   └── gail-result.md
│   │   ├── run_mujoco.py
│   │   ├── statistics.py
│   │   └── trpo_mpi.py
│   ├── her/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── actor_critic.py
│   │   ├── ddpg.py
│   │   ├── experiment/
│   │   │   ├── __init__.py
│   │   │   ├── config.py
│   │   │   ├── data_generation/
│   │   │   │   └── fetch_data_generation.py
│   │   │   ├── play.py
│   │   │   └── plot.py
│   │   ├── her.py
│   │   ├── her_sampler.py
│   │   ├── normalizer.py
│   │   ├── replay_buffer.py
│   │   ├── rollout.py
│   │   └── util.py
│   ├── logger.py
│   ├── ppo1/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── cnn_policy.py
│   │   ├── mlp_policy.py
│   │   ├── pposgd_simple.py
│   │   ├── run_atari.py
│   │   ├── run_humanoid.py
│   │   ├── run_mujoco.py
│   │   └── run_robotics.py
│   ├── ppo2/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── defaults.py
│   │   ├── microbatched_model.py
│   │   ├── model.py
│   │   ├── ppo2.py
│   │   ├── runner.py
│   │   └── test_microbatches.py
│   ├── results_plotter.py
│   ├── run.py
│   └── trpo_mpi/
│       ├── README.md
│       ├── __init__.py
│       ├── defaults.py
│       └── trpo_mpi.py
├── benchmarks_atari10M.htm
├── benchmarks_mujoco1M.htm
├── docs/
│   └── viz/
│       └── viz.ipynb
├── setup.cfg
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .benchmark_pattern
================================================



================================================
FILE: .gitignore
================================================
*.swp
*.pyc
*.pkl
*.py~
.pytest_cache
.DS_Store
.idea

# Setuptools distribution and build folders.
/dist/
/build
keys/

# Virtualenv
/env


*.sublime-project
*.sublime-workspace

.idea

logs/

.ipynb_checkpoints
ghostdriver.log

htmlcov

junk
src

*.egg-info
.cache

MUJOCO_LOG.TXT


================================================
FILE: .travis.yml
================================================
language: python
python:
    - "3.6"

services:
    - docker

install:
    - pip install flake8
    - docker build . -t baselines-test

script:
    - flake8 . --show-source --statistics
    - docker run -e RUNSLOW=1 baselines-test pytest -v .


================================================
FILE: Dockerfile
================================================
FROM python:3.6

RUN apt-get -y update && apt-get -y install ffmpeg
# RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake python-opencv

ENV CODE_DIR /root/code

COPY . $CODE_DIR/baselines
WORKDIR $CODE_DIR/baselines

# Clean up pycache and pyc files
RUN rm -rf __pycache__ && \
    find . -name "*.pyc" -delete && \
    pip install 'tensorflow < 2' && \
    pip install -e .[test]


CMD /bin/bash


================================================
FILE: LICENSE
================================================
The MIT License

Copyright (c) 2017 OpenAI (http://openai.com)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.


================================================
FILE: README.md
================================================
**Status:** Maintenance (expect bug fixes and minor updates)

<img src="data/logo.jpg" width=25% align="right" /> [![Build status](https://travis-ci.org/openai/baselines.svg?branch=master)](https://travis-ci.org/openai/baselines)

# Baselines

OpenAI Baselines is a set of high-quality implementations of reinforcement learning algorithms.

These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones. 

## Prerequisites 
Baselines requires Python 3 (>=3.5) with the development headers. You'll also need the system packages CMake, OpenMPI and zlib, which can be installed as follows:
### Ubuntu 
    
```bash
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
```
    
### Mac OS X
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
```bash
brew install cmake openmpi
```
    
## Virtual environment
To keep packages from different projects from interfering with each other, it is a good idea to use virtual environments (virtualenvs). You can install virtualenv (which is itself a pip package) via
```bash
pip install virtualenv
```
Virtualenvs are essentially folders that contain copies of the Python executable and all Python packages.
To create a virtualenv called venv with python3, run
```bash
virtualenv /path/to/venv --python=python3
```
To activate a virtualenv: 
```
. /path/to/venv/bin/activate
```
A more thorough tutorial on virtualenvs and their options can be found [here](https://virtualenv.pypa.io/en/stable/).


## Tensorflow versions
The master branch supports Tensorflow from version 1.4 to 1.14. For Tensorflow 2.0 support, please use the tf2 branch.

## Installation
- Clone the repo and cd into it:
    ```bash
    git clone https://github.com/openai/baselines.git
    cd baselines
    ```
- If you don't have TensorFlow installed already, install your favourite flavor of TensorFlow. In most cases, you may use
    ```bash 
    pip install tensorflow-gpu==1.14 # if you have a CUDA-compatible gpu and proper drivers
    ```
    or 
    ```bash
    pip install tensorflow==1.14
    ```
    to install Tensorflow 1.14, which is the latest version of Tensorflow supported by the master branch. Refer to the [TensorFlow installation guide](https://www.tensorflow.org/install/)
    for more details.

- Install baselines package
    ```bash
    pip install -e .
    ```

### MuJoCo
Some of the baselines examples use the [MuJoCo](http://www.mujoco.org) (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (a temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py).

## Testing the installation
All unit tests in baselines can be run using the pytest runner:
```
pip install pytest
pytest
```

## Training models
Most of the algorithms in the baselines repo are used as follows:
```bash
python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
```
### Example 1. PPO with MuJoCo Humanoid
For instance, to train a fully-connected network controlling the MuJoCo humanoid using PPO2 for 20M timesteps:
```bash
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
```
Note that for MuJoCo environments a fully-connected network is the default, so we can omit `--network=mlp`.
The hyperparameters for both the network and the learning algorithm can be controlled via the command line, for instance:
```bash
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
```
will set the entropy coefficient to 0.1, construct a fully connected network with 3 layers of 32 hidden units each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same).

See docstrings in [common/models.py](baselines/common/models.py) for description of network parameters for each type of model, and 
docstring for [baselines/ppo2/ppo2.py/learn()](baselines/ppo2/ppo2.py#L152) for the description of the ppo2 hyperparameters. 
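
As a rough illustration of how extra `--key=value` flags such as `--num_hidden=32` can end up as keyword arguments for the network builder, here is a minimal sketch. The helper name `parse_extra_args` is hypothetical and this is not the parser used in `baselines/run.py`; it only assumes that every extra flag has the `--key=value` form.

```python
import ast

def parse_extra_args(args):
    """Turn ['--ent_coef=0.1', '--value_network=copy'] into a dict (hypothetical helper)."""
    kwargs = {}
    for arg in args:
        if not arg.startswith('--') or '=' not in arg:
            raise ValueError('expected --key=value, got %r' % arg)
        key, value = arg[2:].split('=', 1)
        try:
            # Interpret numbers such as 0.1, 32 or 2e7 as Python literals.
            kwargs[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Anything that is not a literal (e.g. 'copy') stays a string.
            kwargs[key] = value
    return kwargs

extra = parse_extra_args(['--ent_coef=0.1', '--num_hidden=32', '--num_layers=3', '--value_network=copy'])
print(extra)
```

A dictionary like `extra` could then be forwarded as `**network_kwargs` to a `learn` function.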

### Example 2. DQN on Atari 
DQN with Atari is at this point a classic benchmark. To run the baselines implementation of DQN on Atari Pong:
```
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
```

## Saving, loading and visualizing models

### Saving and loading the model
The algorithms serialization API is not properly unified yet; however, there is a simple method to save / restore trained models. 
The `--load_path` command-line option loads the tensorflow state from a given path before training, and `--save_path` saves it after training.
Let's imagine you'd like to train ppo2 on Atari Pong, save the model, and later visualize what it has learnt.
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
```
This should reach a mean reward per episode of about 20. To load and visualize the model, we'll do the following: load the model, train it for 0 steps, and then visualize:
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
```

*NOTE:* Mujoco environments require normalization to work properly, so we wrap them with the VecNormalize wrapper. Currently, to ensure the models are saved with normalization (so that trained models can be restored and run without further training), the normalization coefficients are saved as tensorflow variables. This can decrease performance somewhat, so if you require high-throughput steps with Mujoco and do not need to save/restore the models, it may make sense to use numpy normalization instead. To do that, set `use_tf=False` in [baselines/run.py](baselines/run.py#L116).
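
To illustrate what observation normalization of this kind involves, the following is a minimal NumPy sketch of a running mean/variance estimate used to rescale observations. The class name `RunningNorm`, the `eps` term and the clipping range are assumptions made for illustration; this is not the `VecNormalize` implementation.

```python
import numpy as np

class RunningNorm:
    """Track a running mean and variance and use them to normalize observations (illustrative only)."""

    def __init__(self, shape, eps=1e-8, clip=10.0):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps
        self.clip = clip

    def update(self, batch):
        # Merge the statistics of a new batch (rows = samples) into the running estimate.
        batch = np.asarray(batch, dtype=np.float64)
        b_mean, b_var, b_count = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        total = self.count + b_count
        delta = b_mean - self.mean
        new_mean = self.mean + delta * b_count / total
        m2 = self.var * self.count + b_var * b_count + delta ** 2 * self.count * b_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, obs):
        # Center, scale to unit variance, and clip extreme values.
        scaled = (np.asarray(obs, dtype=np.float64) - self.mean) / np.sqrt(self.var + self.eps)
        return np.clip(scaled, -self.clip, self.clip)

data = np.arange(12, dtype=np.float64).reshape(6, 2)
norm = RunningNorm(shape=(2,))
norm.update(data[:2])
norm.update(data[2:])
print(norm.normalize(data[0]))
```

Updating in two batches yields the same mean and variance as computing them over all six rows at once, which is the property that makes an incremental estimate usable during training.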

### Logging and visualizing learning curves and other training metrics
By default, all summary data, including progress and standard output, is saved to a unique directory in a temp folder, specified by a call to Python's [tempfile.gettempdir()](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir).
The directory can be changed with the `--log_path` command-line option.
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2 --log_path=~/logs/Pong/
```
*NOTE:* Please be aware that the logger will overwrite files of the same name in an existing directory, thus it's recommended that folder names be given a unique timestamp to prevent overwritten logs.

Another way the temp directory can be changed is through the use of the `$OPENAI_LOGDIR` environment variable.
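
One way to follow the advice about unique folder names is to build a timestamped directory name and export it through `OPENAI_LOGDIR` before launching training. This is a small sketch; the base path `~/logs/Pong` and the helper name `make_log_dir` are placeholders chosen for illustration.

```python
import datetime
import os

def make_log_dir(base):
    """Return base/<YYYY-mm-dd-HH-MM-SS-ffffff>, expanding '~' (hypothetical helper)."""
    stamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S-%f')
    return os.path.join(os.path.expanduser(base), stamp)

log_dir = make_log_dir('~/logs/Pong')
# Processes started from this script (e.g. via subprocess) inherit the variable.
os.environ['OPENAI_LOGDIR'] = log_dir
print(log_dir)
```

The sketch only computes the path and sets the variable; it does not create the directory.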

For examples on how to load and display the training data, see [here](docs/viz/viz.ipynb).
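
As a lightweight alternative to the notebook, training metrics can also be read with the standard library. The sketch below assumes the log directory contains a CSV file (for example one named `progress.csv`) whose header row holds metric names such as `eprewmean`; both the file name and the helper `read_column` are assumptions for illustration, so check what your run actually wrote.

```python
import csv

def read_column(csv_path, column):
    """Return the values of one column as floats, skipping empty cells (hypothetical helper)."""
    values = []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            cell = row.get(column)
            if cell not in (None, ''):
                values.append(float(cell))
    return values
```

For example, `read_column(path_to_csv, 'eprewmean')` would return the logged mean episode rewards in order, ready to be plotted.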

## Subpackages

- [A2C](baselines/a2c)
- [ACER](baselines/acer)
- [ACKTR](baselines/acktr)
- [DDPG](baselines/ddpg)
- [DQN](baselines/deepq)
- [GAIL](baselines/gail)
- [HER](baselines/her)
- [PPO1](baselines/ppo1) (obsolete version, left here temporarily)
- [PPO2](baselines/ppo2) 
- [TRPO](baselines/trpo_mpi)



## Benchmarks
Results of benchmarks on Mujoco (1M timesteps) and Atari (10M timesteps) are available 
[here for Mujoco](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm) 
and
[here for Atari](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_atari10M.htm) 
respectively. Note that these results may not reflect the latest version of the code; the particular commit hash with which the results were obtained is specified on the benchmarks page.

To cite this repository in publications:

    @misc{baselines,
      author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai and Zhokhov, Peter},
      title = {OpenAI Baselines},
      year = {2017},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/openai/baselines}},
    }



================================================
FILE: baselines/__init__.py
================================================


================================================
FILE: baselines/a2c/README.md
================================================
# A2C

- Original paper: https://arxiv.org/abs/1602.01783
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)

## Files
- `run_atari`: file used to run the algorithm.
- `policies.py`: contains the different versions of the A2C architecture (MlpPolicy, CNNPolicy, LstmPolicy...).
- `a2c.py`:
    - `Model`: class used to initialize the step_model (sampling) and train_model (training)
    - `learn`: main entrypoint for the A2C algorithm. Trains a policy with a given network architecture on a given environment using the a2c algorithm.
- `runner.py`: class used to generate a batch of experiences
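
For intuition about what goes into a batch of experiences, the returns that the value function is trained against are discounted sums of rewards that restart at episode boundaries. The sketch below illustrates that idea in plain Python, in the spirit of the `discount_with_dones` helper that `runner.py` imports; the function name `discounted_returns` and its body are an illustrative assumption, not a copy of the repository code.

```python
def discounted_returns(rewards, dones, gamma):
    """Discounted reward-to-go per step, cut off wherever an episode ended."""
    returns = []
    running = 0.0
    for reward, done in zip(reversed(rewards), reversed(dones)):
        # A done flag means the episode ended at this step, so nothing after it is counted.
        running = reward + gamma * running * (1.0 - float(done))
        returns.append(running)
    return returns[::-1]

print(discounted_returns([1.0, 1.0, 1.0], [False, True, False], gamma=0.5))  # prints [1.5, 1.0, 1.0]
```

In the example, the second step ends an episode, so the third reward does not leak into the returns of the first two steps.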


================================================
FILE: baselines/a2c/__init__.py
================================================


================================================
FILE: baselines/a2c/a2c.py
================================================
import time
import functools
import tensorflow as tf

from baselines import logger

from baselines.common import set_global_seeds, explained_variance
from baselines.common import tf_util
from baselines.common.policies import build_policy


from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.runner import Runner
from baselines.ppo2.ppo2 import safemean
from collections import deque

from tensorflow import losses

class Model(object):

    """
    We use this class to :
        __init__:
        - Creates the step_model
        - Creates the train_model

        train():
        - Run one training step (forward pass and backpropagation of gradients)

        save/load():
        - Save / load the model
    """
    def __init__(self, policy, env, nsteps,
            ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5, lr=7e-4,
            alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):

        sess = tf_util.get_session()
        nenvs = env.num_envs
        nbatch = nenvs*nsteps


        with tf.variable_scope('a2c_model', reuse=tf.AUTO_REUSE):
            # step_model is used for sampling
            step_model = policy(nenvs, 1, sess)

            # train_model is used to train our network
            train_model = policy(nbatch, nsteps, sess)

        A = tf.placeholder(train_model.action.dtype, train_model.action.shape)
        ADV = tf.placeholder(tf.float32, [nbatch])
        R = tf.placeholder(tf.float32, [nbatch])
        LR = tf.placeholder(tf.float32, [])

        # Calculate the loss
        # Total loss = Policy gradient loss - entropy * entropy coefficient + Value coefficient * value loss

        # Policy loss
        neglogpac = train_model.pd.neglogp(A)
        # L = A(s,a) * -logpi(a|s)
        pg_loss = tf.reduce_mean(ADV * neglogpac)

        # Entropy is used to improve exploration by discouraging premature convergence to a suboptimal policy.
        entropy = tf.reduce_mean(train_model.pd.entropy())

        # Value loss
        vf_loss = losses.mean_squared_error(tf.squeeze(train_model.vf), R)

        loss = pg_loss - entropy*ent_coef + vf_loss * vf_coef

        # Update parameters using loss
        # 1. Get the model parameters
        params = find_trainable_variables("a2c_model")

        # 2. Calculate the gradients
        grads = tf.gradients(loss, params)
        if max_grad_norm is not None:
            # Clip the gradients (normalize)
            grads, grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
        grads = list(zip(grads, params))
        # zip pairs each gradient with its associated parameter
        # For instance zip(ABCD, xyza) => Ax, By, Cz, Da

        # 3. Make op for one policy and value update step of A2C
        trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=alpha, epsilon=epsilon)

        _train = trainer.apply_gradients(grads)

        lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)

        def train(obs, states, rewards, masks, actions, values):
            # Here we calculate advantage A(s,a) = R + yV(s') - V(s)
            # rewards = R + yV(s')
            advs = rewards - values
            for step in range(len(obs)):
                cur_lr = lr.value()

            td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, LR:cur_lr}
            if states is not None:
                td_map[train_model.S] = states
                td_map[train_model.M] = masks
            policy_loss, value_loss, policy_entropy, _ = sess.run(
                [pg_loss, vf_loss, entropy, _train],
                td_map
            )
            return policy_loss, value_loss, policy_entropy


        self.train = train
        self.train_model = train_model
        self.step_model = step_model
        self.step = step_model.step
        self.value = step_model.value
        self.initial_state = step_model.initial_state
        self.save = functools.partial(tf_util.save_variables, sess=sess)
        self.load = functools.partial(tf_util.load_variables, sess=sess)
        tf.global_variables_initializer().run(session=sess)


def learn(
    network,
    env,
    seed=None,
    nsteps=5,
    total_timesteps=int(80e6),
    vf_coef=0.5,
    ent_coef=0.01,
    max_grad_norm=0.5,
    lr=7e-4,
    lrschedule='linear',
    epsilon=1e-5,
    alpha=0.99,
    gamma=0.99,
    log_interval=100,
    load_path=None,
    **network_kwargs):

    '''
    Main entrypoint for A2C algorithm. Train a policy with given network architecture on a given environment using a2c algorithm.

    Parameters:
    -----------

    network:            policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
                        specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
                        tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
                        neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
                        See baselines.common/policies.py/lstm for more details on using recurrent nets in policies


    env:                RL environment. Should implement interface similar to VecEnv (baselines.common/vec_env) or be wrapped with DummyVecEnv (baselines.common/vec_env/dummy_vec_env.py)


    seed:               seed to make the random number sequence in the algorithm reproducible. Defaults to None, which means seeding from the system noise generator (not reproducible)

    nsteps:             int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
                        nenv is number of environment copies simulated in parallel)

    total_timesteps:    int, total number of timesteps to train on (default: 80M)

    vf_coef:            float, coefficient in front of value function loss in the total loss function (default: 0.5)

    ent_coef:           float, coefficient in front of the policy entropy in the total loss function (default: 0.01)

    max_grad_norm:      float, gradient is clipped to have global L2 norm no more than this value (default: 0.5)

    lr:                 float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)

    lrschedule:         schedule of learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes fraction of the training progress as input and
                        returns fraction of the learning rate (specified as lr) as output

    epsilon:            float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)

    alpha:              float, RMSProp decay parameter (default: 0.99)

    gamma:              float, reward discounting parameter (default: 0.99)

    log_interval:       int, specifies how frequently the logs are printed out (default: 100)

    **network_kwargs:   keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
                        For instance, 'mlp' network architecture has arguments num_hidden and num_layers.

    '''



    set_global_seeds(seed)

    # Get the number of environments
    nenvs = env.num_envs
    policy = build_policy(env, network, **network_kwargs)

    # Instantiate the model object (that creates step_model and train_model)
    model = Model(policy=policy, env=env, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
        max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule)
    if load_path is not None:
        model.load(load_path)

    # Instantiate the runner object
    runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
    epinfobuf = deque(maxlen=100)

    # Calculate the batch_size
    nbatch = nenvs*nsteps

    # Start total timer
    tstart = time.time()

    for update in range(1, total_timesteps//nbatch+1):
        # Get mini batch of experiences
        obs, states, rewards, masks, actions, values, epinfos = runner.run()
        epinfobuf.extend(epinfos)

        policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
        nseconds = time.time()-tstart

        # Calculate the fps (frame per second)
        fps = int((update*nbatch)/nseconds)
        if update % log_interval == 0 or update == 1:
            # Checks whether the value function is a good predictor of the returns (ev close to 1)
            # or whether it is no better than predicting nothing (ev <= 0)
            ev = explained_variance(values, rewards)
            logger.record_tabular("nupdates", update)
            logger.record_tabular("total_timesteps", update*nbatch)
            logger.record_tabular("fps", fps)
            logger.record_tabular("policy_entropy", float(policy_entropy))
            logger.record_tabular("value_loss", float(value_loss))
            logger.record_tabular("explained_variance", float(ev))
            logger.record_tabular("eprewmean", safemean([epinfo['r'] for epinfo in epinfobuf]))
            logger.record_tabular("eplenmean", safemean([epinfo['l'] for epinfo in epinfobuf]))
            logger.dump_tabular()
    return model



================================================
FILE: baselines/a2c/runner.py
================================================
import numpy as np
from baselines.a2c.utils import discount_with_dones
from baselines.common.runners import AbstractEnvRunner

class Runner(AbstractEnvRunner):
    """
    We use this class to generate batches of experiences

    __init__:
    - Initialize the runner

    run():
    - Make a mini batch of experiences
    """
    def __init__(self, env, model, nsteps=5, gamma=0.99):
        super().__init__(env=env, model=model, nsteps=nsteps)
        self.gamma = gamma
        self.batch_action_shape = [x if x is not None else -1 for x in model.train_model.action.shape.as_list()]
        self.ob_dtype = model.train_model.X.dtype.as_numpy_dtype

    def run(self):
        # We initialize the lists that will contain the mb of experiences
        mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
        mb_states = self.states
        epinfos = []
        for n in range(self.nsteps):
            # Given observations, take action and value (V(s))
            # We already have self.obs because the Runner superclass runs self.obs[:] = env.reset() on init
            actions, values, states, _ = self.model.step(self.obs, S=self.states, M=self.dones)

            # Append the experiences
            mb_obs.append(np.copy(self.obs))
            mb_actions.append(actions)
            mb_values.append(values)
            mb_dones.append(self.dones)

            # Take actions in the env and observe the results
            obs, rewards, dones, infos = self.env.step(actions)
            for info in infos:
                maybeepinfo = info.get('episode')
                if maybeepinfo: epinfos.append(maybeepinfo)
            self.states = states
            self.dones = dones
            self.obs = obs
            mb_rewards.append(rewards)
        mb_dones.append(self.dones)

        # Batch of steps to batch of rollouts
        mb_obs = np.asarray(mb_obs, dtype=self.ob_dtype).swapaxes(1, 0).reshape(self.batch_ob_shape)
        mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
        mb_actions = np.asarray(mb_actions, dtype=self.model.train_model.action.dtype.name).swapaxes(1, 0)
        mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
        mb_dones = np.asarray(mb_dones, dtype=bool).swapaxes(1, 0)  # builtin bool; the np.bool alias was removed in NumPy 1.24
        mb_masks = mb_dones[:, :-1]
        mb_dones = mb_dones[:, 1:]


        if self.gamma > 0.0:
            # Discount/bootstrap off value fn
            last_values = self.model.value(self.obs, S=self.states, M=self.dones).tolist()
            for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
                rewards = rewards.tolist()
                dones = dones.tolist()
                if dones[-1] == 0:
                    rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
                else:
                    rewards = discount_with_dones(rewards, dones, self.gamma)

                mb_rewards[n] = rewards

        mb_actions = mb_actions.reshape(self.batch_action_shape)

        mb_rewards = mb_rewards.flatten()
        mb_values = mb_values.flatten()
        mb_masks = mb_masks.flatten()
        return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values, epinfos


================================================
FILE: baselines/a2c/utils.py
================================================
import os
import numpy as np
import tensorflow as tf
from collections import deque

def sample(logits):
    noise = tf.random_uniform(tf.shape(logits))
    return tf.argmax(logits - tf.log(-tf.log(noise)), 1)

def cat_entropy(logits):
    a0 = logits - tf.reduce_max(logits, 1, keepdims=True)
    ea0 = tf.exp(a0)
    z0 = tf.reduce_sum(ea0, 1, keepdims=True)
    p0 = ea0 / z0
    return tf.reduce_sum(p0 * (tf.log(z0) - a0), 1)

def cat_entropy_softmax(p0):
    return - tf.reduce_sum(p0 * tf.log(p0 + 1e-6), axis = 1)

def ortho_init(scale=1.0):
    def _ortho_init(shape, dtype, partition_info=None):
        #lasagne ortho init for tf
        shape = tuple(shape)
        if len(shape) == 2:
            flat_shape = shape
        elif len(shape) == 4: # assumes NHWC
            flat_shape = (np.prod(shape[:-1]), shape[-1])
        else:
            raise NotImplementedError
        a = np.random.normal(0.0, 1.0, flat_shape)
        u, _, v = np.linalg.svd(a, full_matrices=False)
        q = u if u.shape == flat_shape else v # pick the one with the correct shape
        q = q.reshape(shape)
        return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
    return _ortho_init

def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC', one_dim_bias=False):
    if data_format == 'NHWC':
        channel_ax = 3
        strides = [1, stride, stride, 1]
        bshape = [1, 1, 1, nf]
    elif data_format == 'NCHW':
        channel_ax = 1
        strides = [1, 1, stride, stride]
        bshape = [1, nf, 1, 1]
    else:
        raise NotImplementedError
    bias_var_shape = [nf] if one_dim_bias else [1, nf, 1, 1]
    nin = x.get_shape()[channel_ax].value
    wshape = [rf, rf, nin, nf]
    with tf.variable_scope(scope):
        w = tf.get_variable("w", wshape, initializer=ortho_init(init_scale))
        b = tf.get_variable("b", bias_var_shape, initializer=tf.constant_initializer(0.0))
        if not one_dim_bias and data_format == 'NHWC':
            b = tf.reshape(b, bshape)
        return tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format) + b

def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
    with tf.variable_scope(scope):
        nin = x.get_shape()[1].value
        w = tf.get_variable("w", [nin, nh], initializer=ortho_init(init_scale))
        b = tf.get_variable("b", [nh], initializer=tf.constant_initializer(init_bias))
        return tf.matmul(x, w)+b

def batch_to_seq(h, nbatch, nsteps, flat=False):
    if flat:
        h = tf.reshape(h, [nbatch, nsteps])
    else:
        h = tf.reshape(h, [nbatch, nsteps, -1])
    return [tf.squeeze(v, [1]) for v in tf.split(axis=1, num_or_size_splits=nsteps, value=h)]

def seq_to_batch(h, flat = False):
    shape = h[0].get_shape().as_list()
    if not flat:
        assert(len(shape) > 1)
        nh = h[0].get_shape()[-1].value
        return tf.reshape(tf.concat(axis=1, values=h), [-1, nh])
    else:
        return tf.reshape(tf.stack(values=h, axis=1), [-1])

def lstm(xs, ms, s, scope, nh, init_scale=1.0):
    nbatch, nin = [v.value for v in xs[0].get_shape()]
    with tf.variable_scope(scope):
        wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
        wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
        b = tf.get_variable("b", [nh*4], initializer=tf.constant_initializer(0.0))

    c, h = tf.split(axis=1, num_or_size_splits=2, value=s)
    for idx, (x, m) in enumerate(zip(xs, ms)):
        c = c*(1-m)
        h = h*(1-m)
        z = tf.matmul(x, wx) + tf.matmul(h, wh) + b
        i, f, o, u = tf.split(axis=1, num_or_size_splits=4, value=z)
        i = tf.nn.sigmoid(i)
        f = tf.nn.sigmoid(f)
        o = tf.nn.sigmoid(o)
        u = tf.tanh(u)
        c = f*c + i*u
        h = o*tf.tanh(c)
        xs[idx] = h
    s = tf.concat(axis=1, values=[c, h])
    return xs, s

def _ln(x, g, b, e=1e-5, axes=[1]):
    u, s = tf.nn.moments(x, axes=axes, keep_dims=True)
    x = (x-u)/tf.sqrt(s+e)
    x = x*g+b
    return x

def lnlstm(xs, ms, s, scope, nh, init_scale=1.0):
    nbatch, nin = [v.value for v in xs[0].get_shape()]
    with tf.variable_scope(scope):
        wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
        gx = tf.get_variable("gx", [nh*4], initializer=tf.constant_initializer(1.0))
        bx = tf.get_variable("bx", [nh*4], initializer=tf.constant_initializer(0.0))

        wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
        gh = tf.get_variable("gh", [nh*4], initializer=tf.constant_initializer(1.0))
        bh = tf.get_variable("bh", [nh*4], initializer=tf.constant_initializer(0.0))

        b = tf.get_variable("b", [nh*4], initializer=tf.constant_initializer(0.0))

        gc = tf.get_variable("gc", [nh], initializer=tf.constant_initializer(1.0))
        bc = tf.get_variable("bc", [nh], initializer=tf.constant_initializer(0.0))

    c, h = tf.split(axis=1, num_or_size_splits=2, value=s)
    for idx, (x, m) in enumerate(zip(xs, ms)):
        c = c*(1-m)
        h = h*(1-m)
        z = _ln(tf.matmul(x, wx), gx, bx) + _ln(tf.matmul(h, wh), gh, bh) + b
        i, f, o, u = tf.split(axis=1, num_or_size_splits=4, value=z)
        i = tf.nn.sigmoid(i)
        f = tf.nn.sigmoid(f)
        o = tf.nn.sigmoid(o)
        u = tf.tanh(u)
        c = f*c + i*u
        h = o*tf.tanh(_ln(c, gc, bc))
        xs[idx] = h
    s = tf.concat(axis=1, values=[c, h])
    return xs, s

def conv_to_fc(x):
    nh = np.prod([v.value for v in x.get_shape()[1:]])
    x = tf.reshape(x, [-1, nh])
    return x

def discount_with_dones(rewards, dones, gamma):
    discounted = []
    r = 0
    for reward, done in zip(rewards[::-1], dones[::-1]):
        r = reward + gamma*r*(1.-done) # fixed off by one bug
        discounted.append(r)
    return discounted[::-1]

def find_trainable_variables(key):
    return tf.trainable_variables(key)

def make_path(f):
    return os.makedirs(f, exist_ok=True)

def constant(p):
    return 1

def linear(p):
    return 1-p

def middle_drop(p):
    eps = 0.75
    if 1-p<eps:
        return eps*0.1
    return 1-p

def double_linear_con(p):
    p *= 2
    eps = 0.125
    if 1-p<eps:
        return eps
    return 1-p

def double_middle_drop(p):
    eps1 = 0.75
    eps2 = 0.25
    if 1-p<eps1:
        if 1-p<eps2:
            return eps2*0.5
        return eps1*0.1
    return 1-p

schedules = {
    'linear':linear,
    'constant':constant,
    'double_linear_con': double_linear_con,
    'middle_drop': middle_drop,
    'double_middle_drop': double_middle_drop
}

class Scheduler(object):

    def __init__(self, v, nvalues, schedule):
        self.n = 0.
        self.v = v
        self.nvalues = nvalues
        self.schedule = schedules[schedule]

    def value(self):
        current_value = self.v*self.schedule(self.n/self.nvalues)
        self.n += 1.
        return current_value

    def value_steps(self, steps):
        return self.v*self.schedule(steps/self.nvalues)


class EpisodeStats:
    def __init__(self, nsteps, nenvs):
        self.episode_rewards = []
        for i in range(nenvs):
            self.episode_rewards.append([])
        self.lenbuffer = deque(maxlen=40)  # rolling buffer for episode lengths
        self.rewbuffer = deque(maxlen=40)  # rolling buffer for episode rewards
        self.nsteps = nsteps
        self.nenvs = nenvs

    def feed(self, rewards, masks):
        rewards = np.reshape(rewards, [self.nenvs, self.nsteps])
        masks = np.reshape(masks, [self.nenvs, self.nsteps])
        for i in range(0, self.nenvs):
            for j in range(0, self.nsteps):
                self.episode_rewards[i].append(rewards[i][j])
                if masks[i][j]:
                    l = len(self.episode_rewards[i])
                    s = sum(self.episode_rewards[i])
                    self.lenbuffer.append(l)
                    self.rewbuffer.append(s)
                    self.episode_rewards[i] = []

    def mean_length(self):
        if self.lenbuffer:
            return np.mean(self.lenbuffer)
        else:
            return 0  # on the first params dump, no episodes are finished

    def mean_reward(self):
        if self.rewbuffer:
            return np.mean(self.rewbuffer)
        else:
            return 0


# For ACER
def get_by_index(x, idx):
    assert(len(x.get_shape()) == 2)
    assert(len(idx.get_shape()) == 1)
    idx_flattened = tf.range(0, x.shape[0]) * x.shape[1] + idx
    y = tf.gather(tf.reshape(x, [-1]),  # flatten input
                  idx_flattened)  # use flattened indices
    return y

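def check_shape(ts,shapes):
    # Note: zip stops at the shorter list, so tensors without a matching shape entry are not checked.
    for i, (t, shape) in enumerate(zip(ts, shapes)):
        assert t.get_shape().as_list()==shape, "id " + str(i) + " shape " + str(t.get_shape()) + " expected " + str(shape)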

def avg_norm(t):
    return tf.reduce_mean(tf.sqrt(tf.reduce_sum(tf.square(t), axis=-1)))

def gradient_add(g1, g2, param):
    print([g1, g2, param.name])
    assert (not (g1 is None and g2 is None)), param.name
    if g1 is None:
        return g2
    elif g2 is None:
        return g1
    else:
        return g1 + g2

def q_explained_variance(qpred, q):
    _, vary = tf.nn.moments(q, axes=[0, 1])
    _, varpred = tf.nn.moments(q - qpred, axes=[0, 1])
    check_shape([vary, varpred], [[]] * 2)
    return 1.0 - (varpred / vary)


================================================
FILE: baselines/acer/README.md
================================================
# ACER

- Original paper: https://arxiv.org/abs/1611.01224
- `python -m baselines.run --alg=acer --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on Atari Pong. See help (`-h`) for more options.
- Also refer to the repo-wide [README.md](../../README.md#training-models).
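
The Retrace target that `q_retrace` in `acer.py` builds as a TensorFlow graph can be sketched in plain NumPy for a single environment. This is a minimal illustration rather than the repository's implementation: the function name `q_retrace_numpy` and its argument names are made up for this example.

```python
import numpy as np

def q_retrace_numpy(rewards, dones, q_taken, values, rho_taken, gamma):
    """Retrace targets for one environment (illustrative sketch).

    rewards, dones, q_taken, rho_taken: sequences of length nsteps
    values: sequence of length nsteps + 1; the last entry bootstraps the recursion
    """
    nsteps = len(rewards)
    # Truncate the importance weights at 1, as q_retrace does with tf.minimum
    rho_bar = np.minimum(1.0, np.asarray(rho_taken, dtype=np.float64))
    qret = float(values[-1])
    targets = np.zeros(nsteps)
    # Walk backwards through time, from the last step to the first
    for t in reversed(range(nsteps)):
        # One-step backup; a done flag cuts off the bootstrap
        qret = rewards[t] + gamma * qret * (1.0 - dones[t])
        targets[t] = qret
        # Correct towards the state value before moving one step earlier
        qret = rho_bar[t] * (qret - q_taken[t]) + values[t]
    return targets
```

When every importance weight is 1 and each `q_taken[t]` equals `values[t]`, the correction term cancels and the recursion reduces to the ordinary n-step discounted return bootstrapped from `values[-1]`.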



================================================
FILE: baselines/acer/__init__.py
================================================


================================================
FILE: baselines/acer/acer.py
================================================
import time
import functools
import numpy as np
import tensorflow as tf
from baselines import logger

from baselines.common import set_global_seeds
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables, load_variables
from baselines.common.vec_env.vec_frame_stack import VecFrameStack

from baselines.a2c.utils import batch_to_seq, seq_to_batch
from baselines.a2c.utils import cat_entropy_softmax
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.utils import EpisodeStats
from baselines.a2c.utils import get_by_index, check_shape, avg_norm, gradient_add, q_explained_variance
from baselines.acer.buffer import Buffer
from baselines.acer.runner import Runner

# remove last step
def strip(var, nenvs, nsteps, flat = False):
    # split into nsteps + 1 per-step tensors, drop the final one, and merge back into a batch
    seq = batch_to_seq(var, nenvs, nsteps + 1, flat)
    return seq_to_batch(seq[:-1], flat)

def q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma):
    """
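    Calculates q_retrace targets

    :param R: Rewards, shape [nenvs * nsteps]
    :param D: Dones, shape [nenvs * nsteps]
    :param q_i: Q values for actions taken, shape [nenvs * nsteps]
    :param v: V values, shape [nenvs * (nsteps + 1)] (the extra step bootstraps the recursion)
    :param rho_i: Importance weight for each action taken, shape [nenvs * nsteps]
    :param nenvs: Number of parallel environments
    :param nsteps: Number of steps per environment in the batch
    :param gamma: Discount factor
    :return: Q_retrace values, shape [nenvs * nsteps]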
    """
    rho_bar = batch_to_seq(tf.minimum(1.0, rho_i), nenvs, nsteps, True)  # list of len steps, shape [nenvs]
    rs = batch_to_seq(R, nenvs, nsteps, True)  # list of len steps, shape [nenvs]
    ds = batch_to_seq(D, nenvs, nsteps, True)  # list of len steps, shape [nenvs]
    q_is = batch_to_seq(q_i, nenvs, nsteps, True)
    vs = batch_to_seq(v, nenvs, nsteps + 1, True)
    v_final = vs[-1]
    qret = v_final
    qrets = []
    for i in range(nsteps - 1, -1, -1):
        check_shape([qret, ds[i], rs[i], rho_bar[i], q_is[i], vs[i]], [[nenvs]] * 6)
        qret = rs[i] + gamma * qret * (1.0 - ds[i])
        qrets.append(qret)
        qret = (rho_bar[i] * (qret - q_is[i])) + vs[i]
    qrets = qrets[::-1]
    qret = seq_to_batch(qrets, flat=True)
    return qret

# For ACER with PPO clipping instead of trust region
# def clip(ratio, eps_clip):
#     # assume 0 <= eps_clip <= 1
#     return tf.minimum(1 + eps_clip, tf.maximum(1 - eps_clip, ratio))

class Model(object):
    def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, ent_coef, q_coef, gamma, max_grad_norm, lr,
                 rprop_alpha, rprop_epsilon, total_timesteps, lrschedule,
                 c, trust_region, alpha, delta):

        sess = get_session()
        nact = ac_space.n
        nbatch = nenvs * nsteps

        A = tf.placeholder(tf.int32, [nbatch]) # actions
        D = tf.placeholder(tf.float32, [nbatch]) # dones
        R = tf.placeholder(tf.float32, [nbatch]) # rewards, not returns
        MU = tf.placeholder(tf.float32, [nbatch, nact]) # mu's
        LR = tf.placeholder(tf.float32, [])
        eps = 1e-6

        step_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs,) + ob_space.shape)
        train_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs*(nsteps+1),) + ob_space.shape)
        with tf.variable_scope('acer_model', reuse=tf.AUTO_REUSE):

            step_model = policy(nbatch=nenvs, nsteps=1, observ_placeholder=step_ob_placeholder, sess=sess)
            train_model = policy(nbatch=nbatch, nsteps=nsteps, observ_placeholder=train_ob_placeholder, sess=sess)


        params = find_trainable_variables("acer_model")
        print("Params {}".format(len(params)))
        for var in params:
            print(var)

        # create polyak averaged model
        ema = tf.train.ExponentialMovingAverage(alpha)
        ema_apply_op = ema.apply(params)

        def custom_getter(getter, *args, **kwargs):
            v = ema.average(getter(*args, **kwargs))
            print(v.name)
            return v

        with tf.variable_scope("acer_model", custom_getter=custom_getter, reuse=True):
            polyak_model = policy(nbatch=nbatch, nsteps=nsteps, observ_placeholder=train_ob_placeholder, sess=sess)

        # Notation: (var) = batch variable, (var)s = sequence variable, (var)_i = variable indexed by the action at step i

        # action probability distributions according to train_model, polyak_model and step_model
        # policy.pi holds the distribution parameters (logits); take the softmax to obtain a distribution that sums to 1
        train_model_p = tf.nn.softmax(train_model.pi)
        polyak_model_p = tf.nn.softmax(polyak_model.pi)
        step_model_p = tf.nn.softmax(step_model.pi)
        v = tf.reduce_sum(train_model_p * train_model.q, axis = -1) # shape is [nenvs * (nsteps + 1)]

        # strip off last step
        f, f_pol, q = map(lambda var: strip(var, nenvs, nsteps), [train_model_p, polyak_model_p, train_model.q])
        # Get pi and q values for actions taken
        f_i = get_by_index(f, A)
        q_i = get_by_index(q, A)

        # Compute ratios for importance truncation
        rho = f / (MU + eps)
        rho_i = get_by_index(rho, A)

        # Calculate Q_retrace targets
        qret = q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma)

        # Calculate losses
        # Entropy
        # entropy = tf.reduce_mean(strip(train_model.pd.entropy(), nenvs, nsteps))
        entropy = tf.reduce_mean(cat_entropy_softmax(f))

        # Policy Gradient loss, with truncated importance sampling & bias correction
        v = strip(v, nenvs, nsteps, True)
        check_shape([qret, v, rho_i, f_i], [[nenvs * nsteps]] * 4)
        check_shape([rho, f, q], [[nenvs * nsteps, nact]] * 3)

        # Truncated importance sampling
        adv = qret - v
        logf = tf.log(f_i + eps)
        gain_f = logf * tf.stop_gradient(adv * tf.minimum(c, rho_i))  # [nenvs * nsteps]
        loss_f = -tf.reduce_mean(gain_f)

        # Bias correction for the truncation
        adv_bc = (q - tf.reshape(v, [nenvs * nsteps, 1]))  # [nenvs * nsteps, nact]
        logf_bc = tf.log(f + eps) # / (f_old + eps)
        check_shape([adv_bc, logf_bc], [[nenvs * nsteps, nact]]*2)
        gain_bc = tf.reduce_sum(logf_bc * tf.stop_gradient(adv_bc * tf.nn.relu(1.0 - (c / (rho + eps))) * f), axis = 1) #IMP: This is sum, as expectation wrt f
        loss_bc= -tf.reduce_mean(gain_bc)

        loss_policy = loss_f + loss_bc

        # Value/Q function loss, and explained variance
        check_shape([qret, q_i], [[nenvs * nsteps]]*2)
        ev = q_explained_variance(tf.reshape(q_i, [nenvs, nsteps]), tf.reshape(qret, [nenvs, nsteps]))
        loss_q = tf.reduce_mean(tf.square(tf.stop_gradient(qret) - q_i)*0.5)

        # Net loss
        check_shape([loss_policy, loss_q, entropy], [[]] * 3)
        loss = loss_policy + q_coef * loss_q - ent_coef * entropy

        if trust_region:
            g = tf.gradients(- (loss_policy - ent_coef * entropy) * nsteps * nenvs, f) #[nenvs * nsteps, nact]
            # k = tf.gradients(KL(f_pol || f), f)
            k = - f_pol / (f + eps) #[nenvs * nsteps, nact] # Directly computed gradient of KL divergence wrt f
            k_dot_g = tf.reduce_sum(k * g, axis=-1)
            adj = tf.maximum(0.0, (k_dot_g - delta) / (tf.reduce_sum(tf.square(k), axis=-1) + eps)) #[nenvs * nsteps]

            # Calculate stats (before doing adjustment) for logging.
            avg_norm_k = avg_norm(k)
            avg_norm_g = avg_norm(g)
            avg_norm_k_dot_g = tf.reduce_mean(tf.abs(k_dot_g))
            avg_norm_adj = tf.reduce_mean(tf.abs(adj))

            g = g - tf.reshape(adj, [nenvs * nsteps, 1]) * k
            grads_f = -g/(nenvs*nsteps) # These are trust region adjusted gradients wrt f, i.e. statistics of policy pi
            grads_policy = tf.gradients(f, params, grads_f)
            grads_q = tf.gradients(loss_q * q_coef, params)
            grads = [gradient_add(g1, g2, param) for (g1, g2, param) in zip(grads_policy, grads_q, params)]

            avg_norm_grads_f = avg_norm(grads_f) * (nsteps * nenvs)
            norm_grads_q = tf.global_norm(grads_q)
            norm_grads_policy = tf.global_norm(grads_policy)
        else:
            grads = tf.gradients(loss, params)

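        if max_grad_norm is not None:
            grads, norm_grads = tf.clip_by_global_norm(grads, max_grad_norm)
        else:
            # norm_grads is logged below, so it must be defined even without clipping
            norm_grads = tf.global_norm(grads)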
        grads = list(zip(grads, params))
        trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=rprop_alpha, epsilon=rprop_epsilon)
        _opt_op = trainer.apply_gradients(grads)

        # so when you call _train, you first do the gradient step, then you apply ema
        with tf.control_dependencies([_opt_op]):
            _train = tf.group(ema_apply_op)

        lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)

        # Ops/Summaries to run, and their names for logging
        run_ops = [_train, loss, loss_q, entropy, loss_policy, loss_f, loss_bc, ev, norm_grads]
        names_ops = ['loss', 'loss_q', 'entropy', 'loss_policy', 'loss_f', 'loss_bc', 'explained_variance',
                     'norm_grads']
        if trust_region:
            run_ops = run_ops + [norm_grads_q, norm_grads_policy, avg_norm_grads_f, avg_norm_k, avg_norm_g, avg_norm_k_dot_g,
                                 avg_norm_adj]
            names_ops = names_ops + ['norm_grads_q', 'norm_grads_policy', 'avg_norm_grads_f', 'avg_norm_k', 'avg_norm_g',
                                     'avg_norm_k_dot_g', 'avg_norm_adj']

        def train(obs, actions, rewards, dones, mus, states, masks, steps):
            cur_lr = lr.value_steps(steps)
            td_map = {train_model.X: obs, polyak_model.X: obs, A: actions, R: rewards, D: dones, MU: mus, LR: cur_lr}
            if states is not None:
                td_map[train_model.S] = states
                td_map[train_model.M] = masks
                td_map[polyak_model.S] = states
                td_map[polyak_model.M] = masks

            return names_ops, sess.run(run_ops, td_map)[1:]  # strip off _train

        def _step(observation, **kwargs):
            return step_model._evaluate([step_model.action, step_model_p, step_model.state], observation, **kwargs)



        self.train = train
        self.save = functools.partial(save_variables, sess=sess)
        self.load = functools.partial(load_variables, sess=sess)
        self.train_model = train_model
        self.step_model = step_model
        self._step = _step
        self.step = self.step_model.step

        self.initial_state = step_model.initial_state
        tf.global_variables_initializer().run(session=sess)


class Acer():
    def __init__(self, runner, model, buffer, log_interval):
        self.runner = runner
        self.model = model
        self.buffer = buffer
        self.log_interval = log_interval
        self.tstart = None
        self.episode_stats = EpisodeStats(runner.nsteps, runner.nenv)
        self.steps = None

    def call(self, on_policy):
        runner, model, buffer, steps = self.runner, self.model, self.buffer, self.steps
        if on_policy:
            enc_obs, obs, actions, rewards, mus, dones, masks = runner.run()
            self.episode_stats.feed(rewards, dones)
            if buffer is not None:
                buffer.put(enc_obs, actions, rewards, mus, dones, masks)
        else:
            # get obs, actions, rewards, mus, dones from buffer.
            obs, actions, rewards, mus, dones, masks = buffer.get()


        # reshape stuff correctly
        obs = obs.reshape(runner.batch_ob_shape)
        actions = actions.reshape([runner.nbatch])
        rewards = rewards.reshape([runner.nbatch])
        mus = mus.reshape([runner.nbatch, runner.nact])
        dones = dones.reshape([runner.nbatch])
        masks = masks.reshape([runner.batch_ob_shape[0]])

        names_ops, values_ops = model.train(obs, actions, rewards, dones, mus, model.initial_state, masks, steps)

        if on_policy and (int(steps/runner.nbatch) % self.log_interval == 0):
            logger.record_tabular("total_timesteps", steps)
            logger.record_tabular("fps", int(steps/(time.time() - self.tstart)))
            # IMP: In EpisodicLife env, during training, we get done=True at each loss of life, not just at the terminal state.
            # Thus, this is mean until end of life, not end of episode.
            # For true episode rewards, see the monitor files in the log folder.
            logger.record_tabular("mean_episode_length", self.episode_stats.mean_length())
            logger.record_tabular("mean_episode_reward", self.episode_stats.mean_reward())
            for name, val in zip(names_ops, values_ops):
                logger.record_tabular(name, float(val))
            logger.dump_tabular()


def learn(network, env, seed=None, nsteps=20, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
          max_grad_norm=10, lr=7e-4, lrschedule='linear', rprop_epsilon=1e-5, rprop_alpha=0.99, gamma=0.99,
          log_interval=100, buffer_size=50000, replay_ratio=4, replay_start=10000, c=10.0,
          trust_region=True, alpha=0.99, delta=1, load_path=None, **network_kwargs):

    '''
    Main entrypoint for ACER (Actor-Critic with Experience Replay) algorithm (https://arxiv.org/pdf/1611.01224.pdf)
    Train an agent with given network architecture on a given environment using ACER.

    Parameters:
    ----------

    network:            policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
                        specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
                        tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
                        neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
                        See baselines.common/policies.py/lstm for more details on using recurrent nets in policies

    env:                environment. Needs to be vectorized for parallel environment simulation.
                        The environments produced by gym.make can be wrapped using baselines.common.vec_env.DummyVecEnv class.

    seed:               seed passed to set_global_seeds before the model is built (default: None)

    nsteps:             int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
                        nenv is number of environment copies simulated in parallel) (default: 20)

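    nstack:             int, size of the frame stack, i.e. number of the frames passed to the step model. Frames are stacked along channel dimension
                        (last image dimension). This is not an argument of learn: it is read from env.nstack, and an env that is not
                        already a VecFrameStack is wrapped as VecFrameStack(env, 1)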

    total_timesteps:    int, number of timesteps (i.e. number of actions taken in the environment) (default: 80M)

    q_coef:             float, value function loss coefficient in the optimization objective (analog of vf_coef for other actor-critic methods) (default: 0.5)

    ent_coef:           float, policy entropy coefficient in the optimization objective (default: 0.01)

    max_grad_norm:      float, gradient norm clipping coefficient. If set to None, no clipping. (default: 10)

    lr:                 float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)

    lrschedule:         str, learning rate schedule. Must be a key of baselines.a2c.utils.schedules: 'linear', 'constant', 'double_linear_con',
                        'middle_drop' or 'double_middle_drop'. Each schedule maps the fraction of training progress in [0..1] to the
                        fraction of the learning rate (specified as lr) to use (default: 'linear')

    rprop_epsilon:      float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)

    rprop_alpha:        float, RMSProp decay parameter (default: 0.99)

    gamma:              float, reward discounting factor (default: 0.99)

    log_interval:       int, number of updates between logging events (default: 100)

    buffer_size:        int, size of the replay buffer (default: 50k)

    replay_ratio:       int, how many batches of data (on average) to sample from the replay buffer after each batch from the environment (default: 4)

    replay_start:       int, the sampling from the replay buffer does not start until replay buffer has at least that many samples (default: 10k)

    c:                  float, importance weight clipping factor (default: 10)

    trust_region:       bool, whether the algorithm estimates the gradient of the KL divergence between the old and updated policy and uses it to determine the step size (default: True)

    delta:              float, max KL divergence between the old policy and updated policy (default: 1)

    alpha:              float, momentum factor in the Polyak (exponential moving average) averaging of the model parameters (default: 0.99)

    load_path:          str, path to load the model from (default: None)

    **network_kwargs:   keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network.
                        For instance, 'mlp' network architecture has arguments num_hidden and num_layers.

    '''

    print("Running Acer Simple")
    print(locals())
    set_global_seeds(seed)
    if not isinstance(env, VecFrameStack):
        env = VecFrameStack(env, 1)

    policy = build_policy(env, network, estimate_q=True, **network_kwargs)
    nenvs = env.num_envs
    ob_space = env.observation_space
    ac_space = env.action_space

    nstack = env.nstack
    model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps,
                  ent_coef=ent_coef, q_coef=q_coef, gamma=gamma,
                  max_grad_norm=max_grad_norm, lr=lr, rprop_alpha=rprop_alpha, rprop_epsilon=rprop_epsilon,
                  total_timesteps=total_timesteps, lrschedule=lrschedule, c=c,
                  trust_region=trust_region, alpha=alpha, delta=delta)

    if load_path is not None:
        model.load(load_path)

    runner = Runner(env=env, model=model, nsteps=nsteps)
    if replay_ratio > 0:
        buffer = Buffer(env=env, nsteps=nsteps, size=buffer_size)
    else:
        buffer = None
    nbatch = nenvs*nsteps
    acer = Acer(runner, model, buffer, log_interval)
    acer.tstart = time.time()

    for acer.steps in range(0, total_timesteps, nbatch): #nbatch samples, 1 on_policy call and multiple off-policy calls
        acer.call(on_policy=True)
        if replay_ratio > 0 and buffer.has_atleast(replay_start):
            n = np.random.poisson(replay_ratio)
            for _ in range(n):
                acer.call(on_policy=False)  # no simulation steps in this

    return model


================================================
FILE: baselines/acer/buffer.py
================================================
import numpy as np

class Buffer(object):
    # gets obs, actions, rewards, mu's, (states, masks), dones
    def __init__(self, env, nsteps, size=50000):
        self.nenv = env.num_envs
        self.nsteps = nsteps
        # self.nh, self.nw, self.nc = env.observation_space.shape
        self.obs_shape = env.observation_space.shape
        self.obs_dtype = env.observation_space.dtype
        self.ac_dtype = env.action_space.dtype
        self.nc = self.obs_shape[-1]
        self.nstack = env.nstack
        self.nc //= self.nstack
        self.nbatch = self.nenv * self.nsteps
        self.size = size // (self.nsteps)  # Each loc contains nenv * nsteps frames, thus total buffer is nenv * size frames

        # Memory
        self.enc_obs = None
        self.actions = None
        self.rewards = None
        self.mus = None
        self.dones = None
        self.masks = None

        # Size indexes
        self.next_idx = 0
        self.num_in_buffer = 0

    def has_atleast(self, frames):
        # Frames per env, so total (nenv * frames) Frames needed
        # Each buffer loc has nenv * nsteps frames
        return self.num_in_buffer >= (frames // self.nsteps)

    def can_sample(self):
        return self.num_in_buffer > 0

    # Generate stacked frames
    def decode(self, enc_obs, dones):
        # enc_obs has shape [nenvs, nsteps + nstack, nh, nw, nc]
        # dones has shape [nenvs, nsteps]
        # returns stacked obs of shape [nenv, (nsteps + 1), nh, nw, nstack*nc]

        return _stack_obs(enc_obs, dones,
                          nsteps=self.nsteps)

    def put(self, enc_obs, actions, rewards, mus, dones, masks):
        # enc_obs [nenv, (nsteps + nstack), nh, nw, nc]
        # actions, rewards, dones [nenv, nsteps]
        # mus [nenv, nsteps, nact]

        if self.enc_obs is None:
            self.enc_obs = np.empty([self.size] + list(enc_obs.shape), dtype=self.obs_dtype)
            self.actions = np.empty([self.size] + list(actions.shape), dtype=self.ac_dtype)
            self.rewards = np.empty([self.size] + list(rewards.shape), dtype=np.float32)
            self.mus = np.empty([self.size] + list(mus.shape), dtype=np.float32)
            # np.bool_ is used because the np.bool alias was removed in recent NumPy releases
            self.dones = np.empty([self.size] + list(dones.shape), dtype=np.bool_)
            self.masks = np.empty([self.size] + list(masks.shape), dtype=np.bool_)

        self.enc_obs[self.next_idx] = enc_obs
        self.actions[self.next_idx] = actions
        self.rewards[self.next_idx] = rewards
        self.mus[self.next_idx] = mus
        self.dones[self.next_idx] = dones
        self.masks[self.next_idx] = masks

        self.next_idx = (self.next_idx + 1) % self.size
        self.num_in_buffer = min(self.size, self.num_in_buffer + 1)

    def take(self, x, idx, envx):
        nenv = self.nenv
        out = np.empty([nenv] + list(x.shape[2:]), dtype=x.dtype)
        for i in range(nenv):
            out[i] = x[idx[i], envx[i]]
        return out

    def get(self):
        # returns
        # obs [nenv, (nsteps + 1), nh, nw, nstack*nc]
        # actions, rewards, dones [nenv, nsteps]
        # mus [nenv, nsteps, nact]
        nenv = self.nenv
        assert self.can_sample()

        # Sample exactly one id per env. If you sample across envs, then higher correlation in samples from same env.
        idx = np.random.randint(0, self.num_in_buffer, nenv)
        envx = np.arange(nenv)

        take = lambda x: self.take(x, idx, envx)  # pick, for each env i, the sampled buffer slot idx[i]
        dones = take(self.dones)
        enc_obs = take(self.enc_obs)
        obs = self.decode(enc_obs, dones)
        actions = take(self.actions)
        rewards = take(self.rewards)
        mus = take(self.mus)
        masks = take(self.masks)
        return obs, actions, rewards, mus, dones, masks



def _stack_obs_ref(enc_obs, dones, nsteps):
    nenv = enc_obs.shape[0]
    nstack = enc_obs.shape[1] - nsteps
    nh, nw, nc = enc_obs.shape[2:]
    obs_dtype = enc_obs.dtype
    obs_shape = (nh, nw, nc*nstack)

    mask = np.empty([nsteps + nstack - 1, nenv, 1, 1, 1], dtype=np.float32)
    obs = np.zeros([nstack, nsteps + nstack, nenv, nh, nw, nc], dtype=obs_dtype)
    x = np.reshape(enc_obs, [nenv, nsteps + nstack, nh, nw, nc]).swapaxes(1, 0)  # [nsteps + nstack, nenv, nh, nw, nc]

    mask[nstack-1:] = np.reshape(1.0 - dones, [nenv, nsteps, 1, 1, 1]).swapaxes(1, 0)  # keep
    mask[:nstack-1] = 1.0

    # y = np.reshape(1 - dones, [nenvs, nsteps, 1, 1, 1])
    for i in range(nstack):
        obs[-(i + 1), i:] = x
        # obs[:,i:,:,:,-(i+1),:] = x
        x = x[:-1] * mask
        mask = mask[1:]

    return np.reshape(obs[:, (nstack-1):].transpose((2, 1, 3, 4, 0, 5)), (nenv, (nsteps + 1)) + obs_shape)

def _stack_obs(enc_obs, dones, nsteps):
    nenv = enc_obs.shape[0]
    nstack = enc_obs.shape[1] - nsteps
    nc = enc_obs.shape[-1]

    obs_ = np.zeros((nenv, nsteps + 1) + enc_obs.shape[2:-1] + (enc_obs.shape[-1] * nstack, ), dtype=enc_obs.dtype)
    mask = np.ones((nenv, nsteps+1), dtype=enc_obs.dtype)
    mask[:, 1:] = 1.0 - dones
    mask = mask.reshape(mask.shape + tuple(np.ones(len(enc_obs.shape)-2, dtype=np.uint8)))

    for i in range(nstack-1, -1, -1):
        obs_[..., i * nc : (i + 1) * nc] = enc_obs[:, i : i + nsteps + 1, :]
        if i < nstack-1:
            obs_[..., i * nc : (i + 1) * nc] *= mask
            mask[:, 1:, ...] *= mask[:, :-1, ...]

    return obs_

def test_stack_obs():
    nstack = 7
    nenv = 1
    nsteps = 5

    obs_shape = (2, 3, nstack)

    enc_obs_shape = (nenv, nsteps + nstack) + obs_shape[:-1] + (1,)
    enc_obs = np.random.random(enc_obs_shape)
    dones = np.random.randint(low=0, high=2, size=(nenv, nsteps))

    stacked_obs_ref = _stack_obs_ref(enc_obs, dones, nsteps=nsteps)
    stacked_obs_test = _stack_obs(enc_obs, dones, nsteps=nsteps)

    np.testing.assert_allclose(stacked_obs_ref, stacked_obs_test)


================================================
FILE: baselines/acer/defaults.py
================================================
def atari():
    return dict(
        lrschedule='constant'
    )


================================================
FILE: baselines/acer/policies.py
================================================
import numpy as np
import tensorflow as tf
from baselines.common.policies import nature_cnn
from baselines.a2c.utils import fc, batch_to_seq, seq_to_batch, lstm, sample


class AcerCnnPolicy(object):

    def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False):
        nbatch = nenv * nsteps
        nh, nw, nc = ob_space.shape
        ob_shape = (nbatch, nh, nw, nc * nstack)
        nact = ac_space.n
        X = tf.placeholder(tf.uint8, ob_shape)  # obs
        with tf.variable_scope("model", reuse=reuse):
            h = nature_cnn(X)
            pi_logits = fc(h, 'pi', nact, init_scale=0.01)
            pi = tf.nn.softmax(pi_logits)
            q = fc(h, 'q', nact)

        a = sample(pi_logits)  # sample() is given logits here, matching AcerLstmPolicy below
        self.initial_state = []  # not stateful
        self.X = X
        self.pi = pi  # actual policy params now
        self.pi_logits = pi_logits
        self.q = q
        self.vf = q

        def step(ob, *args, **kwargs):
            # returns actions, mus, states
            a0, pi0 = sess.run([a, pi], {X: ob})
            return a0, pi0, []  # dummy state

        def out(ob, *args, **kwargs):
            pi0, q0 = sess.run([pi, q], {X: ob})
            return pi0, q0

        def act(ob, *args, **kwargs):
            return sess.run(a, {X: ob})

        self.step = step
        self.out = out
        self.act = act

class AcerLstmPolicy(object):

    def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False, nlstm=256):
        nbatch = nenv * nsteps
        nh, nw, nc = ob_space.shape
        ob_shape = (nbatch, nh, nw, nc * nstack)
        nact = ac_space.n
        X = tf.placeholder(tf.uint8, ob_shape)  # obs
        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
        with tf.variable_scope("model", reuse=reuse):
            h = nature_cnn(X)

            # lstm
            xs = batch_to_seq(h, nenv, nsteps)
            ms = batch_to_seq(M, nenv, nsteps)
            h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
            h5 = seq_to_batch(h5)

            pi_logits = fc(h5, 'pi', nact, init_scale=0.01)
            pi = tf.nn.softmax(pi_logits)
            q = fc(h5, 'q', nact)

        a = sample(pi_logits)  # could change this to use self.pi instead
        self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
        self.X = X
        self.M = M
        self.S = S
        self.pi = pi  # actual policy params now
        self.q = q

        def step(ob, state, mask, *args, **kwargs):
            # returns actions, mus, states
            a0, pi0, s = sess.run([a, pi, snew], {X: ob, S: state, M: mask})
            return a0, pi0, s

        self.step = step


================================================
FILE: baselines/acer/runner.py
================================================
import numpy as np
from baselines.common.runners import AbstractEnvRunner
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from gym import spaces


class Runner(AbstractEnvRunner):

    def __init__(self, env, model, nsteps):
        super().__init__(env=env, model=model, nsteps=nsteps)
        assert isinstance(env.action_space, spaces.Discrete), 'This ACER implementation works only with discrete action spaces!'
        assert isinstance(env, VecFrameStack)

        self.nact = env.action_space.n
        nenv = self.nenv
        self.nbatch = nenv * nsteps
        self.batch_ob_shape = (nenv*(nsteps+1),) + env.observation_space.shape

        self.obs = env.reset()
        self.obs_dtype = env.observation_space.dtype
        self.ac_dtype = env.action_space.dtype
        self.nstack = self.env.nstack
        self.nc = self.batch_ob_shape[-1] // self.nstack


    def run(self):
        # enc_obs = np.split(self.obs, self.nstack, axis=3)  # so now list of obs steps
        enc_obs = np.split(self.env.stackedobs, self.env.nstack, axis=-1)
        mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
        for _ in range(self.nsteps):
            actions, mus, states = self.model._step(self.obs, S=self.states, M=self.dones)
            mb_obs.append(np.copy(self.obs))
            mb_actions.append(actions)
            mb_mus.append(mus)
            mb_dones.append(self.dones)
            obs, rewards, dones, _ = self.env.step(actions)
            # state information for stateful models such as LSTMs
            self.states = states
            self.dones = dones
            self.obs = obs
            mb_rewards.append(rewards)
            enc_obs.append(obs[..., -self.nc:])
        mb_obs.append(np.copy(self.obs))
        mb_dones.append(self.dones)

        enc_obs = np.asarray(enc_obs, dtype=self.obs_dtype).swapaxes(1, 0)
        mb_obs = np.asarray(mb_obs, dtype=self.obs_dtype).swapaxes(1, 0)
        mb_actions = np.asarray(mb_actions, dtype=self.ac_dtype).swapaxes(1, 0)
        mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
        mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)

        mb_dones = np.asarray(mb_dones, dtype=bool).swapaxes(1, 0)

        mb_masks = mb_dones  # Used by stateful models such as LSTMs to mask the state when an episode is done
        mb_dones = mb_dones[:, 1:] # Used for calculating returns. The dones array is now aligned with rewards

        # shapes are now [nenv, nsteps, []]
        # When pulling from buffer, arrays will now be reshaped in place, preventing a deep copy.

        return enc_obs, mb_obs, mb_actions, mb_rewards, mb_mus, mb_dones, mb_masks



================================================
FILE: baselines/acktr/README.md
================================================
# ACKTR

- Original paper: https://arxiv.org/abs/1708.05144
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.run --alg=acktr --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)
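
As a rough sanity check on the frame budget quoted above, the sketch below relates emulator frames, agent timesteps, and training updates. The helper `training_budget` is hypothetical (it is not part of baselines), the frame skip of 4 is an assumption about the Atari preprocessing, and `nenvs=32`, `nsteps=20` are illustrative values. The update count mirrors the `total_timesteps // (nenvs * nsteps)` loop bound used by `learn` in `acktr.py`.

```python
# Hypothetical helper, not part of baselines: relates emulator frames,
# agent timesteps, and the number of training updates.
def training_budget(total_frames, frame_skip=4, nenvs=32, nsteps=20):
    """Return (timesteps, updates) for a given emulator frame budget.

    frame_skip=4 is an assumption about the Atari wrappers;
    nenvs and nsteps are illustrative values, not measured settings.
    """
    timesteps = total_frames // frame_skip  # one agent step per frame_skip frames
    nbatch = nenvs * nsteps                 # samples consumed by one update
    updates = timesteps // nbatch           # same bound as total_timesteps // nbatch
    return timesteps, updates


timesteps, updates = training_budget(40_000_000)
print(timesteps, updates)
```

Under these assumptions, a 40M-frame run corresponds to 10M timesteps and 15,625 updates.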

## ACKTR with continuous action spaces
The ACKTR code has been refactored to handle discrete and continuous action spaces uniformly. In the original version, they were handled by separate code (acktr_disc.py and acktr_cont.py) with little overlap. If you are interested in the original version of ACKTR for continuous action spaces, use the `old_acktr_cont` branch. Note that the original code performs better on the MuJoCo tasks than the refactored version; we are still investigating why.


================================================
FILE: baselines/acktr/__init__.py
================================================


================================================
FILE: baselines/acktr/acktr.py
================================================
import os.path as osp
import time
import functools
import tensorflow as tf
from baselines import logger

from baselines.common import set_global_seeds, explained_variance
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables, load_variables

from baselines.a2c.runner import Runner
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.acktr import kfac
from baselines.ppo2.ppo2 import safemean
from collections import deque


class Model(object):

    def __init__(self, policy, ob_space, ac_space, nenvs, total_timesteps, nprocs=32, nsteps=20,
                 ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
                 kfac_clip=0.001, lrschedule='linear', is_async=True):

        self.sess = sess = get_session()
        nbatch = nenvs * nsteps
        with tf.variable_scope('acktr_model', reuse=tf.AUTO_REUSE):
            self.model = step_model = policy(nenvs, 1, sess=sess)
            self.model2 = train_model = policy(nenvs*nsteps, nsteps, sess=sess)

        A = train_model.pdtype.sample_placeholder([None])
        ADV = tf.placeholder(tf.float32, [nbatch])
        R = tf.placeholder(tf.float32, [nbatch])
        PG_LR = tf.placeholder(tf.float32, [])
        VF_LR = tf.placeholder(tf.float32, [])

        neglogpac = train_model.pd.neglogp(A)
        self.logits = train_model.pi

        ##training loss
        pg_loss = tf.reduce_mean(ADV*neglogpac)
        entropy = tf.reduce_mean(train_model.pd.entropy())
        pg_loss = pg_loss - ent_coef * entropy
        vf_loss = tf.losses.mean_squared_error(tf.squeeze(train_model.vf), R)
        train_loss = pg_loss + vf_coef * vf_loss


        ##Fisher loss construction
        self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(neglogpac)
        sample_net = train_model.vf + tf.random_normal(tf.shape(train_model.vf))
        self.vf_fisher = vf_fisher_loss = - vf_fisher_coef*tf.reduce_mean(tf.pow(train_model.vf - tf.stop_gradient(sample_net), 2))
        self.joint_fisher = joint_fisher_loss = pg_fisher_loss + vf_fisher_loss

        self.params = params = find_trainable_variables("acktr_model")

        self.grads_check = grads = tf.gradients(train_loss, params)

        with tf.device('/gpu:0'):
            self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,
                momentum=0.9, kfac_update=1, epsilon=0.01,
                stats_decay=0.99, is_async=is_async, cold_iter=10, max_grad_norm=max_grad_norm)

            # update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
            optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
            train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
        self.q_runner = q_runner
        self.lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)

        def train(obs, states, rewards, masks, actions, values):
            advs = rewards - values
            # advance the learning-rate schedule once per sample in the batch
            for step in range(len(obs)):
                cur_lr = self.lr.value()

            td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, PG_LR:cur_lr, VF_LR:cur_lr}
            if states is not None:
                td_map[train_model.S] = states
                td_map[train_model.M] = masks

            policy_loss, value_loss, policy_entropy, _ = sess.run(
                [pg_loss, vf_loss, entropy, train_op],
                td_map
            )
            return policy_loss, value_loss, policy_entropy


        self.train = train
        self.save = functools.partial(save_variables, sess=sess)
        self.load = functools.partial(load_variables, sess=sess)
        self.train_model = train_model
        self.step_model = step_model
        self.step = step_model.step
        self.value = step_model.value
        self.initial_state = step_model.initial_state
        tf.global_variables_initializer().run(session=sess)

def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=100, nprocs=32, nsteps=20,
                 ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
                 kfac_clip=0.001, save_interval=None, lrschedule='linear', load_path=None, is_async=True, **network_kwargs):
    set_global_seeds(seed)


    if network == 'cnn':
        network_kwargs['one_dim_bias'] = True

    policy = build_policy(env, network, **network_kwargs)

    nenvs = env.num_envs
    ob_space = env.observation_space
    ac_space = env.action_space
    make_model = lambda: Model(policy, ob_space, ac_space, nenvs, total_timesteps, nprocs=nprocs,
                               nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
                               vf_fisher_coef=vf_fisher_coef, lr=lr, max_grad_norm=max_grad_norm,
                               kfac_clip=kfac_clip, lrschedule=lrschedule, is_async=is_async)
    if save_interval and logger.get_dir():
        import cloudpickle
        with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
            fh.write(cloudpickle.dumps(make_model))
    model = make_model()

    if load_path is not None:
        model.load(load_path)

    runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
    epinfobuf = deque(maxlen=100)
    nbatch = nenvs*nsteps
    tstart = time.time()
    coord = tf.train.Coordinator()
    if is_async:
        enqueue_threads = model.q_runner.create_threads(model.sess, coord=coord, start=True)
    else:
        enqueue_threads = []

    for update in range(1, total_timesteps//nbatch+1):
        obs, states, rewards, masks, actions, values, epinfos = runner.run()
        epinfobuf.extend(epinfos)
        policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
        model.old_obs = obs
        nseconds = time.time()-tstart
        fps = int((update*nbatch)/nseconds)
        if update % log_interval == 0 or update == 1:
            ev = explained_variance(values, rewards)
            logger.record_tabular("nupdates", update)
            logger.record_tabular("total_timesteps", update*nbatch)
            logger.record_tabular("fps", fps)
            logger.record_tabular("policy_entropy", float(policy_entropy))
            logger.record_tabular("policy_loss", float(policy_loss))
            logger.record_tabular("value_loss", float(value_loss))
            logger.record_tabular("explained_variance", float(ev))
            logger.record_tabular("eprewmean", safemean([epinfo['r'] for epinfo in epinfobuf]))
            logger.record_tabular("eplenmean", safemean([epinfo['l'] for epinfo in epinfobuf]))
            logger.dump_tabular()

        if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
            savepath = osp.join(logger.get_dir(), 'checkpoint%.5i'%update)
            print('Saving to', savepath)
            model.save(savepath)
    coord.request_stop()
    coord.join(enqueue_threads)
    return model


================================================
FILE: baselines/acktr/defaults.py
================================================
def mujoco():
    return dict(
        nsteps=2500,
        value_network='copy'
    )


================================================
FILE: baselines/acktr/kfac.py
================================================
import tensorflow as tf
import numpy as np
import re

 # flake8: noqa F403, F405
from baselines.acktr.kfac_utils import *
from functools import reduce

KFAC_OPS = ['MatMul', 'Conv2D', 'BiasAdd']
KFAC_DEBUG = False


class KfacOptimizer():
    # note that KfacOptimizer will be truly synchronous (and thus deterministic) only if a single-threaded session is used
    def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, is_async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
        self.max_grad_norm = max_grad_norm
        self._lr = learning_rate
        self._momentum = momentum
        self._clip_kl = clip_kl
        self._channel_fac = channel_fac
        self._kfac_update = kfac_update
        self._async = is_async
        self._async_stats = async_stats
        self._epsilon = epsilon
        self._stats_decay = stats_decay
        self._blockdiag_bias = blockdiag_bias
        self._approxT2 = approxT2
        self._use_float64 = use_float64
        self._factored_damping = factored_damping
        self._cold_iter = cold_iter
        if cold_lr is None:
            # good heuristics
            self._cold_lr = self._lr# * 3.
        else:
            self._cold_lr = cold_lr
        self._stats_accum_iter = stats_accum_iter
        self._weight_decay_dict = weight_decay_dict
        self._diag_init_coeff = 0.
        self._full_stats_init = full_stats_init
        if not self._full_stats_init:
            self._stats_accum_iter = self._cold_iter

        self.sgd_step = tf.Variable(0, name='KFAC/sgd_step', trainable=False)
        self.global_step = tf.Variable(
            0, name='KFAC/global_step', trainable=False)
        self.cold_step = tf.Variable(0, name='KFAC/cold_step', trainable=False)
        self.factor_step = tf.Variable(
            0, name='KFAC/factor_step', trainable=False)
        self.stats_step = tf.Variable(
            0, name='KFAC/stats_step', trainable=False)
        self.vFv = tf.Variable(0., name='KFAC/vFv', trainable=False)

        self.factors = {}
        self.param_vars = []
        self.stats = {}
        self.stats_eigen = {}

    def getFactors(self, g, varlist):
        graph = tf.get_default_graph()
        factorTensors = {}
        fpropTensors = []
        bpropTensors = []
        opTypes = []
        fops = []

        def searchFactors(gradient, graph):
            # hard-coded search strategy
            bpropOp = gradient.op
            bpropOp_name = bpropOp.name

            bTensors = []
            fTensors = []

            # combining additive gradients; assume they are the same op type and
            # independent
            if 'AddN' in bpropOp_name:
                factors = []
                for g in gradient.op.inputs:
                    factors.append(searchFactors(g, graph))
                op_names = [item['opName'] for item in factors]
                # TO-DO: need to check all the attribute of the ops as well
                print (gradient.name)
                print (op_names)
                print (len(np.unique(op_names)))
                assert len(np.unique(op_names)) == 1, gradient.name + \
                    ' is shared among different computation OPs'

                bTensors = reduce(lambda x, y: x + y,
                                  [item['bpropFactors'] for item in factors])
                if len(factors[0]['fpropFactors']) > 0:
                    fTensors = reduce(
                        lambda x, y: x + y, [item['fpropFactors'] for item in factors])
                fpropOp_name = op_names[0]
                fpropOp = factors[0]['op']
            else:
                fpropOp_name = re.search(
                    'gradientsSampled(_[0-9]+|)/(.+?)_grad', bpropOp_name).group(2)
                fpropOp = graph.get_operation_by_name(fpropOp_name)
                if fpropOp.op_def.name in KFAC_OPS:
                    # Known OPs
                    ###
                    bTensor = [
                        i for i in bpropOp.inputs if 'gradientsSampled' in i.name][-1]
                    bTensorShape = fpropOp.outputs[0].get_shape()
                    if bTensor.get_shape()[0].value is None:
                        bTensor.set_shape(bTensorShape)
                    bTensors.append(bTensor)
                    ###
                    if fpropOp.op_def.name == 'BiasAdd':
                        fTensors = []
                    else:
                        fTensors.append(
                            [i for i in fpropOp.inputs if param.op.name not in i.name][0])
                    fpropOp_name = fpropOp.op_def.name
                else:
                    # unknown OPs, block approximation used
                    bInputsList = [i for i in bpropOp.inputs[
                        0].op.inputs if 'gradientsSampled' in i.name if 'Shape' not in i.name]
                    if len(bInputsList) > 0:
                        bTensor = bInputsList[0]
                        bTensorShape = fpropOp.outputs[0].get_shape()
                        if len(bTensor.get_shape()) > 0 and bTensor.get_shape()[0].value is None:
                            bTensor.set_shape(bTensorShape)
                        bTensors.append(bTensor)
                    # list.append returns None, so build the name first and then record it
                    fpropOp_name = 'UNK-' + fpropOp.op_def.name
                    opTypes.append(fpropOp_name)

            return {'opName': fpropOp_name, 'op': fpropOp, 'fpropFactors': fTensors, 'bpropFactors': bTensors}

        for t, param in zip(g, varlist):
            if KFAC_DEBUG:
                print(('get factor for '+param.name))
            factors = searchFactors(t, graph)
            factorTensors[param] = factors

        ########
        # check associated weights and bias for homogeneous coordinate representation
        # and check redundant factors
        # TO-DO: there may be a bug in detecting associated bias and weights for
        # forking layers, e.g. in inception models.
        for param in varlist:
            factorTensors[param]['assnWeights'] = None
            factorTensors[param]['assnBias'] = None
        for param in varlist:
            if factorTensors[param]['opName'] == 'BiasAdd':
                factorTensors[param]['assnWeights'] = None
                for item in varlist:
                    if len(factorTensors[item]['bpropFactors']) > 0:
                        if (set(factorTensors[item]['bpropFactors']) == set(factorTensors[param]['bpropFactors'])) and (len(factorTensors[item]['fpropFactors']) > 0):
                            factorTensors[param]['assnWeights'] = item
                            factorTensors[item]['assnBias'] = param
                            factorTensors[param]['bpropFactors'] = factorTensors[
                                item]['bpropFactors']

        ########

        ########
        # concatenate the additive gradients along the batch dimension, i.e.
        # assuming independence structure
        for key in ['fpropFactors', 'bpropFactors']:
            for i, param in enumerate(varlist):
                if len(factorTensors[param][key]) > 0:
                    if (key + '_concat') not in factorTensors[param]:
                        name_scope = factorTensors[param][key][0].name.split(':')[
                            0]
                        with tf.name_scope(name_scope):
                            factorTensors[param][
                                key + '_concat'] = tf.concat(factorTensors[param][key], 0)
                else:
                    factorTensors[param][key + '_concat'] = None
                for j, param2 in enumerate(varlist[(i + 1):]):
                    if (len(factorTensors[param][key]) > 0) and (set(factorTensors[param2][key]) == set(factorTensors[param][key])):
                        factorTensors[param2][key] = factorTensors[param][key]
                        factorTensors[param2][
                            key + '_concat'] = factorTensors[param][key + '_concat']
        ########

        if KFAC_DEBUG:
            for items in zip(varlist, fpropTensors, bpropTensors, opTypes):
                print((items[0].name, factorTensors[items[0]]))
        self.factors = factorTensors
        return factorTensors

    def getStats(self, factors, varlist):
        if len(self.stats) == 0:
            # initialize stats variables on CPU because eigen decomp is
            # computed on CPU
            with tf.device('/cpu'):
                tmpStatsCache = {}

                # search for tensor factors and
                # use block diag approx for the bias units
                for var in varlist:
                    fpropFactor = factors[var]['fpropFactors_concat']
                    bpropFactor = factors[var]['bpropFactors_concat']
                    opType = factors[var]['opName']
                    if opType == 'Conv2D':
                        Kh = var.get_shape()[0]
                        Kw = var.get_shape()[1]
                        C = fpropFactor.get_shape()[-1]

                        Oh = bpropFactor.get_shape()[1]
                        Ow = bpropFactor.get_shape()[2]
                        if Oh == 1 and Ow == 1 and self._channel_fac:
                            # factorization along the channels does not support
                            # homogeneous coordinates
                            var_assnBias = factors[var]['assnBias']
                            if var_assnBias:
                                factors[var]['assnBias'] = None
                                factors[var_assnBias]['assnWeights'] = None
                ##

                for var in varlist:
                    fpropFactor = factors[var]['fpropFactors_concat']
                    bpropFactor = factors[var]['bpropFactors_concat']
                    opType = factors[var]['opName']
                    self.stats[var] = {'opName': opType,
                                       'fprop_concat_stats': [],
                                       'bprop_concat_stats': [],
                                       'assnWeights': factors[var]['assnWeights'],
                                       'assnBias': factors[var]['assnBias'],
                                       }
                    if fpropFactor is not None:
                        if fpropFactor not in tmpStatsCache:
                            if opType == 'Conv2D':
                                Kh = var.get_shape()[0]
                                Kw = var.get_shape()[1]
                                C = fpropFactor.get_shape()[-1]

                                Oh = bpropFactor.get_shape()[1]
                                Ow = bpropFactor.get_shape()[2]
                                if Oh == 1 and Ow == 1 and self._channel_fac:
                                    # factorization along the channels
                                    # assume independence between input channels and spatial
                                    # 2K-1 x 2K-1 covariance matrix and C x C covariance matrix
                                    # factorization along the channels does not
                                    # support homogeneous coordinates, so assnBias
                                    # is always None
                                    fpropFactor2_size = Kh * Kw
                                    slot_fpropFactor_stats2 = tf.Variable(tf.diag(tf.ones(
                                        [fpropFactor2_size])) * self._diag_init_coeff, name='KFAC_STATS/' + fpropFactor.op.name, trainable=False)
                                    self.stats[var]['fprop_concat_stats'].append(
                                        slot_fpropFactor_stats2)

                                    fpropFactor_size = C
                                else:
                                    # 2K-1 x 2K-1 x C x C covariance matrix
                                    # assume BHWC
                                    fpropFactor_size = Kh * Kw * C
                            else:
                                # D x D covariance matrix
                                fpropFactor_size = fpropFactor.get_shape()[-1]

                            # use homogeneous coordinate
                            if not self._blockdiag_bias and self.stats[var]['assnBias']:
                                fpropFactor_size += 1

                            slot_fpropFactor_stats = tf.Variable(tf.diag(tf.ones(
                                [fpropFactor_size])) * self._diag_init_coeff, name='KFAC_STATS/' + fpropFactor.op.name, trainable=False)
                            self.stats[var]['fprop_concat_stats'].append(
                                slot_fpropFactor_stats)
                            if opType != 'Conv2D':
                                tmpStatsCache[fpropFactor] = self.stats[
                                    var]['fprop_concat_stats']
                        else:
                            self.stats[var][
                                'fprop_concat_stats'] = tmpStatsCache[fpropFactor]

                    if bpropFactor is not None:
                        # no need to collect backward stats for bias vectors if
                        # using homogeneous coordinates
                        if not((not self._blockdiag_bias) and self.stats[var]['assnWeights']):
                            if bpropFactor not in tmpStatsCache:
                                slot_bpropFactor_stats = tf.Variable(tf.diag(tf.ones([bpropFactor.get_shape(
                                )[-1]])) * self._diag_init_coeff, name='KFAC_STATS/' + bpropFactor.op.name, trainable=False)
                                self.stats[var]['bprop_concat_stats'].append(
                                    slot_bpropFactor_stats)
                                tmpStatsCache[bpropFactor] = self.stats[
                                    var]['bprop_concat_stats']
                            else:
                                self.stats[var][
                                    'bprop_concat_stats'] = tmpStatsCache[bpropFactor]

        return self.stats

    def compute_and_apply_stats(self, loss_sampled, var_list=None):
        varlist = var_list
        if varlist is None:
            varlist = tf.trainable_variables()

        stats = self.compute_stats(loss_sampled, var_list=varlist)
        return self.apply_stats(stats)

    def compute_stats(self, loss_sampled, var_list=None):
        varlist = var_list
        if varlist is None:
            varlist = tf.trainable_variables()

        gs = tf.gradients(loss_sampled, varlist, name='gradientsSampled')
        self.gs = gs
        factors = self.getFactors(gs, varlist)
        stats = self.getStats(factors, varlist)

        updateOps = []
        statsUpdates = {}
        statsUpdates_cache = {}
        for var in varlist:
            opType = factors[var]['opName']
            fops = factors[var]['op']
            fpropFactor = factors[var]['fpropFactors_concat']
            fpropStats_vars = stats[var]['fprop_concat_stats']
            bpropFactor = factors[var]['bpropFactors_concat']
            bpropStats_vars = stats[var]['bprop_concat_stats']
            SVD_factors = {}
            for stats_var in fpropStats_vars:
                stats_var_dim = int(stats_var.get_shape()[0])
                if stats_var not in statsUpdates_cache:
                    old_fpropFactor = fpropFactor
                    B = (tf.shape(fpropFactor)[0])  # batch size
                    if opType == 'Conv2D':
                        strides = fops.get_attr("strides")
                        padding = fops.get_attr("padding")
                        convkernel_size = var.get_shape()[0:3]

                        KH = int(convkernel_size[0])
                        KW = int(convkernel_size[1])
                        C = int(convkernel_size[2])
                        flatten_size = int(KH * KW * C)

                        Oh = int(bpropFactor.get_shape()[1])
                        Ow = int(bpropFactor.get_shape()[2])

                        if Oh == 1 and Ow == 1 and self._channel_fac:
                            # factorization along the channels
                            # assume independence among input channels
                            # factor = B x 1 x 1 x (KH x KW x C)
                            # patches = B x Oh x Ow x (KH x KW x C)
                            if len(SVD_factors) == 0:
                                if KFAC_DEBUG:
                                    print(('approx %s act factor with rank-1 SVD factors' % (var.name)))
                                # find closest rank-1 approx to the feature map
                                S, U, V = tf.batch_svd(tf.reshape(
                                    fpropFactor, [-1, KH * KW, C]))
                                # get rank-1 approx slices
                                sqrtS1 = tf.expand_dims(tf.sqrt(S[:, 0, 0]), 1)
                                patches_k = U[:, :, 0] * sqrtS1  # B x KH*KW
                                full_factor_shape = fpropFactor.get_shape()
                                patches_k.set_shape(
                                    [full_factor_shape[0], KH * KW])
                                patches_c = V[:, :, 0] * sqrtS1  # B x C
                                patches_c.set_shape([full_factor_shape[0], C])
                                SVD_factors[C] = patches_c
                                SVD_factors[KH * KW] = patches_k
                            fpropFactor = SVD_factors[stats_var_dim]

                        else:
                            # memory-inefficient implementation: materializes
                            # every image patch
                            patches = tf.extract_image_patches(fpropFactor, ksizes=[1, convkernel_size[
                                                               0], convkernel_size[1], 1], strides=strides, rates=[1, 1, 1, 1], padding=padding)

                            if self._approxT2:
                                if KFAC_DEBUG:
                                    print(('approxT2 act fisher for %s' % (var.name)))
                                # T^2 terms * 1/T^2, size: B x (KH*KW*C)
                                fpropFactor = tf.reduce_mean(patches, [1, 2])
                            else:
                                # size: (B x Oh x Ow) x (KH*KW*C)
                                fpropFactor = tf.reshape(
                                    patches, [-1, flatten_size]) / Oh / Ow
                    fpropFactor_size = int(fpropFactor.get_shape()[-1])
                    if stats_var_dim == (fpropFactor_size + 1) and not self._blockdiag_bias:
                        if opType == 'Conv2D' and not self._approxT2:
                            # correct padding for numerical stability (we
                            # divided out OhxOw from activations for T1 approx)
                            fpropFactor = tf.concat([fpropFactor, tf.ones(
                                [tf.shape(fpropFactor)[0], 1]) / Oh / Ow], 1)
                        else:
                            # use homogeneous coordinates
                            fpropFactor = tf.concat(
                                [fpropFactor, tf.ones([tf.shape(fpropFactor)[0], 1])], 1)

                    # average the outer products over the B data points in
                    # the batch
                    cov = tf.matmul(fpropFactor, fpropFactor,
                                    transpose_a=True) / tf.cast(B, tf.float32)
                    updateOps.append(cov)
                    statsUpdates[stats_var] = cov
                    if opType != 'Conv2D':
                        # HACK: for convolution we recompute fprop stats for
                        # every layer including forking layers
                        statsUpdates_cache[stats_var] = cov

            for stats_var in bpropStats_vars:
                stats_var_dim = int(stats_var.get_shape()[0])
                if stats_var not in statsUpdates_cache:
                    old_bpropFactor = bpropFactor
                    bpropFactor_shape = bpropFactor.get_shape()
                    B = tf.shape(bpropFactor)[0]  # batch size
                    C = int(bpropFactor_shape[-1])  # num channels
                    if opType == 'Conv2D' or len(bpropFactor_shape) == 4:
                        if fpropFactor is not None:
                            if self._approxT2:
                                if KFAC_DEBUG:
                                    print(('approxT2 grad fisher for %s' % (var.name)))
                                bpropFactor = tf.reduce_sum(
                                    bpropFactor, [1, 2])  # T^2 terms * 1/T^2
                            else:
                                bpropFactor = tf.reshape(
                                    bpropFactor, [-1, C]) * Oh * Ow  # T * 1/T terms
                        else:
                            # just doing block diag approx. spatial independent
                            # structure does not apply here. summing over
                            # spatial locations
                            if KFAC_DEBUG:
                                print(('block diag approx fisher for %s' % (var.name)))
                            bpropFactor = tf.reduce_sum(bpropFactor, [1, 2])

                    # assume sampled loss is averaged. TO-DO:figure out better
                    # way to handle this
                    bpropFactor *= tf.to_float(B)
                    ##

                    cov_b = tf.matmul(
                        bpropFactor, bpropFactor, transpose_a=True) / tf.to_float(tf.shape(bpropFactor)[0])

                    updateOps.append(cov_b)
                    statsUpdates[stats_var] = cov_b
                    statsUpdates_cache[stats_var] = cov_b

        if KFAC_DEBUG:
            aKey = list(statsUpdates.keys())[0]
            statsUpdates[aKey] = tf.Print(statsUpdates[aKey],
                                          [tf.convert_to_tensor('step:'),
                                           self.global_step,
                                           tf.convert_to_tensor(
                                               'computing stats'),
                                           ])
        self.statsUpdates = statsUpdates
        return statsUpdates

    def apply_stats(self, statsUpdates):
        """ build the op that applies the given stats updates to the running
        average (or accumulates them while still warming up)
        """

        def updateAccumStats():
            if self._full_stats_init:
                return tf.cond(tf.greater(self.sgd_step, self._cold_iter), lambda: tf.group(*self._apply_stats(statsUpdates, accumulate=True, accumulateCoeff=1. / self._stats_accum_iter)), tf.no_op)
            else:
                return tf.group(*self._apply_stats(statsUpdates, accumulate=True, accumulateCoeff=1. / self._stats_accum_iter))

        def updateRunningAvgStats(statsUpdates, fac_iter=1):
            # return tf.cond(tf.greater_equal(self.factor_step,
            # tf.convert_to_tensor(fac_iter)), lambda:
            # tf.group(*self._apply_stats(stats_list, varlist)), tf.no_op)
            return tf.group(*self._apply_stats(statsUpdates))

        if self._async_stats:
            # asynchronous stats update
            update_stats = self._apply_stats(statsUpdates)

            queue = tf.FIFOQueue(1, [item.dtype for item in update_stats], shapes=[
                                 item.get_shape() for item in update_stats])
            enqueue_op = queue.enqueue(update_stats)

            def dequeue_stats_op():
                return queue.dequeue()
            self.qr_stats = tf.train.QueueRunner(queue, [enqueue_op])
            update_stats_op = tf.cond(tf.equal(queue.size(), tf.convert_to_tensor(
                0)), tf.no_op, lambda: tf.group(*[dequeue_stats_op(), ]))
        else:
            # synchronous stats update
            update_stats_op = tf.cond(tf.greater_equal(
                self.stats_step, self._stats_accum_iter), lambda: updateRunningAvgStats(statsUpdates), updateAccumStats)
        self._update_stats_op = update_stats_op
        return update_stats_op

    def _apply_stats(self, statsUpdates, accumulate=False, accumulateCoeff=0.):
        updateOps = []
        # obtain the stats var list
        for stats_var in statsUpdates:
            stats_new = statsUpdates[stats_var]
            if accumulate:
                # simple superbatch averaging
                update_op = tf.assign_add(
                    stats_var, accumulateCoeff * stats_new, use_locking=True)
            else:
                # exponential running averaging
                update_op = tf.assign(
                    stats_var, stats_var * self._stats_decay, use_locking=True)
                update_op = tf.assign_add(
                    update_op, (1. - self._stats_decay) * stats_new, use_locking=True)
            updateOps.append(update_op)

        with tf.control_dependencies(updateOps):
            stats_step_op = tf.assign_add(self.stats_step, 1)

        if KFAC_DEBUG:
            stats_step_op = (tf.Print(stats_step_op,
                                      [tf.convert_to_tensor('step:'),
                                       self.global_step,
                                       tf.convert_to_tensor('fac step:'),
                                       self.factor_step,
                                       tf.convert_to_tensor('sgd step:'),
                                       self.sgd_step,
                                       tf.convert_to_tensor('Accum:'),
                                       tf.convert_to_tensor(accumulate),
                                       tf.convert_to_tensor('Accum coeff:'),
                                       tf.convert_to_tensor(accumulateCoeff),
                                       tf.convert_to_tensor('stat step:'),
                                       self.stats_step, updateOps[0], updateOps[1]]))
        return [stats_step_op, ]

    def getStatsEigen(self, stats=None):
        if len(self.stats_eigen) == 0:
            stats_eigen = {}
            if stats is None:
                stats = self.stats

            tmpEigenCache = {}
            with tf.device('/cpu:0'):
                for var in stats:
                    for key in ['fprop_concat_stats', 'bprop_concat_stats']:
                        for stats_var in stats[var][key]:
                            if stats_var not in tmpEigenCache:
                                stats_dim = stats_var.get_shape()[1].value
                                e = tf.Variable(tf.ones(
                                    [stats_dim]), name='KFAC_FAC/' + stats_var.name.split(':')[0] + '/e', trainable=False)
                                Q = tf.Variable(tf.diag(tf.ones(
                                    [stats_dim])), name='KFAC_FAC/' + stats_var.name.split(':')[0] + '/Q', trainable=False)
                                stats_eigen[stats_var] = {'e': e, 'Q': Q}
                                tmpEigenCache[
                                    stats_var] = stats_eigen[stats_var]
                            else:
                                stats_eigen[stats_var] = tmpEigenCache[
                                    stats_var]
            self.stats_eigen = stats_eigen
        return self.stats_eigen

    def computeStatsEigen(self):
        """ compute the eigen decomp using copied var stats to avoid concurrent read/write from other queue """
        # TO-DO: figure out why this op has delays (possibly moving
        # eigenvectors around?)
        with tf.device('/cpu:0'):
            def removeNone(tensor_list):
                local_list = []
                for item in tensor_list:
                    if item is not None:
                        local_list.append(item)
                return local_list

            def copyStats(var_list):
                print("copying stats to buffer tensors before eigen decomp")
                redundant_stats = {}
                copied_list = []
                for item in var_list:
                    if item is not None:
                        if item not in redundant_stats:
                            if self._use_float64:
                                redundant_stats[item] = tf.cast(
                                    tf.identity(item), tf.float64)
                            else:
                                redundant_stats[item] = tf.identity(item)
                        copied_list.append(redundant_stats[item])
                    else:
                        copied_list.append(None)
                return copied_list
            #stats = [copyStats(self.fStats), copyStats(self.bStats)]
            #stats = [self.fStats, self.bStats]

            stats_eigen = self.stats_eigen
            computedEigen = {}
            eigen_reverse_lookup = {}
            updateOps = []
            # sync copied stats
            # with tf.control_dependencies(removeNone(stats[0]) +
            # removeNone(stats[1])):
            with tf.control_dependencies([]):
                for stats_var in stats_eigen:
                    if stats_var not in computedEigen:
                        eigens = tf.self_adjoint_eig(stats_var)
                        e = eigens[0]
                        Q = eigens[1]
                        if self._use_float64:
                            e = tf.cast(e, tf.float32)
                            Q = tf.cast(Q, tf.float32)
                        updateOps.append(e)
                        updateOps.append(Q)
                        computedEigen[stats_var] = {'e': e, 'Q': Q}
                        eigen_reverse_lookup[e] = stats_eigen[stats_var]['e']
                        eigen_reverse_lookup[Q] = stats_eigen[stats_var]['Q']

            self.eigen_reverse_lookup = eigen_reverse_lookup
            self.eigen_update_list = updateOps

            if KFAC_DEBUG:
                self.eigen_update_list = [item for item in updateOps]
                with tf.control_dependencies(updateOps):
                    updateOps.append(tf.Print(tf.constant(
                        0.), [tf.convert_to_tensor('computed factor eigen')]))

        return updateOps

    def applyStatsEigen(self, eigen_list):
        updateOps = []
        print(('updating %d eigenvalue/vectors' % len(eigen_list)))
        for i, (tensor, mark) in enumerate(zip(eigen_list, self.eigen_update_list)):
            stats_eigen_var = self.eigen_reverse_lookup[mark]
            updateOps.append(
                tf.assign(stats_eigen_var, tensor, use_locking=True))

        with tf.control_dependencies(updateOps):
            factor_step_op = tf.assign_add(self.factor_step, 1)
            updateOps.append(factor_step_op)
            if KFAC_DEBUG:
                updateOps.append(tf.Print(tf.constant(
                    0.), [tf.convert_to_tensor('updated kfac factors')]))
        return updateOps

    def getKfacPrecondUpdates(self, gradlist, varlist):
        updatelist = []
        vg = 0.

        assert len(self.stats) > 0
        assert len(self.stats_eigen) > 0
        assert len(self.factors) > 0
        counter = 0

        grad_dict = {var: grad for grad, var in zip(gradlist, varlist)}

        for grad, var in zip(gradlist, varlist):
            GRAD_RESHAPE = False
            GRAD_TRANSPOSE = False

            fpropFactoredFishers = self.stats[var]['fprop_concat_stats']
            bpropFactoredFishers = self.stats[var]['bprop_concat_stats']

            if (len(fpropFactoredFishers) + len(bpropFactoredFishers)) > 0:
                counter += 1
                GRAD_SHAPE = grad.get_shape()
                if len(grad.get_shape()) > 2:
                    # reshape conv kernel parameters
                    # (TF conv kernels are laid out as [KH, KW, C, D])
                    KH = int(grad.get_shape()[0])
                    KW = int(grad.get_shape()[1])
                    C = int(grad.get_shape()[2])
                    D = int(grad.get_shape()[3])

                    if len(fpropFactoredFishers) > 1 and self._channel_fac:
                        # reshape conv kernel parameters into tensor
                        grad = tf.reshape(grad, [KW * KH, C, D])
                    else:
                        # reshape conv kernel parameters into 2D grad
                        grad = tf.reshape(grad, [-1, D])
                    GRAD_RESHAPE = True
                elif len(grad.get_shape()) == 1:
                    # reshape bias or 1D parameters
                    D = int(grad.get_shape()[0])

                    grad = tf.expand_dims(grad, 0)
                    GRAD_RESHAPE = True
                else:
                    # 2D parameters
                    C = int(grad.get_shape()[0])
                    D = int(grad.get_shape()[1])

                if (self.stats[var]['assnBias'] is not None) and not self._blockdiag_bias:
                    # homogeneous coordinates only work for 2D grads.
                    # TO-DO: figure out how to factorize bias grad
                    # stack bias grad
                    var_assnBias = self.stats[var]['assnBias']
                    grad = tf.concat(
                        [grad, tf.expand_dims(grad_dict[var_assnBias], 0)], 0)

                # project gradient to eigen space and reshape the eigenvalues
                # for broadcasting
                eigVals = []

                for idx, stats in enumerate(self.stats[var]['fprop_concat_stats']):
                    Q = self.stats_eigen[stats]['Q']
                    e = detectMinVal(self.stats_eigen[stats][
                                     'e'], var, name='act', debug=KFAC_DEBUG)

                    Q, e = factorReshape(Q, e, grad, facIndx=idx, ftype='act')
                    eigVals.append(e)
                    grad = gmatmul(Q, grad, transpose_a=True, reduce_dim=idx)

                for idx, stats in enumerate(self.stats[var]['bprop_concat_stats']):
                    Q = self.stats_eigen[stats]['Q']
                    e = detectMinVal(self.stats_eigen[stats][
                                     'e'], var, name='grad', debug=KFAC_DEBUG)

                    Q, e = factorReshape(Q, e, grad, facIndx=idx, ftype='grad')
                    eigVals.append(e)
                    grad = gmatmul(grad, Q, transpose_b=False, reduce_dim=idx)
                ##

                #####
                # whiten using eigenvalues
                weightDecayCoeff = 0.
                if var in self._weight_decay_dict:
                    weightDecayCoeff = self._weight_decay_dict[var]
                    if KFAC_DEBUG:
                        print(('weight decay coeff for %s is %f' % (var.name, weightDecayCoeff)))

                if self._factored_damping:
                    if KFAC_DEBUG:
                        print(('use factored damping for %s' % (var.name)))
                    coeffs = 1.
                    num_factors = len(eigVals)
                    # compute the ratio of the trace norms of the left and
                    # right KFac matrices, and their generalization
                    if len(eigVals) == 1:
                        damping = self._epsilon + weightDecayCoeff
                    else:
                        damping = tf.pow(
                            self._epsilon + weightDecayCoeff, 1. / num_factors)
                    eigVals_tnorm_avg = [tf.reduce_mean(
                        tf.abs(e)) for e in eigVals]
                    for e, e_tnorm in zip(eigVals, eigVals_tnorm_avg):
                        eig_tnorm_negList = [
                            item for item in eigVals_tnorm_avg if item != e_tnorm]
                        if len(eigVals) == 1:
                            adjustment = 1.
                        elif len(eigVals) == 2:
                            adjustment = tf.sqrt(
                                e_tnorm / eig_tnorm_negList[0])
                        else:
                            eig_tnorm_negList_prod = reduce(
                                lambda x, y: x * y, eig_tnorm_negList)
                            adjustment = tf.pow(
                                tf.pow(e_tnorm, num_factors - 1.) / eig_tnorm_negList_prod, 1. / num_factors)
                        coeffs *= (e + adjustment * damping)
                else:
                    coeffs = 1.
                    damping = (self._epsilon + weightDecayCoeff)
                    for e in eigVals:
                        coeffs *= e
                    coeffs += damping

                #grad = tf.Print(grad, [tf.convert_to_tensor('1'), tf.convert_to_tensor(var.name), grad.get_shape()])

                grad /= coeffs

                #grad = tf.Print(grad, [tf.convert_to_tensor('2'), tf.convert_to_tensor(var.name), grad.get_shape()])
                #####
                # project gradient back to euclidean space
                for idx, stats in enumerate(self.stats[var]['fprop_concat_stats']):
                    Q = self.stats_eigen[stats]['Q']
                    grad = gmatmul(Q, grad, transpose_a=False, reduce_dim=idx)

                for idx, stats in enumerate(self.stats[var]['bprop_concat_stats']):
                    Q = self.stats_eigen[stats]['Q']
                    grad = gmatmul(grad, Q, transpose_b=True, reduce_dim=idx)
                ##

                #grad = tf.Print(grad, [tf.convert_to_tensor('3'), tf.convert_to_tensor(var.name), grad.get_shape()])
                if (self.stats[var]['assnBias'] is not None) and not self._blockdiag_bias:
                    # homogeneous coordinates only work for 2D grads.
                    # TO-DO: figure out how to factorize bias grad
                    # un-stack bias grad
                    var_assnBias = self.stats[var]['assnBias']
                    C_plus_one = int(grad.get_shape()[0])
                    grad_assnBias = tf.reshape(tf.slice(grad,
                                                        begin=[
                                                            C_plus_one - 1, 0],
                                                        size=[1, -1]), var_assnBias.get_shape())
                    grad_assnWeights = tf.slice(grad,
                                                begin=[0, 0],
                                                size=[C_plus_one - 1, -1])
                    grad_dict[var_assnBias] = grad_assnBias
                    grad = grad_assnWeights

                #grad = tf.Print(grad, [tf.convert_to_tensor('4'), tf.convert_to_tensor(var.name), grad.get_shape()])
                if GRAD_RESHAPE:
                    grad = tf.reshape(grad, GRAD_SHAPE)

                grad_dict[var] = grad

        print(('projecting %d gradient matrices' % counter))

        for g, var in zip(gradlist, varlist):
            grad = grad_dict[var]
            ### clipping ###
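            if KFAC_DEBUG:
                print(('apply clipping to %s' % (var.name)))
                # tf.Print is an identity op: its output has to be consumed
                # downstream, otherwise the message is never emitted
                grad = tf.Print(grad, [tf.sqrt(tf.reduce_sum(tf.pow(grad, 2)))], "Euclidean norm of new grad")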
            local_vg = tf.reduce_sum(grad * g * (self._lr * self._lr))
            vg += local_vg

        # rescale everything
        if KFAC_DEBUG:
            print('apply vFv clipping')

        scaling = tf.minimum(1., tf.sqrt(self._clip_kl / vg))
        if KFAC_DEBUG:
            scaling = tf.Print(scaling, [tf.convert_to_tensor(
                'clip: '), scaling, tf.convert_to_tensor(' vFv: '), vg])
        with tf.control_dependencies([tf.assign(self.vFv, vg)]):
            updatelist = [grad_dict[var] for var in varlist]
            for i, item in enumerate(updatelist):
                updatelist[i] = scaling * item

        return updatelist

    def compute_gradients(self, loss, var_list=None):
        varlist = var_list
        if varlist is None:
            varlist = tf.trainable_variables()
        g = tf.gradients(loss, varlist)

        return [(a, b) for a, b in zip(g, varlist)]

    def apply_gradients_kfac(self, grads):
        g, varlist = list(zip(*grads))

        if len(self.stats_eigen) == 0:
            self.getStatsEigen()

        qr = None
        # launch eigen-decomp on a queue thread
        if self._async:
            print('Use async eigen decomp')
            # get a list of factor loading tensors
            factorOps_dummy = self.computeStatsEigen()

            # define a queue for the list of factor loading tensors
            queue = tf.FIFOQueue(1, [item.dtype for item in factorOps_dummy], shapes=[
                                 item.get_shape() for item in factorOps_dummy])
            enqueue_op = tf.cond(tf.logical_and(tf.equal(tf.mod(self.stats_step, self._kfac_update), tf.convert_to_tensor(
                0)), tf.greater_equal(self.stats_step, self._stats_accum_iter)), lambda: queue.enqueue(self.computeStatsEigen()), tf.no_op)

            def dequeue_op():
                return queue.dequeue()

            qr = tf.train.QueueRunner(queue, [enqueue_op])

        updateOps = []
        global_step_op = tf.assign_add(self.global_step, 1)
        updateOps.append(global_step_op)

        with tf.control_dependencies([global_step_op]):

            # compute updates
            assert self._update_stats_op is not None
            updateOps.append(self._update_stats_op)
            dependency_list = []
            if not self._async:
                dependency_list.append(self._update_stats_op)

            with tf.control_dependencies(dependency_list):
                def no_op_wrapper():
                    return tf.group(*[tf.assign_add(self.cold_step, 1)])

                if not self._async:
                    # synchronous eigen-decomp updates
                    updateFactorOps = tf.cond(tf.logical_and(tf.equal(tf.mod(self.stats_step, self._kfac_update),
                                                                      tf.convert_to_tensor(0)),
                                                             tf.greater_equal(self.stats_step, self._stats_accum_iter)), lambda: tf.group(*self.applyStatsEigen(self.computeStatsEigen())), no_op_wrapper)
                else:
                    # asynchronous eigen-decomp updates using queue
                    updateFactorOps = tf.cond(tf.greater_equal(self.stats_step, self._stats_accum_iter),
                                              lambda: tf.cond(tf.equal(queue.size(), tf.convert_to_tensor(0)),
                                                              tf.no_op,

                                                              lambda: tf.group(
                                                                  *self.applyStatsEigen(dequeue_op())),
                                                              ),
                                              no_op_wrapper)

                updateOps.append(updateFactorOps)

                with tf.control_dependencies([updateFactorOps]):
                    def gradOp():
                        return list(g)

                    def getKfacGradOp():
                        return self.getKfacPrecondUpdates(g, varlist)
                    u = tf.cond(tf.greater(self.factor_step,
                                           tf.convert_to_tensor(0)), getKfacGradOp, gradOp)

                    optim = tf.train.MomentumOptimizer(
                        self._lr * (1. - self._momentum), self._momentum)
                    #optim = tf.train.AdamOptimizer(self._lr, epsilon=0.01)

                    def optimOp():
                        def updateOptimOp():
                            if self._full_stats_init:
                                return tf.cond(tf.greater(self.factor_step, tf.convert_to_tensor(0)), lambda: optim.apply_gradients(list(zip(u, varlist))), tf.no_op)
                            else:
                                return optim.apply_gradients(list(zip(u, varlist)))
                        if self._full_stats_init:
                            return tf.cond(tf.greater_equal(self.stats_step, self._stats_accum_iter), updateOptimOp, tf.no_op)
                        else:
                            return tf.cond(tf.greater_equal(self.sgd_step, self._cold_iter), updateOptimOp, tf.no_op)
                    updateOps.append(optimOp())

        return tf.group(*updateOps), qr

    def apply_gradients(self, grads):
        coldOptim = tf.train.MomentumOptimizer(
            self._cold_lr, self._momentum)

        def coldSGDstart():
            sgd_grads, sgd_var = zip(*grads)

            if self.max_grad_norm is not None:
                sgd_grads, sgd_grad_norm = tf.clip_by_global_norm(sgd_grads, self.max_grad_norm)

            sgd_grads = list(zip(sgd_grads, sgd_var))

            sgd_step_op = tf.assign_add(self.sgd_step, 1)
            coldOptim_op = coldOptim.apply_gradients(sgd_grads)
            if KFAC_DEBUG:
                with tf.control_dependencies([sgd_step_op, coldOptim_op]):
                    sgd_step_op = tf.Print(
                        sgd_step_op, [self.sgd_step, tf.convert_to_tensor('doing cold sgd step')])
            return tf.group(*[sgd_step_op, coldOptim_op])

        kfacOptim_op, qr = self.apply_gradients_kfac(grads)

        def warmKFACstart():
            return kfacOptim_op

        return tf.cond(tf.greater(self.sgd_step, self._cold_iter), warmKFACstart, coldSGDstart), qr

    def minimize(self, loss, loss_sampled, var_list=None):
        grads = self.compute_gradients(loss, var_list=var_list)
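        # called for its side effect: it sets self._update_stats_op, which
        # apply_gradients_kfac asserts on; the returned op is not needed here
        self.compute_and_apply_stats(
            loss_sampled, var_list=var_list)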
        return self.apply_gradients(grads)
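

# --- Illustrative sketch (not part of the original baselines module) ---
# A minimal, TensorFlow-free restatement of two scalar rules used in the
# optimizer above, assuming plain Python floats stand in for tensors. Both
# helper names are hypothetical and exist only for illustration.
#   * _apply_stats with accumulate=False:
#       stats <- decay * stats + (1 - decay) * new_stats
#   * getKfacPrecondUpdates:
#       scaling = min(1, sqrt(clip_kl / vg))
import math


def _sketch_running_average(old, new, decay):
    """Exponential running average, mirroring the non-accumulating branch."""
    return old * decay + (1. - decay) * new


def _sketch_kl_clip_scaling(vg, clip_kl):
    """Scaling applied to every update so that vg is held within clip_kl."""
    return min(1., math.sqrt(clip_kl / vg))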


================================================
FILE: baselines/acktr/kfac_utils.py
================================================
import tensorflow as tf

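def gmatmul(a, b, transpose_a=False, transpose_b=False, reduce_dim=None):
    """Matrix-multiply a 2-D matrix with one axis of a higher-rank tensor.

    At most one of a, b may have rank > 2; the other must be 2-D. When b is
    the tensor, reduce_dim indexes its axes from the left; when a is the
    tensor, reduce_dim counts from the right. With two 2-D inputs this is
    plain tf.matmul.
    """
    assert reduce_dim is not None

    # matrix times tensor: move reduce_dim to the front, flatten, multiply,
    # then restore the original layout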
    if len(a.get_shape()) == 2 and len(b.get_shape()) > 2:
        # reshape reduce_dim to the left most dim in b
        b_shape = b.get_shape()
        if reduce_dim != 0:
            b_dims = list(range(len(b_shape)))
            b_dims.remove(reduce_dim)
            b_dims.insert(0, reduce_dim)
            b = tf.transpose(b, b_dims)
        b_t_shape = b.get_shape()
        b = tf.reshape(b, [int(b_shape[reduce_dim]), -1])
        result = tf.matmul(a, b, transpose_a=transpose_a,
                           transpose_b=transpose_b)
        result = tf.reshape(result, b_t_shape)
        if reduce_dim != 0:
            b_dims = list(range(len(b_shape)))
            b_dims.remove(0)
            b_dims.insert(reduce_dim, 0)
            result = tf.transpose(result, b_dims)
        return result

    elif len(a.get_shape()) > 2 and len(b.get_shape()) == 2:
        # reshape reduce_dim to the rightmost dim in a
        a_shape = a.get_shape()
        outer_dim = len(a_shape) - 1
        reduce_dim = len(a_shape) - reduce_dim - 1
        if reduce_dim != outer_dim:
            a_dims = list(range(len(a_shape)))
            a_dims.remove(reduce_dim)
            a_dims.insert(outer_dim, reduce_dim)
            a = tf.transpose(a, a_dims)
        a_t_shape = a.get_shape()
        a = tf.reshape(a, [-1, int(a_shape[reduce_dim])])
        result = tf.matmul(a, b, transpose_a=transpose_a,
                           transpose_b=transpose_b)
        result = tf.reshape(result, a_t_shape)
        if reduce_dim != outer_dim:
            a_dims = list(range(len(a_shape)))
            a_dims.remove(outer_dim)
            a_dims.insert(reduce_dim, outer_dim)
            result = tf.transpose(result, a_dims)
        return result

    elif len(a.get_shape()) == 2 and len(b.get_shape()) == 2:
        return tf.matmul(a, b, transpose_a=transpose_a, transpose_b=transpose_b)

    assert False, 'something went wrong'


def clipoutNeg(vec, threshold=1e-6):
    mask = tf.cast(vec > threshold, tf.float32)
    return mask * vec


def detectMinVal(input_mat, var, threshold=1e-6, name='', debug=False):
    eigen_min = tf.reduce_min(input_mat)
    eigen_max = tf.reduce_max(input_mat)
    eigen_ratio = eigen_max / eigen_min
    input_mat_clipped = clipoutNeg(input_mat, threshold)

    if debug:
        input_mat_clipped = tf.cond(tf.logical_or(tf.greater(eigen_ratio, 0.), tf.less(eigen_ratio, -500)), lambda: input_mat_clipped, lambda: tf.Print(
            input_mat_clipped, [tf.convert_to_tensor('screwed ratio ' + name + ' eigen values!!!'), tf.convert_to_tensor(var.name), eigen_min, eigen_max, eigen_ratio]))

    return input_mat_clipped


def factorReshape(Q, e, grad, facIndx=0, ftype='act'):
    grad_shape = grad.get_shape()
    if ftype == 'act':
        assert e.get_shape()[0] == grad_shape[facIndx]
        expanded_shape = [1, ] * len(grad_shape)
        expanded_shape[facIndx] = -1
        e = tf.reshape(e, expanded_shape)
    if ftype == 'grad':
        assert e.get_shape()[0] == grad_shape[len(grad_shape) - facIndx - 1]
        expanded_shape = [1, ] * len(grad_shape)
        expanded_shape[len(grad_shape) - facIndx - 1] = -1
        e = tf.reshape(e, expanded_shape)

    return Q, e


================================================
FILE: baselines/acktr/utils.py
================================================
import tensorflow as tf

def dense(x, size, name, weight_init=None, bias_init=0, weight_loss_dict=None, reuse=None):
    with tf.variable_scope(name, reuse=reuse):
        assert (len(tf.get_variable_scope().name.split('/')) == 2)

        w = tf.get_variable("w", [x.get_shape()[1], size], initializer=weight_init)
        b = tf.get_variable("b", [size], initializer=tf.constant_initializer(bias_init))
        weight_decay_fc = 3e-4

        if weight_loss_dict is not None:
            weight_decay = tf.multiply(tf.nn.l2_loss(w), weight_decay_fc, name='weight_decay_loss')
            weight_loss_dict[w] = weight_decay_fc
            weight_loss_dict[b] = 0.0

            tf.add_to_collection(tf.get_variable_scope().name.split('/')[0] + '_' + 'losses', weight_decay)

        return tf.nn.bias_add(tf.matmul(x, w), b)

def kl_div(action_dist1, action_dist2, action_size):
    mean1, std1 = action_dist1[:, :action_size], action_dist1[:, action_size:]
    mean2, std2 = action_dist2[:, :action_size], action_dist2[:, action_size:]

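    # KL(N(mean1, std1) || N(mean2, std2)) for diagonal Gaussians, summed over action dims:
    # log(std2 / std1) + (std1^2 + (mean1 - mean2)^2) / (2 * std2^2) - 1/2
    numerator = tf.square(mean1 - mean2) + tf.square(std1) - tf.square(std2)
    denominator = 2 * tf.square(std2) + 1e-8
    return tf.reduce_sum(
        numerator / denominator + tf.log(std2) - tf.log(std1), axis=-1)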


================================================
FILE: baselines/bench/__init__.py
================================================
# flake8: noqa F403
from baselines.bench.benchmarks import *
from baselines.bench.monitor import *


================================================
FILE: baselines/bench/benchmarks.py
================================================
import re
import os
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

_atari7 = ['BeamRider', 'Breakout', 'Enduro', 'Pong', 'Qbert', 'Seaquest', 'SpaceInvaders']
_atariexpl7 = ['Freeway', 'Gravitar', 'MontezumaRevenge', 'Pitfall', 'PrivateEye', 'Solaris', 'Venture']

_BENCHMARKS = []

remove_version_re = re.compile(r'-v\d+$')


def register_benchmark(benchmark):
    for b in _BENCHMARKS:
        if b['name'] == benchmark['name']:
            raise ValueError('Benchmark with name %s already registered!' % b['name'])

    # automatically add a description if it is not present
    if 'tasks' in benchmark:
        for t in benchmark['tasks']:
            if 'desc' not in t:
                t['desc'] = remove_version_re.sub('', t.get('env_id', t.get('id')))
    _BENCHMARKS.append(benchmark)


def list_benchmarks():
    return [b['name'] for b in _BENCHMARKS]


def get_benchmark(benchmark_name):
    for b in _BENCHMARKS:
        if b['name'] == benchmark_name:
            return b
    raise ValueError('%s not found! Known benchmarks: %s' % (benchmark_name, list_benchmarks()))


def get_task(benchmark, env_id):
    """Get a task by env_id. Return None if the benchmark doesn't have the env"""
    return next(filter(lambda task: task['env_id'] == env_id, benchmark['tasks']), None)


def find_task_for_env_id_in_any_benchmark(env_id):
    for bm in _BENCHMARKS:
        for task in bm["tasks"]:
            if task["env_id"] == env_id:
                return bm, task
    return None, None


_ATARI_SUFFIX = 'NoFrameskip-v4'

register_benchmark({
    'name': 'Atari50M',
    'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 50M timesteps',
    'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(50e6)} for _game in _atari7]
})

register_benchmark({
    'name': 'Atari10M',
    'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
    'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 6, 'num_timesteps': int(10e6)} for _game in _atari7]
})

register_benchmark({
    'name': 'Atari1Hr',
    'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 1 hour of walltime',
    'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_seconds': 60 * 60} for _game in _atari7]
})

register_benchmark({
    'name': 'AtariExploration10M',
    'description': '7 Atari games emphasizing exploration, with pixel observations, 10M timesteps',
    'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atariexpl7]
})


# MuJoCo

_mujocosmall = [
    'InvertedDoublePendulum-v2', 'InvertedPendulum-v2',
    'HalfCheetah-v2', 'Hopper-v2', 'Walker2d-v2',
    'Reacher-v2', 'Swimmer-v2']
register_benchmark({
    'name': 'Mujoco1M',
    'description': 'Some small 2D MuJoCo tasks, run for 1M timesteps',
    'tasks': [{'env_id': _envid, 'trials': 6, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
})

register_benchmark({
    'name': 'MujocoWalkers',
    'description': 'MuJoCo forward walkers, run for 8M, humanoid 100M',
    'tasks': [
        {'env_id': "Hopper-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
        {'env_id': "Walker2d-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
        {'env_id': "Humanoid-v1", 'trials': 4, 'num_timesteps': 100 * 1000000},
    ]
})

# Bullet
_bulletsmall = [
    'InvertedDoublePendulum', 'InvertedPendulum', 'HalfCheetah', 'Reacher', 'Walker2D', 'Hopper', 'Ant'
]
_bulletsmall = [e + 'BulletEnv-v0' for e in _bulletsmall]

register_benchmark({
    'name': 'Bullet1M',
    'description': '7 MuJoCo-like tasks from Bullet, 1M steps',
    'tasks': [{'env_id': e, 'trials': 6, 'num_timesteps': int(1e6)} for e in _bulletsmall]
})


# Roboschool

register_benchmark({
    'name': 'Roboschool8M',
    'description': 'Small 2D tasks, up to 30 minutes to complete on 8 cores',
    'tasks': [
        {'env_id': "RoboschoolReacher-v1", 'trials': 4, 'num_timesteps': 2 * 1000000},
        {'env_id': "RoboschoolAnt-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
        {'env_id': "RoboschoolHalfCheetah-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
        {'env_id': "RoboschoolHopper-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
        {'env_id': "RoboschoolWalker2d-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
    ]
})
register_benchmark({
    'name': 'RoboschoolHarder',
    'description': 'Test your might!!! Up to 12 hours on 32 cores',
    'tasks': [
        {'env_id': "RoboschoolHumanoid-v1", 'trials': 4, 'num_timesteps': 100 * 1000000},
        {'env_id': "RoboschoolHumanoidFlagrun-v1", 'trials': 4, 'num_timesteps': 200 * 1000000},
        {'env_id': "RoboschoolHumanoidFlagrunHarder-v1", 'trials': 4, 'num_timesteps': 400 * 1000000},
    ]
})

# Other

_atari50 = [  # actually 47
    'Alien', 'Amidar', 'Assault', 'Asterix', 'Asteroids',
    'Atlantis', 'BankHeist', 'BattleZone', 'BeamRider', 'Bowling',
    'Breakout', 'Centipede', 'ChopperCommand', 'CrazyClimber',
    'DemonAttack', 'DoubleDunk', 'Enduro', 'FishingDerby', 'Freeway',
    'Frostbite', 'Gopher', 'Gravitar', 'IceHockey', 'Jamesbond',
    'Kangaroo', 'Krull', 'KungFuMaster', 'MontezumaRevenge', 'MsPacman',
    'NameThisGame', 'Pitfall', 'Pong', 'PrivateEye', 'Qbert',
    'RoadRunner', 'Robotank', 'Seaquest', 'SpaceInvaders', 'StarGunner',
    'Tennis', 'TimePilot', 'Tutankham', 'UpNDown', 'Venture',
    'VideoPinball', 'WizardOfWor', 'Zaxxon',
]

register_benchmark({
    'name': 'Atari50_10M',
    'description': '47 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
    'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari50]
})

# HER DDPG

_fetch_tasks = ['FetchReach-v1', 'FetchPush-v1', 'FetchSlide-v1']
register_benchmark({
    'name': 'Fetch1M',
    'description': 'Fetch* benchmarks for 1M timesteps',
    'tasks': [{'trials': 6, 'env_id': env_id, 'num_timesteps': int(1e6)} for env_id in _fetch_tasks]
})



================================================
FILE: baselines/bench/monitor.py
================================================
__all__ = ['Monitor', 'get_monitor_files', 'load_results']

from gym.core import Wrapper
import time
from glob import glob
import csv
import os.path as osp
import json

class Monitor(Wrapper):
    EXT = "monitor.csv"
    f = None

    def __init__(self, env, filename, allow_early_resets=False, reset_keywords=(), info_keywords=()):
        Wrapper.__init__(self, env=env)
        self.tstart = time.time()
        if filename:
            self.results_writer = ResultsWriter(filename,
                header={"t_start": time.time(), 'env_id' : env.spec and env.spec.id},
                extra_keys=reset_keywords + info_keywords
            )
        else:
            self.results_writer = None
        self.reset_keywords = reset_keywords
        self.info_keywords = info_keywords
        self.allow_early_resets = allow_early_resets
        self.rewards = None
        self.needs_reset = True
        self.episode_rewards = []
        self.episode_lengths = []
        self.episode_times = []
        self.total_steps = 0
        self.current_reset_info = {} # extra info about the current episode, that was passed in during reset()

    def reset(self, **kwargs):
        self.reset_state()
        for k in self.reset_keywords:
            v = kwargs.get(k)
            if v is None:
                raise ValueError('Expected you to pass kwarg %s into reset'%k)
            self.current_reset_info[k] = v
        return self.env.reset(**kwargs)

    def reset_state(self):
        if not self.allow_early_resets and not self.needs_reset:
            raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
        self.rewards = []
        self.needs_reset = False


    def step(self, action):
        if self.needs_reset:
            raise RuntimeError("Tried to step environment that needs reset")
        ob, rew, done, info = self.env.step(action)
        self.update(ob, rew, done, info)
        return (ob, rew, done, info)

    def update(self, ob, rew, done, info):
        self.rewards.append(rew)
        if done:
            self.needs_reset = True
            eprew = sum(self.rewards)
            eplen = len(self.rewards)
            epinfo = {"r": round(eprew, 6), "l": eplen, "t": round(time.time() - self.tstart, 6)}
            for k in self.info_keywords:
                epinfo[k] = info[k]
            self.episode_rewards.append(eprew)
            self.episode_lengths.append(eplen)
            self.episode_times.append(time.time() - self.tstart)
            epinfo.update(self.current_reset_info)
            if self.results_writer:
                self.results_writer.write_row(epinfo)
            assert isinstance(info, dict)
            if isinstance(info, dict):
                info['episode'] = epinfo

        self.total_steps += 1

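    def close(self):
        super(Monitor, self).close()
        # The log file is opened and owned by the ResultsWriter, not by Monitor itself.
        if self.results_writer is not None:
            self.results_writer.f.close()
        if self.f is not None:
            self.f.close()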

    def get_total_steps(self):
        return self.total_steps

    def get_episode_rewards(self):
        return self.episode_rewards

    def get_episode_lengths(self):
        return self.episode_lengths

    def get_episode_times(self):
        return self.episode_times

class LoadMonitorResultsError(Exception):
    pass


class ResultsWriter(object):
    def __init__(self, filename, header='', extra_keys=()):
        self.extra_keys = extra_keys
        assert filename is not None
        if not filename.endswith(Monitor.EXT):
            if osp.isdir(filename):
                filename = osp.join(filename, Monitor.EXT)
            else:
                filename = filename + "." + Monitor.EXT
        self.f = open(filename, "wt")
        if isinstance(header, dict):
            header = '# {} \n'.format(json.dumps(header))
        self.f.write(header)
        self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+tuple(extra_keys))
        self.logger.writeheader()
        self.f.flush()

    def write_row(self, epinfo):
        if self.logger:
            self.logger.writerow(epinfo)
            self.f.flush()


def get_monitor_files(dir):
    return glob(osp.join(dir, "*" + Monitor.EXT))

def load_results(dir):
    import pandas
    monitor_files = (
        glob(osp.join(dir, "*monitor.json")) +
        glob(osp.join(dir, "*monitor.csv"))) # get both csv and (old) json files
    if not monitor_files:
        raise LoadMonitorResultsError("no monitor files of the form *%s found in %s" % (Monitor.EXT, dir))
    dfs = []
    headers = []
    for fname in monitor_files:
        with open(fname, 'rt') as fh:
            if fname.endswith('csv'):
                firstline = fh.readline()
                if not firstline:
                    continue
                assert firstline[0] == '#'
                header = json.loads(firstline[1:])
                df = pandas.read_csv(fh, index_col=None)
                headers.append(header)
            elif fname.endswith('json'): # Deprecated json format
                episodes = []
                lines = fh.readlines()
                header = json.loads(lines[0])
                headers.append(header)
                for line in lines[1:]:
                    episode = json.loads(line)
                    episodes.append(episode)
                df = pandas.DataFrame(episodes)
            else:
                assert 0, 'unreachable'
            df['t'] += header['t_start']
        dfs.append(df)
    df = pandas.concat(dfs)
    df.sort_values('t', inplace=True)
    df.reset_index(inplace=True)
    df['t'] -= min(header['t_start'] for header in headers)
    df.headers = headers # HACK to preserve backwards compatibility
    return df


================================================
FILE: baselines/bench/test_monitor.py
================================================
from .monitor import Monitor
import gym
import json

def test_monitor():
    import pandas
    import os
    import uuid

    env = gym.make("CartPole-v1")
    env.seed(0)
    mon_file = "/tmp/baselines-test-%s.monitor.csv" % uuid.uuid4()
    menv = Monitor(env, mon_file)
    menv.reset()
    for _ in range(1000):
        _, _, done, _ = menv.step(0)
        if done:
            menv.reset()

    f = open(mon_file, 'rt')

    firstline = f.readline()
    assert firstline.startswith('#')
    metadata = json.loads(firstline[1:])
    assert metadata['env_id'] == "CartPole-v1"
    assert set(metadata.keys()) == {'env_id', 't_start'},  "Incorrect keys in monitor metadata"

    last_logline = pandas.read_csv(f, index_col=None)
    assert set(last_logline.keys()) == {'l', 't', 'r'}, "Incorrect keys in monitor logline"
    f.close()
    os.remove(mon_file)


================================================
FILE: baselines/common/__init__.py
================================================
# flake8: noqa F403
from baselines.common.console_util import *
from baselines.common.dataset import Dataset
from baselines.common.math_util import *
from baselines.common.misc_util import *


================================================
FILE: baselines/common/atari_wrappers.py
================================================
import numpy as np
import os
os.environ.setdefault('PATH', '')
from collections import deque
import gym
from gym import spaces
import cv2
cv2.ocl.setUseOpenCL(False)
from .wrappers import TimeLimit


class NoopResetEnv(gym.Wrapper):
    def __init__(self, env, noop_max=30):
        """Sample initial states by taking random number of no-ops on reset.
        No-op is assumed to be action 0.
        """
        gym.Wrapper.__init__(self, env)
        self.noop_max = noop_max
        self.override_num_noops = None
        self.noop_action = 0
        assert env.unwrapped.get_action_meanings()[0] == 'NOOP'

    def reset(self, **kwargs):
        """ Do no-op action for a number of steps in [1, noop_max]."""
        self.env.reset(**kwargs)
        if self.override_num_noops is not None:
            noops = self.override_num_noops
        else:
            noops = self.unwrapped.np_random.randint(1, self.noop_max + 1) #pylint: disable=E1101
        assert noops > 0
        obs = None
        for _ in range(noops):
            obs, _, done, _ = self.env.step(self.noop_action)
            if done:
                obs = self.env.reset(**kwargs)
        return obs

    def step(self, ac):
        return self.env.step(ac)

class FireResetEnv(gym.Wrapper):
    def __init__(self, env):
        """Take action on reset for environments that are fixed until firing."""
        gym.Wrapper.__init__(self, env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def reset(self, **kwargs):
        self.env.reset(**kwargs)
        obs, _, done, _ = self.env.step(1)
        if done:
            self.env.reset(**kwargs)
        obs, _, done, _ = self.env.step(2)
        if done:
            self.env.reset(**kwargs)
        return obs

    def step(self, ac):
        return self.env.step(ac)

class EpisodicLifeEnv(gym.Wrapper):
    def __init__(self, env):
        """Make end-of-life == end-of-episode, but only reset on true game over.
        Done by DeepMind for the DQN and co. since it helps value estimation.
        """
        gym.Wrapper.__init__(self, env)
        self.lives = 0
        self.was_real_done  = True

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        # check current lives, make loss of life terminal,
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives and lives > 0:
            # for Qbert sometimes we stay in lives == 0 condition for a few frames
            # so it's important to keep lives > 0, so that we only reset once
            # the environment advertises done.
            done = True
        self.lives = lives
        return obs, reward, done, info

    def reset(self, **kwargs):
        """Reset only when lives are exhausted.
        This way all states are still reachable even though lives are episodic,
        and the learner need not know about any of this behind-the-scenes.
        """
        if self.was_real_done:
            obs = self.env.reset(**kwargs)
        else:
            # no-op step to advance from terminal/lost life state
            obs, _, _, _ = self.env.step(0)
        self.lives = self.env.unwrapped.ale.lives()
        return obs

class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env, skip=4):
        """Return only every `skip`-th frame"""
        gym.Wrapper.__init__(self, env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
        self._skip       = skip

    def step(self, action):
        """Repeat action, sum reward, and max over last observations."""
        total_reward = 0.0
        done = None
        for i in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            if i == self._skip - 2: self._obs_buffer[0] = obs
            if i == self._skip - 1: self._obs_buffer[1] = obs
            total_reward += reward
            if done:
                break
        # Note that the observation on the done=True frame
        # doesn't matter
        max_frame = self._obs_buffer.max(axis=0)

        return max_frame, total_reward, done, info

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

class ClipRewardEnv(gym.RewardWrapper):
    def __init__(self, env):
        gym.RewardWrapper.__init__(self, env)

    def reward(self, reward):
        """Bin reward to {+1, 0, -1} by its sign."""
        return np.sign(reward)


class WarpFrame(gym.ObservationWrapper):
    def __init__(self, env, width=84, height=84, grayscale=True, dict_space_key=None):
        """
        Warp frames to 84x84 as done in the Nature paper and later work.

        If the environment uses dictionary observations, `dict_space_key` can be specified which indicates which
        observation should be warped.
        """
        super().__init__(env)
        self._width = width
        self._height = height
        self._grayscale = grayscale
        self._key = dict_space_key
        if self._grayscale:
            num_colors = 1
        else:
            num_colors = 3

        new_space = gym.spaces.Box(
            low=0,
            high=255,
            shape=(self._height, self._width, num_colors),
            dtype=np.uint8,
        )
        if self._key is None:
            original_space = self.observation_space
            self.observation_space = new_space
        else:
            original_space = self.observation_space.spaces[self._key]
            self.observation_space.spaces[self._key] = new_space
        assert original_space.dtype == np.uint8 and len(original_space.shape) == 3

    def observation(self, obs):
        if self._key is None:
            frame = obs
        else:
            frame = obs[self._key]

        if self._grayscale:
            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        frame = cv2.resize(
            frame, (self._width, self._height), interpolation=cv2.INTER_AREA
        )
        if self._grayscale:
            frame = np.expand_dims(frame, -1)

        if self._key is None:
            obs = frame
        else:
            obs = obs.copy()
            obs[self._key] = frame
        return obs


class FrameStack(gym.Wrapper):
    def __init__(self, env, k):
        """Stack k last frames.

        Returns lazy array, which is much more memory efficient.

        See Also
        --------
        baselines.common.atari_wrappers.LazyFrames
        """
        gym.Wrapper.__init__(self, env)
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
        self.observation_space = spaces.Box(low=0, high=255, shape=(shp[:-1] + (shp[-1] * k,)), dtype=env.observation_space.dtype)

    def reset(self, **kwargs):
        # forward reset kwargs to the wrapped env, as the other wrappers in this file do
        ob = self.env.reset(**kwargs)
        for _ in range(self.k):
            self.frames.append(ob)
        return self._get_ob()

    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.frames.append(ob)
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        assert len(self.frames) == self.k
        return LazyFrames(list(self.frames))

class ScaledFloatFrame(gym.ObservationWrapper):
    def __init__(self, env):
        gym.ObservationWrapper.__init__(self, env)
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)

    def observation(self, observation):
        # careful! This undoes the memory optimization, use
        # with smaller replay buffers only.
        return np.array(observation).astype(np.float32) / 255.0

class LazyFrames(object):
    def __init__(self, frames):
        """This object ensures that common frames between the observations are only stored once.
        It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
        buffers.

        This object should only be converted to numpy array before being passed to the model.

        You'd not believe how complex the previous solution was."""
        self._frames = frames
        self._out = None

    def _force(self):
        if self._out is None:
            self._out = np.concatenate(self._frames, axis=-1)
            self._frames = None
        return self._out

    def __array__(self, dtype=None):
        out = self._force()
        if dtype is not None:
            out = out.astype(dtype)
        return out

    def __len__(self):
        return len(self._force())

    def __getitem__(self, i):
        return self._force()[i]

    def count(self):
        frames = self._force()
        return frames.shape[frames.ndim - 1]

    def frame(self, i):
        return self._force()[..., i]

def make_atari(env_id, max_episode_steps=None):
    env = gym.make(env_id)
    assert 'NoFrameskip' in env.spec.id
    env = NoopResetEnv(env, noop_max=30)
    env = MaxAndSkipEnv(env, skip=4)
    if max_episode_steps is not None:
        env = TimeLimit(env, max_episode_steps=max_episode_steps)
    return env

def wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack=False, scale=False):
    """Configure environment for DeepMind-style Atari.
    """
    if episode_life:
        env = EpisodicLifeEnv(env)
    if 'FIRE' in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)
    env = WarpFrame(env)
    if scale:
        env = ScaledFloatFrame(env)
    if clip_rewards:
        env = ClipRewardEnv(env)
    if frame_stack:
        env = FrameStack(env, 4)
    return env



================================================
FILE: baselines/common/cg.py
================================================
import numpy as np
def cg(f_Ax, b, cg_iters=10, callback=None, verbose=False, residual_tol=1e-10):
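    """
    Approximately solve A x = b with the conjugate gradient method, where f_Ax(p)
    returns the matrix-vector product A p. Runs at most cg_iters iterations and
    stops early once the squared residual norm drops below residual_tol.

    Demmel p 312
    """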
    p = b.copy()
    r = b.copy()
    x = np.zeros_like(b)
    rdotr = r.dot(r)

    fmtstr =  "%10i %10.3g %10.3g"
    titlestr =  "%10s %10s %10s"
    if verbose: print(titlestr % ("iter", "residual norm", "soln norm"))

    for i in range(cg_iters):
        if callback is not None:
            callback(x)
        if verbose: print(fmtstr % (i, rdotr, np.linalg.norm(x)))
        z = f_Ax(p)
        v = rdotr / p.dot(z)
        x += v*p
        r -= v*z
        newrdotr = r.dot(r)
        mu = newrdotr/rdotr
        p = r + mu*p

        rdotr = newrdotr
        if rdotr < residual_tol:
            break

    if callback is not None:
        callback(x)
    if verbose: print(fmtstr % (i+1, rdotr, np.linalg.norm(x)))  # pylint: disable=W0631
    return x


================================================
FILE: baselines/common/cmd_util.py
================================================
"""
Helpers for scripts like run_atari.py.
"""

import os
try:
    from mpi4py import MPI
except ImportError:
    MPI = None

import gym
from gym.wrappers import FlattenObservation, FilterObservation
from baselines import logger
from baselines.bench import Monitor
from baselines.common import set_global_seeds
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common import retro_wrappers
from baselines.common.wrappers import ClipActionsWrapper

def make_vec_env(env_id, env_type, num_env, seed,
                 wrapper_kwargs=None,
                 env_kwargs=None,
                 start_index=0,
                 reward_scale=1.0,
                 flatten_dict_observations=True,
                 gamestate=None,
                 initializer=None,
                 force_dummy=False):
    """
    Create a wrapped, monitored vectorized environment for Atari and MuJoCo:
    a SubprocVecEnv when num_env > 1 and force_dummy is False, otherwise a DummyVecEnv.
    """
    wrapper_kwargs = wrapper_kwargs or {}
    env_kwargs = env_kwargs or {}
    mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
    seed = seed + 10000 * mpi_rank if seed is not None else None
    logger_dir = logger.get_dir()
    def make_thunk(rank, initializer=None):
        return lambda: make_env(
            env_id=env_id,
            env_type=env_type,
            mpi_rank=mpi_rank,
            subrank=rank,
            seed=seed,
            reward_scale=reward_scale,
            gamestate=gamestate,
            flatten_dict_observations=flatten_dict_observations,
            wrapper_kwargs=wrapper_kwargs,
            env_kwargs=env_kwargs,
            logger_dir=logger_dir,
            initializer=initializer
        )

    set_global_seeds(seed)
    if not force_dummy and num_env > 1:
        return SubprocVecEnv([make_thunk(i + start_index, initializer=initializer) for i in range(num_env)])
    else:
        return DummyVecEnv([make_thunk(i + start_index, initializer=None) for i in range(num_env)])


def make_env(env_id, env_type, mpi_rank=0, subrank=0, seed=None, reward_scale=1.0, gamestate=None, flatten_dict_observations=True, wrapper_kwargs=None, env_kwargs=None, logger_dir=None, initializer=None):
    if initializer is not None:
        initializer(mpi_rank=mpi_rank, subrank=subrank)

    wrapper_kwargs = wrapper_kwargs or {}
    env_kwargs = env_kwargs or {}
    if ':' in env_id:
        import re
        import importlib
        module_name = re.sub(':.*','',env_id)
        env_id = re.sub('.*:', '', env_id)
        importlib.import_module(module_name)
    if env_type == 'atari':
        env = make_atari(env_id)
    elif env_type == 'retro':
        import retro
        gamestate = gamestate or retro.State.DEFAULT
        env = retro_wrappers.make_retro(game=env_id, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE, state=gamestate)
    else:
        env = gym.make(env_id, **env_kwargs)

    if flatten_dict_observations and isinstance(env.observation_space, gym.spaces.Dict):
        env = FlattenObservation(env)

    env.seed(seed + subrank if seed is not None else None)
    env = Monitor(env,
                  logger_dir and os.path.join(logger_dir, str(mpi_rank) + '.' + str(subrank)),
                  allow_early_resets=True)


    if env_type == 'atari':
        env = wrap_deepmind(env, **wrapper_kwargs)
    elif env_type == 'retro':
        if 'frame_stack' not in wrapper_kwargs:
            wrapper_kwargs['frame_stack'] = 1
        env = retro_wrappers.wrap_deepmind_retro(env, **wrapper_kwargs)

    if isinstance(env.action_space, gym.spaces.Box):
        env = ClipActionsWrapper(env)

    if reward_scale != 1:
        env = retro_wrappers.RewardScaler(env, reward_scale)

    return env


def make_mujoco_env(env_id, seed, reward_scale=1.0):
    """
    Create a wrapped, monitored gym.Env for MuJoCo.
    """
    # MPI is None when mpi4py is not installed (see the guarded import above)
    rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
    myseed = seed  + 1000 * rank if seed is not None else None
    set_global_seeds(myseed)
    env = gym.make(env_id)
    logger_path = None if logger.get_dir() is None else os.path.join(logger.get_dir(), str(rank))
    env = Monitor(env, logger_path, allow_early_resets=True)
    env.seed(seed)
    if reward_scale != 1.0:
        from baselines.common.retro_wrappers import RewardScaler
        env = RewardScaler(env, reward_scale)
    return env

def make_robotics_env(env_id, seed, rank=0):
    """
    Create a wrapped, monitored gym.Env for robotics environments,
    keeping only the 'observation' and 'desired_goal' entries of the observation dict.
    """
    set_global_seeds(seed)
    env = gym.make(env_id)
    env = FlattenObservation(FilterObservation(env, ['observation', 'desired_goal']))
    env = Monitor(
        env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)),
        info_keywords=('is_success',))
    env.seed(seed)
    return env

def arg_parser():
    """
    Create an empty argparse.ArgumentParser.
    """
    import argparse
    return argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)

def atari_arg_parser():
    """
    Create an argparse.ArgumentParser for run_atari.py.
    """
    print('Obsolete - use common_arg_parser instead')
    return common_arg_parser()

def mujoco_arg_parser():
    print('Obsolete - use common_arg_parser instead')
    return common_arg_parser()

def common_arg_parser():
    """
    Create an argparse.ArgumentParser with the options shared by the run scripts.
    """
    parser = arg_parser()
    parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
    parser.add_argument('--env_type', help='type of environment, used when the environment type cannot be automatically determined', type=str)
    parser.add_argument('--seed', help='RNG seed', type=int, default=None)
    parser.add_argument('--alg', help='Algorithm', type=str, default='ppo2')
    parser.add_argument('--num_timesteps', type=float, default=1e6)
    parser.add_argument('--network', help='network type (mlp, cnn, lstm, cnn_lstm, conv_only)', default=None)
    parser.add_argument('--gamestate', help='game state to load (so far only used in retro games)', default=None)
    parser.add_argument('--num_env', help='Number of environment copies being run in parallel. When not specified, set to number of cpus for Atari, and to 1 for Mujoco', default=None, type=int)
    parser.add_argument('--reward_scale', help='Reward scale factor. Default: 1.0', default=1.0, type=float)
    parser.add_argument('--save_path', help='Path to save trained model to', default=None, type=str)
    parser.add_argument('--save_video_interval', help='Save video every x steps (0 = disabled)', default=0, type=int)
    parser.add_argument('--save_video_length', help='Length of recorded video. Default: 200', default=200, type=int)
    parser.add_argument('--log_path', help='Directory to save learning curve data.', default=None, type=str)
    parser.add_argument('--play', default=False, action='store_true')
    return parser

def robotics_arg_parser():
    """
    Create an argparse.ArgumentParser for the robotics environments.
    """
    parser = arg_parser()
    parser.add_argument('--env', help='environment ID', type=str, default='FetchReach-v0')
    parser.add_argument('--seed', help='RNG seed', type=int, default=None)
    parser.add_argument('--num-timesteps', type=int, default=int(1e6))
    return parser


def parse_unknown_args(args):
    """
    Parse arguments not consumed by arg parser into a dictionary
    """
    retval = {}
    preceded_by_key = False
    for arg in args:
        if arg.startswith('--'):
            if '=' in arg:
                # Split on the first '=' only, so that values may themselves contain '='
                key, value = arg[2:].split('=', 1)
                retval[key] = value
            else:
                key = arg[2:]
                preceded_by_key = True
        elif preceded_by_key:
            retval[key] = arg
            preceded_by_key = False

    return retval


================================================
FILE: baselines/common/console_util.py
================================================
from __future__ import print_function
from contextlib import contextmanager
import numpy as np
import time
import shlex
import subprocess

# ================================================================
# Misc
# ================================================================

def fmt_row(width, row, header=False):
    out = " | ".join(fmt_item(x, width) for x in row)
    if header: out = out + "\n" + "-"*len(out)
    return out

def fmt_item(x, l):
    if isinstance(x, np.ndarray):
        assert x.ndim==0
        x = x.item()
    if isinstance(x, (float, np.float32, np.float64)):
        v = abs(x)
        if (v < 1e-4 or v > 1e+4) and v > 0:
            rep = "%7.2e" % x
        else:
            rep = "%7.5f" % x
    else: rep = str(x)
    return " "*(l - len(rep)) + rep

color2num = dict(
    gray=30,
    red=31,
    green=32,
    yellow=33,
    blue=34,
    magenta=35,
    cyan=36,
    white=37,
    crimson=38
)

def colorize(string, color='green', bold=False, highlight=False):
    attr = []
    num = color2num[color]
    if highlight: num += 10
    attr.append(str(num))
    if bold: attr.append('1')
    return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string)

def print_cmd(cmd, dry=False):
    if isinstance(cmd, str):  # for shell=True
        pass
    else:
        cmd = ' '.join(shlex.quote(arg) for arg in cmd)
    print(colorize(('CMD: ' if not dry else 'DRY: ') + cmd))


def get_git_commit(cwd=None):
    return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'], cwd=cwd).decode('utf8')

def get_git_commit_message(cwd=None):
    return subprocess.check_output(['git', 'show', '-s', '--format=%B', 'HEAD'], cwd=cwd).decode('utf8')

def ccap(cmd, dry=False, env=None, **kwargs):
    print_cmd(cmd, dry)
    if not dry:
        subprocess.check_call(cmd, env=env, **kwargs)


MESSAGE_DEPTH = 0

@contextmanager
def timed(msg):
    global MESSAGE_DEPTH #pylint: disable=W0603
    print(colorize('\t'*MESSAGE_DEPTH + '=: ' + msg, color='magenta'))
    tstart = time.time()
    MESSAGE_DEPTH += 1
    yield
    MESSAGE_DEPTH -= 1
    print(colorize('\t'*MESSAGE_DEPTH + "done in %.3f seconds"%(time.time() - tstart), color='magenta'))


================================================
FILE: baselines/common/dataset.py
================================================
import numpy as np

class Dataset(object):
    def __init__(self, data_map, deterministic=False, shuffle=True):
        self.data_map = data_map
        self.deterministic = deterministic
        self.enable_shuffle = shuffle
        self.n = next(iter(data_map.values())).shape[0]
        self._next_id = 0
        self.shuffle()

    def shuffle(self):
        if self.deterministic:
            return
        perm = np.arange(self.n)
        np.random.shuffle(perm)

        for key in self.data_map:
            self.data_map[key] = self.data_map[key][perm]

        self._next_id = 0

    def next_batch(self, batch_size):
        if self._next_id >= self.n and self.enable_shuffle:
            self.shuffle()

        cur_id = self._next_id
        cur_batch_size = min(batch_size, self.n - self._next_id)
        self._next_id += cur_batch_size

        data_map = dict()
        for key in self.data_map:
            data_map[key] = self.data_map[key][cur_id:cur_id+cur_batch_size]
        return data_map

    def iterate_once(self, batch_size):
        if self.enable_shuffle: self.shuffle()

        while self._next_id <= self.n - batch_size:
            yield self.next_batch(batch_size)
        self._next_id = 0

    def subset(self, num_elements, deterministic=True):
        data_map = dict()
        for key in self.data_map:
            data_map[key] = self.data_map[key][:num_elements]
        return Dataset(data_map, deterministic)


def iterbatches(arrays, *, num_batches=None, batch_size=None, shuffle=True, include_final_partial_batch=True):
    assert (num_batches is None) != (batch_size is None), 'Provide num_batches or batch_size, but not both'
    arrays = tuple(map(np.asarray, arrays))
    n = arrays[0].shape[0]
    assert all(a.shape[0] == n for a in arrays[1:])
    inds = np.arange(n)
    if shuffle: np.random.shuffle(inds)
    sections = np.arange(0, n, batch_size)[1:] if num_batches is None else num_batches
    for batch_inds in np.array_split(inds, sections):
        if include_final_partial_batch or len(batch_inds) == batch_size:
            yield tuple(a[batch_inds] for a in arrays)
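
# Illustrative sketch (not part of the original module) of how iterbatches()
# forms its batches when batch_size is given: the split points
# np.arange(0, n, batch_size)[1:] cut the index array into consecutive chunks,
# the last of which may be shorter than batch_size. The helper name
# _batch_lengths_sketch is hypothetical.
import numpy as np  # already imported above; repeated so the sketch stands alone

def _batch_lengths_sketch(n, batch_size):
    inds = np.arange(n)
    # Same split points that iterbatches() passes to np.array_split.
    sections = np.arange(0, n, batch_size)[1:]
    return [len(chunk) for chunk in np.array_split(inds, sections)]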


================================================
FILE: baselines/common/distributions.py
================================================
import tensorflow as tf
import numpy as np
import baselines.common.tf_util as U
from baselines.a2c.utils import fc
from tensorflow.python.ops import math_ops

class Pd(object):
    """
    A particular probability distribution
    """
    def flatparam(self):
        raise NotImplementedError
    def mode(self):
        raise NotImplementedError
    def neglogp(self, x):
        # Usually it's easier to define the negative logprob
        raise NotImplementedError
    def kl(self, other):
        raise NotImplementedError
    def entropy(self):
        raise NotImplementedError
    def sample(self):
        raise NotImplementedError
    def logp(self, x):
        return - self.neglogp(x)
    def get_shape(self):
        return self.flatparam().shape
    @property
    def shape(self):
        return self.get_shape()
    def __getitem__(self, idx):
        return self.__class__(self.flatparam()[idx])

class PdType(object):
    """
    Parametrized family of probability distributions
    """
    def pdclass(self):
        raise NotImplementedError
    def pdfromflat(self, flat):
        return self.pdclass()(flat)
    def pdfromlatent(self, latent_vector, init_scale, init_bias):
        raise NotImplementedError
    def param_shape(self):
        raise NotImplementedError
    def sample_shape(self):
        raise NotImplementedError
    def sample_dtype(self):
        raise NotImplementedError

    def param_placeholder(self, prepend_shape, name=None):
        return tf.placeholder(dtype=tf.float32, shape=prepend_shape+self.param_shape(), name=name)
    def sample_placeholder(self, prepend_shape, name=None):
        return tf.placeholder(dtype=self.sample_dtype(), shape=prepend_shape+self.sample_shape(), name=name)

    def __eq__(self, other):
        return (type(self) == type(other)) and (self.__dict__ == other.__dict__)

class CategoricalPdType(PdType):
    def __init__(self, ncat):
        self.ncat = ncat
    def pdclass(self):
        return CategoricalPd
    def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
        pdparam = _matching_fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
        return self.pdfromflat(pdparam), pdparam

    def param_shape(self):
        return [self.ncat]
    def sample_shape(self):
        return []
    def sample_dtype(self):
        return tf.int32


class MultiCategoricalPdType(PdType):
    def __init__(self, nvec):
        self.ncats = nvec.astype('int32')
        assert (self.ncats > 0).all()
    def pdclass(self):
        return MultiCategoricalPd
    def pdfromflat(self, flat):
        return MultiCategoricalPd(self.ncats, flat)

    def pdfromlatent(self, latent, init_scale=1.0, init_bias=0.0):
        pdparam = _matching_fc(latent, 'pi', self.ncats.sum(), init_scale=init_scale, init_bias=init_bias)
        return self.pdfromflat(pdparam), pdparam

    def param_shape(self):
        return [sum(self.ncats)]
    def sample_shape(self):
        return [len(self.ncats)]
    def sample_dtype(self):
        return tf.int32

class DiagGaussianPdType(PdType):
    def __init__(self, size):
        self.size = size
    def pdclass(self):
        return DiagGaussianPd

    def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
        mean = _matching_fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
        logstd = tf.get_variable(name='pi/logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
        pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
        return self.pdfromflat(pdparam), mean

    def param_shape(self):
        return [2*self.size]
    def sample_shape(self):
        return [self.size]
    def sample_dtype(self):
        return tf.float32

class BernoulliPdType(PdType):
    def __init__(self, size):
        self.size = size
    def pdclass(self):
        return BernoulliPd
    def param_shape(self):
        return [self.size]
    def sample_shape(self):
        return [self.size]
    def sample_dtype(self):
        return tf.int32
    def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
        pdparam = _matching_fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
        return self.pdfromflat(pdparam), pdparam

# WRONG SECOND DERIVATIVES
# class CategoricalPd(Pd):
#     def __init__(self, logits):
#         self.logits = logits
#         self.ps = tf.nn.softmax(logits)
#     @classmethod
#     def fromflat(cls, flat):
#         return cls(flat)
#     def flatparam(self):
#         return self.logits
#     def mode(self):
#         return U.argmax(self.logits, axis=-1)
#     def logp(self, x):
#         return -tf.nn.sparse_softmax_cross_entropy_with_logits(self.logits, x)
#     def kl(self, other):
#         return tf.nn.softmax_cross_entropy_with_logits(other.logits, self.ps) \
#                 - tf.nn.softmax_cross_entropy_with_logits(self.logits, self.ps)
#     def entropy(self):
#         return tf.nn.softmax_cross_entropy_with_logits(self.logits, self.ps)
#     def sample(self):
#         u = tf.random_uniform(tf.shape(self.logits))
#         return U.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)

class CategoricalPd(Pd):
    def __init__(self, logits):
        self.logits = logits
    def flatparam(self):
        return self.logits
    def mode(self):
        return tf.argmax(self.logits, axis=-1)

    @property
    def mean(self):
        return tf.nn.softmax(self.logits)
    def neglogp(self, x):
        # return tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
        # Note: we can't use sparse_softmax_cross_entropy_with_logits because
        #       the implementation does not allow second-order derivatives...
        if x.dtype in {tf.uint8, tf.int32, tf.int64}:
            # one-hot encoding
            x_shape_list = x.shape.as_list()
            logits_shape_list = self.logits.get_shape().as_list()[:-1]
            for xs, ls in zip(x_shape_list, logits_shape_list):
                if xs is not None and ls is not None:
                    assert xs == ls, 'shape mismatch: {} in x vs {} in logits'.format(xs, ls)

            x = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
        else:
            # already encoded
            assert x.shape.as_list() == self.logits.shape.as_list()

        return tf.nn.softmax_cross_entropy_with_logits_v2(
            logits=self.logits,
            labels=x)
    def kl(self, other):
        a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
        a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keepdims=True)
        ea0 = tf.exp(a0)
        ea1 = tf.exp(a1)
        z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
        z1 = tf.reduce_sum(ea1, axis=-1, keepdims=True)
        p0 = ea0 / z0
        return tf.reduce_sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
    def entropy(self):
        a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
        ea0 = tf.exp(a0)
        z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
        p0 = ea0 / z0
        return tf.reduce_sum(p0 * (tf.log(z0) - a0), axis=-1)
    def sample(self):
        u = tf.random_uniform(tf.shape(self.logits), dtype=self.logits.dtype)
        return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
    @classmethod
    def fromflat(cls, flat):
        return cls(flat)
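
# Illustrative NumPy sketch (not part of the original module) of the Gumbel-max
# trick that CategoricalPd.sample() relies on: adding independent Gumbel(0, 1)
# noise, -log(-log(u)) with u ~ Uniform(0, 1), to the logits and taking the
# argmax yields a draw from softmax(logits). The helper name and the `rng`
# argument (assumed to be a np.random.RandomState) are hypothetical.
import numpy as np  # already imported above; repeated so the sketch stands alone

def _gumbel_max_sample_np(logits, rng):
    logits = np.asarray(logits, dtype=np.float64)
    # Keep u strictly inside (0, 1) so both logarithms stay finite.
    u = np.clip(rng.uniform(size=logits.shape), 1e-12, 1.0 - 1e-12)
    return np.argmax(logits - np.log(-np.log(u)), axis=-1)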

class MultiCategoricalPd(Pd):
    def __init__(self, nvec, flat):
        self.flat = flat
        self.categoricals = list(map(CategoricalPd,
            tf.split(flat, np.array(nvec, dtype=np.int32), axis=-1)))
    def flatparam(self):
        return self.flat
    def mode(self):
        return tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
    def neglogp(self, x):
        return tf.add_n([p.neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x, axis=-1))])
    def kl(self, other):
        return tf.add_n([p.kl(q) for p, q in zip(self.categoricals, other.categoricals)])
    def entropy(self):
        return tf.add_n([p.entropy() for p in self.categoricals])
    def sample(self):
        return tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32)
    @classmethod
    def fromflat(cls, flat):
        raise NotImplementedError

class DiagGaussianPd(Pd):
    def __init__(self, flat):
        self.flat = flat
        mean, logstd = tf.split(axis=len(flat.shape)-1, num_or_size_splits=2, value=flat)
        self.mean = mean
        self.logstd = logstd
        self.std = tf.exp(logstd)
    def flatparam(self):
        return self.flat
    def mode(self):
        return self.mean
    def neglogp(self, x):
        return 0.5 * tf.reduce_sum(tf.square((x - self.mean) / self.std), axis=-1) \
               + 0.5 * np.log(2.0 * np.pi) * tf.to_float(tf.shape(x)[-1]) \
               + tf.reduce_sum(self.logstd, axis=-1)
    def kl(self, other):
        assert isinstance(other, DiagGaussianPd)
        return tf.reduce_sum(other.logstd - self.logstd + (tf.square(self.std) + tf.square(self.mean - other.mean)) / (2.0 * tf.square(other.std)) - 0.5, axis=-1)
    def entropy(self):
        return tf.reduce_sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
    def sample(self):
        return self.mean + self.std * tf.random_normal(tf.shape(self.mean))
    @classmethod
    def fromflat(cls, flat):
        return cls(flat)
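
# Illustrative NumPy sketch (not part of the original module) of the closed-form
# KL divergence between diagonal Gaussians that DiagGaussianPd.kl() evaluates:
#     KL(p || q) = sum_i [ logstd_q - logstd_p
#                          + (std_p^2 + (mean_p - mean_q)^2) / (2 * std_q^2) - 1/2 ]
# The helper name and argument names are hypothetical.
import numpy as np  # already imported above; repeated so the sketch stands alone

def _diag_gaussian_kl_np(mean_p, logstd_p, mean_q, logstd_q):
    mean_p = np.asarray(mean_p, dtype=np.float64)
    logstd_p = np.asarray(logstd_p, dtype=np.float64)
    mean_q = np.asarray(mean_q, dtype=np.float64)
    logstd_q = np.asarray(logstd_q, dtype=np.float64)
    var_p = np.exp(2.0 * logstd_p)
    var_q = np.exp(2.0 * logstd_q)
    per_dim = logstd_q - logstd_p + (var_p + np.square(mean_p - mean_q)) / (2.0 * var_q) - 0.5
    # Independent dimensions: the total KL is the sum over the last axis.
    return np.sum(per_dim, axis=-1)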


class BernoulliPd(Pd):
    def __init__(self, logits):
        self.logits = logits
        self.ps = tf.sigmoid(logits)
    def flatparam(self):
        return self.logits
    @property
    def mean(self):
        return self.ps
    def mode(self):
        return tf.round(self.ps)
    def neglogp(self, x):
        return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=-1)
    def kl(self, other):
        return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=-1) - tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
    def entropy(self):
        return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
    def sample(self):
        u = tf.random_uniform(tf.shape(self.ps))
        return tf.to_float(math_ops.less(u, self.ps))
    @classmethod
    def fromflat(cls, flat):
        return cls(flat)

def make_pdtype(ac_space):
    from gym import spaces
    if isinstance(ac_space, spaces.Box):
        assert len(ac_space.shape) == 1
        return DiagGaussianPdType(ac_space.shape[0])
    elif isinstance(ac_space, spaces.Discrete):
        return CategoricalPdType(ac_space.n)
    elif isinstance(ac_space, spaces.MultiDiscrete):
        return MultiCategoricalPdType(ac_space.nvec)
    elif isinstance(ac_space, spaces.MultiBinary):
        return BernoulliPdType(ac_space.n)
    else:
        raise NotImplementedError

def shape_el(v, i):
    maybe = v.get_shape()[i]
    if maybe is not None:
        return maybe
    else:
        return tf.shape(v)[i]

@U.in_session
def test_probtypes():
    np.random.seed(0)

    pdparam_diag_gauss = np.array([-.2, .3, .4, -.5, .1, -.5, .1, 0.8])
    diag_gauss = DiagGaussianPdType(pdparam_diag_gauss.size // 2) #pylint: disable=E1101
    validate_probtype(diag_gauss, pdparam_diag_gauss)

    pdparam_categorical = np.array([-.2, .3, .5])
    categorical = CategoricalPdType(pdparam_categorical.size) #pylint: disable=E1101
    validate_probtype(categorical, pdparam_categorical)

    nvec = np.array([1, 2, 3])  # MultiCategoricalPdType calls nvec.astype, so a plain list would fail
    pdparam_multicategorical = np.array([-.2, .3, .5, .1, 1, -.1])
    multicategorical = MultiCategoricalPdType(nvec) #pylint: disable=E1101
    validate_probtype(multicategorical, pdparam_multicategorical)

    pdparam_bernoulli = np.array([-.2, .3, .5])
    bernoulli = BernoulliPdType(pdparam_bernoulli.size) #pylint: disable=E1101
    validate_probtype(bernoulli, pdparam_bernoulli)


def validate_probtype(probtype, pdparam):
    N = 100000
    # Check to see if mean negative log likelihood == differential entropy
    Mval = np.repeat(pdparam[None, :], N, axis=0)
    M = probtype.param_placeholder([N])
    X = probtype.sample_placeholder([N])
    pd = probtype.pdfromflat(M)
    calcloglik = U.function([X, M], pd.logp(X))
    calcent = U.function([M], pd.entropy())
    Xval = tf.get_default_session().run(pd.sample(), feed_dict={M:Mval})
    logliks = calcloglik(Xval, Mval)
    entval_ll = - logliks.mean() #pylint: disable=E1101
    entval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
    entval = calcent(Mval).mean() #pylint: disable=E1101
    assert np.abs(entval - entval_ll) < 3 * entval_ll_stderr # within 3 sigmas

    # Check to see if kldiv[p,q] = - ent[p] - E_p[log q]
    M2 = probtype.param_placeholder([N])
    pd2 = probtype.pdfromflat(M2)
    q = pdparam + np.random.randn(pdparam.size) * 0.1
    Mval2 = np.repeat(q[None, :], N, axis=0)
    calckl = U.function([M, M2], pd.kl(pd2))
    klval = calckl(Mval, Mval2).mean() #pylint: disable=E1101
    logliks = calcloglik(Xval, Mval2)
    klval_ll = - entval - logliks.mean() #pylint: disable=E1101
    klval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
    assert np.abs(klval - klval_ll) < 3 * klval_ll_stderr # within 3 sigmas
    print('ok on', probtype, pdparam)


def _matching_fc(tensor, name, size, init_scale, init_bias):
    if tensor.shape[-1] == size:
        return tensor
    else:
        return fc(tensor, name, size, init_scale=init_scale, init_bias=init_bias)


================================================
FILE: baselines/common/input.py
================================================
import numpy as np
import tensorflow as tf
from gym.spaces import Discrete, Box, MultiDiscrete

def observation_placeholder(ob_space, batch_size=None, name='Ob'):
    '''
    Create a placeholder, sized to match the observation space, into which observations are fed

    Parameters:
    ----------

    ob_space: gym.Space     observation space

    batch_size: int         size of the batch to be fed into input. Can be left None in most cases.

    name: str               name of the placeholder

    Returns:
    -------

    tensorflow placeholder tensor
    '''

    assert isinstance(ob_space, (Discrete, Box, MultiDiscrete)), \
        'Can only deal with Discrete, Box and MultiDiscrete observation spaces for now'

    dtype = ob_space.dtype
    if dtype == np.int8:
        dtype = np.uint8

    return tf.placeholder(shape=(batch_size,) + ob_space.shape, dtype=dtype, name=name)


def observation_input(ob_space, batch_size=None, name='Ob'):
    '''
    Create a placeholder, sized to match the observation space, into which observations are fed,
    and add an input encoder of the appropriate type.
    '''

    placeholder = observation_placeholder(ob_space, batch_size, name)
    return placeholder, encode_observation(ob_space, placeholder)

def encode_observation(ob_space, placeholder):
    '''
    Encode input in the way that is appropriate to the observation space

    Parameters:
    ----------

    ob_space: gym.Space             observation space

    placeholder: tf.placeholder     observation input placeholder
    '''
    if isinstance(ob_space, Discrete):
        return tf.to_float(tf.one_hot(placeholder, ob_space.n))
    elif isinstance(ob_space, Box):
        return tf.to_float(placeholder)
    elif isinstance(ob_space, MultiDiscrete):
        placeholder = tf.cast(placeholder, tf.int32)
        one_hots = [tf.to_float(tf.one_hot(placeholder[..., i], ob_space.nvec[i])) for i in range(placeholder.shape[-1])]
        return tf.concat(one_hots, axis=-1)
    else:
        raise NotImplementedError



================================================
FILE: baselines/common/math_util.py
================================================
import numpy as np
import scipy.signal


def discount(x, gamma):
    """
    computes discounted sums along 0th dimension of x.

    inputs
    ------
    x: ndarray
    gamma: float

    outputs
    -------
    y: ndarray with same shape as x, satisfying

        y[t] = x[t] + gamma*x[t+1] + gamma^2*x[t+2] + ... + gamma^k x[t+k],
                where k = len(x) - t - 1

    """
    assert x.ndim >= 1
    return scipy.signal.lfilter([1],[1,-gamma],x[::-1], axis=0)[::-1]
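
# Illustrative sketch (not part of the original module): a plain-loop reference
# for the recurrence that discount() evaluates with scipy.signal.lfilter,
#     y[t] = x[t] + gamma * y[t+1],  with y[T-1] = x[T-1].
# The name _discount_reference is hypothetical; it may be handy for checking
# the filter-based version on small inputs.
import numpy as np  # already imported above; repeated so the sketch stands alone

def _discount_reference(x, gamma):
    x = np.asarray(x, dtype=np.float64)
    y = np.zeros_like(x)
    running = 0.0
    # Walk backwards in time, accumulating the discounted tail sum.
    for t in range(x.shape[0] - 1, -1, -1):
        running = x[t] + gamma * running
        y[t] = running
    return y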

def explained_variance(ypred,y):
    """
    Computes fraction of variance that ypred explains about y.
    Returns 1 - Var[y-ypred] / Var[y]

    interpretation:
        ev=0  =>  might as well have predicted zero
        ev=1  =>  perfect prediction
        ev<0  =>  worse than just predicting zero

    """
    assert y.ndim == 1 and ypred.ndim == 1
    vary = np.var(y)
    return np.nan if vary==0 else 1 - np.var(y-ypred)/vary

def explained_variance_2d(ypred, y):
    assert y.ndim == 2 and ypred.ndim == 2
    vary = np.var(y, axis=0)
    out = 1 - np.var(y-ypred, axis=0)/vary  # residual variance per column, matching vary
    out[vary < 1e-10] = 0
    return out

def ncc(ypred, y):
    return np.corrcoef(ypred, y)[1,0]

def flatten_arrays(arrs):
    return np.concatenate([arr.flat for arr in arrs])

def unflatten_vector(vec, shapes):
    i=0
    arrs = []
    for shape in shapes:
        size = np.prod(shape)
        arr = vec[i:i+size].reshape(shape)
        arrs.append(arr)
        i += size
    return arrs

def discount_with_boundaries(X, New, gamma):
    """
    X: 2d array of floats, time x features
    New: 2d array of bools, indicating when a new episode has started
    """
    Y = np.zeros_like(X)
    T = X.shape[0]
    Y[T-1] = X[T-1]
    for t in range(T-2, -1, -1):
        Y[t] = X[t] + gamma * Y[t+1] * (1 - New[t+1])
    return Y

def test_discount_with_boundaries():
    gamma=0.9
    x = np.array([1.0, 2.0, 3.0, 4.0], 'float32')
    starts = [1.0, 0.0, 0.0, 1.0]
    y = discount_with_boundaries(x, starts, gamma)
    assert np.allclose(y, [
        1 + gamma * 2 + gamma**2 * 3,
        2 + gamma * 3,
        3,
        4
    ])


================================================
FILE: baselines/common/misc_util.py
================================================
import gym
import numpy as np
import os
import pickle
import random
import tempfile
import zipfile


def zipsame(*seqs):
    L = len(seqs[0])
    assert all(len(seq) == L for seq in seqs[1:])
    return zip(*seqs)


class EzPickle(object):
    """Objects that are pickled and unpickled via their constructor
    arguments.

    Example usage:

        class Dog(Animal, EzPickle):
            def __init__(self, furcolor, tailkind="bushy"):
                Animal.__init__(self)
                EzPickle.__init__(self, furcolor, tailkind)
                ...

    When this object is unpickled, a new Dog will be constructed by passing the provided
    furcolor and tailkind into the constructor. However, philosophers are still not sure
    whether it is still the same dog.

    This is generally needed only for environments which wrap C/C++ code, such as MuJoCo
    and Atari.
    """

    def __init__(self, *args, **kwargs):
        self._ezpickle_args = args
        self._ezpickle_kwargs = kwargs

    def __getstate__(self):
        return {"_ezpickle_args": self._ezpickle_args, "_ezpickle_kwargs": self._ezpickle_kwargs}

    def __setstate__(self, d):
        out = type(self)(*d["_ezpickle_args"], **d["_ezpickle_kwargs"])
        self.__dict__.update(out.__dict__)


def set_global_seeds(i):
    try:
        from mpi4py import MPI
        rank = MPI.COMM_WORLD.Get_rank()
    except ImportError:
        rank = 0

    myseed = i  + 1000 * rank if i is not None else None
    try:
        import tensorflow as tf
        tf.set_random_seed(myseed)
    except ImportError:
        pass
    np.random.seed(myseed)
    random.seed(myseed)


def pretty_eta(seconds_left):
    """Return the number of seconds left in human-readable format.

    Examples:
    2 days
    2 hours and 37 minutes
    less than a minute

    Parameters
    ---------
    seconds_left: int
        Number of seconds to be converted to the ETA
    Returns
    -------
    eta: str
        String representing the pretty ETA.
    """
    minutes_left = seconds_left // 60
    seconds_left %= 60
    hours_left = minutes_left // 60
    minutes_left %= 60
    days_left = hours_left // 24
    hours_left %= 24

    def helper(cnt, name):
        return "{} {}{}".format(str(cnt), name, ('s' if cnt > 1 else ''))

    if days_left > 0:
        msg = helper(days_left, 'day')
        if hours_left > 0:
            msg += ' and ' + helper(hours_left, 'hour')
        return msg
    if hours_left > 0:
        msg = helper(hours_left, 'hour')
        if minutes_left > 0:
            msg += ' and ' + helper(minutes_left, 'minute')
        return msg
    if minutes_left > 0:
        return helper(minutes_left, 'minute')
    return 'less than a minute'


class RunningAvg(object):
    def __init__(self, gamma, init_value=None):
        """Keep a running estimate of a quantity. This is a bit like mean
        but more sensitive to recent changes.

        Parameters
        ----------
        gamma: float
            Must be between 0 and 1, where 0 is the most sensitive to recent
            changes.
        init_value: float or None
            Initial value of the estimate. If None, it will be set on the first update.
        """
        self._value = init_value
        self._gamma = gamma

    def update(self, new_val):
        """Update the estimate.

        Parameters
        ----------
        new_val: float
            newly observed value of the estimated quantity.
        """
        if self._value is None:
            self._value = new_val
        else:
            self._value = self._gamma * self._value + (1.0 - self._gamma) * new_val

    def __float__(self):
        """Get the current estimate"""
        return self._value
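
# Illustrative sketch (not part of the original module) of the update rule in
# RunningAvg.update(): an exponential moving average
#     value <- gamma * value + (1 - gamma) * new_val,
# seeded with the first observation. The function name is hypothetical.
def _running_avg_trace(values, gamma):
    trace = []
    value = None
    for new_val in values:
        # The first observation initializes the estimate; later ones are blended in.
        value = new_val if value is None else gamma * value + (1.0 - gamma) * new_val
        trace.append(value)
    return trace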

def boolean_flag(parser, name, default=False, help=None):
    """Add a boolean flag to argparse parser.

    Parameters
    ----------
    parser: argparse.ArgumentParser
        parser to add the flag to
    name: str
        --<name> will enable the flag, while --no-<name> will disable it
    default: bool or None
        default value of the flag
    help: str
        help string for the flag
    """
    dest = name.replace('-', '_')
    parser.add_argument("--" + name, action="store_true", default=default, dest=dest, help=help)
    parser.add_argument("--no-" + name, action="store_false", dest=dest)


def get_wrapper_by_name(env, classname):
    """Given a gym environment possibly wrapped multiple times, return the wrapper
    whose class is named classname, or raise ValueError if no such wrapper was applied.

    Parameters
    ----------
    env: gym.Env or gym.Wrapper
        gym environment
    classname: str
        name of the wrapper

    Returns
    -------
    wrapper: gym.Wrapper
        wrapper named classname
    """
    currentenv = env
    while True:
        if classname == currentenv.class_name():
            return currentenv
        elif isinstance(currentenv, gym.Wrapper):
            currentenv = currentenv.env
        else:
            raise ValueError("Couldn't find wrapper named %s" % classname)


def relatively_safe_pickle_dump(obj, path, compression=False):
    """This is just like a regular pickle dump, except that the failure cases are
    different:

        - It is never possible to end up with a pickle in a corrupted state.
        - If there was a different file at the path, that file will remain unchanged in the
          event of failure (provided that filesystem rename is atomic).
        - It is sometimes possible to end up with a useless temp file which needs to be
          deleted manually (it will be removed automatically on the next function call).

    The intended use case is periodic checkpoints of experiment state, such that we never
    corrupt previous checkpoints if the current one fails.

    Parameters
    ----------
    obj: object
        object to pickle
    path: str
        path to the output file
    compression: bool
        if true pickle will be compressed
    """
    temp_storage = path + ".relatively_safe"
    if compression:
        # Using gzip here would be simpler, but the size is limited to 2GB
        with tempfile.NamedTemporaryFile() as uncompressed_file:
            pickle.dump(obj, uncompressed_file)
            uncompressed_file.file.flush()
            with zipfile.ZipFile(temp_storage, "w", compression=zipfile.ZIP_DEFLATED) as myzip:
                myzip.write(uncompressed_file.name, "data")
    else:
        with open(temp_storage, "wb") as f:
            pickle.dump(obj, f)
    os.rename(temp_storage, path)


def pickle_load(path, compression=False):
    """Unpickle a possibly compressed pickle.

    Parameters
    ----------
    path: str
        path to the input file
    compression: bool
        if true assumes that pickle was compressed when created and attempts decompression.

    Returns
    -------
    obj: object
        the unpickled object
    """

    if compression:
        with zipfile.ZipFile(path, "r", compression=zipfile.ZIP_DEFLATED) as myzip:
            with myzip.open("data") as f:
                return pickle.load(f)
    else:
        with open(path, "rb") as f:
            return pickle.load(f)
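

# Round-trip sketch (illustrative, not part of the original module): the
# write-to-temp-then-rename pattern that relatively_safe_pickle_dump relies on,
# reduced to the uncompressed case. The name _demo_atomic_pickle_roundtrip is
# hypothetical, and os.replace is a swapped-in choice: unlike os.rename it also
# overwrites an existing destination on Windows.
import os
import pickle
import tempfile


def _demo_atomic_pickle_roundtrip(obj):
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "checkpoint.pkl")
        temp_storage = path + ".relatively_safe"
        # Write the full pickle to a sibling temp file first ...
        with open(temp_storage, "wb") as f:
            pickle.dump(obj, f)
        # ... then swap it into place; a crash before this line leaves `path` untouched.
        os.replace(temp_storage, path)
        with open(path, "rb") as f:
            return pickle.load(f)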


================================================
FILE: baselines/common/models.py
================================================
import numpy as np
import tensorflow as tf
from baselines.a2c import utils
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch
from baselines.common.mpi_running_mean_std import RunningMeanStd

mapping = {}

def register(name):
    def _thunk(func):
        mapping[name] = func
        return func
    return _thunk

def nature_cnn(unscaled_images, **conv_kwargs):
    """
    CNN from Nature paper.
    """
    scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
    activ = tf.nn.relu
    h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
                   **conv_kwargs))
    h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
    h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
    h3 = conv_to_fc(h3)
    return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
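

# Shape sketch (illustrative, not part of the original module): spatial sizes
# through the three nature_cnn convolutions. Assumptions, labeled as such: an
# 84x84 input frame and 'VALID' padding (the default of conv in
# baselines/a2c/utils.py). The helper names below are hypothetical.
def _valid_conv_out(size, rf, stride):
    # Output length of a 'VALID' convolution along one spatial dimension.
    return (size - rf) // stride + 1


def _nature_cnn_flat_size(size=84):
    for rf, stride in [(8, 4), (4, 2), (3, 1)]:  # c1, c2, c3
        size = _valid_conv_out(size, rf, stride)
    return size * size * 64  # c3 has nf=64 filters; this is the width fed to fc1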

def build_impala_cnn(unscaled_images, depths=[16,32,32], **conv_kwargs):
    """
    Model used in the paper "IMPALA: Scalable Distributed Deep-RL with
    Importance Weighted Actor-Learner Architectures" https://arxiv.org/abs/1802.01561
    """

    layer_num = 0

    def get_layer_num_str():
        nonlocal layer_num
        num_str = str(layer_num)
        layer_num += 1
        return num_str

    def conv_layer(out, depth):
        return tf.layers.conv2d(out, depth, 3, padding='same', name='layer_' + get_layer_num_str())

    def residual_block(inputs):
        depth = inputs.get_shape()[-1].value

        out = tf.nn.relu(inputs)

        out = conv_layer(out, depth)
        out = tf.nn.relu(out)
        out = conv_layer(out, depth)
        return out + inputs

    def conv_sequence(inputs, depth):
        out = conv_layer(inputs, depth)
        out = tf.layers.max_pooling2d(out, pool_size=3, strides=2, padding='same')
        out = residual_block(out)
        out = residual_block(out)
        return out

    out = tf.cast(unscaled_images, tf.float32) / 255.

    for depth in depths:
        out = conv_sequence(out, depth)

    out = tf.layers.flatten(out)
    out = tf.nn.relu(out)
    out = tf.layers.dense(out, 256, activation=tf.nn.relu, name='layer_' + get_layer_num_str())

    return out
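

# Shape sketch (illustrative, not part of the original module): spatial sizes
# through build_impala_cnn. Assumptions, labeled as such: a square 84x84 input,
# and the usual 'same' padding rule that a stride-2 pooling maps a length n to
# ceil(n / 2) while the stride-1 convolutions keep n unchanged. The helper name
# _impala_cnn_flat_size is hypothetical.
def _impala_cnn_flat_size(size=84, depths=(16, 32, 32)):
    for _ in depths:
        size = -(-size // 2)  # ceil(size / 2): one max-pool per conv_sequence
    return size * size * depths[-1]  # width of the tensor after tf.layers.flatten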


@register("mlp")
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh, layer_norm=False):
    """
    Stack of fully-connected layers to be used in a policy / q-function approximator

    Parameters:
    ----------

    num_layers: int                 number of fully-connected layers (default: 2)

    num_hidden: int                 size of fully-connected layers (default: 64)

    activation:                     activation function (default: tf.tanh)

    layer_norm: bool                if True, apply layer normalization before each activation (default: False)

    Returns:
    -------

    function that builds fully connected network with a given input tensor / placeholder
    """
    def network_fn(X):
        h = tf.layers.flatten(X)
        for i in range(num_layers):
            h = fc(h, 'mlp_fc{}'.format(i), nh=num_hidden, init_scale=np.sqrt(2))
            if layer_norm:
                h = tf.contrib.layers.layer_norm(h, center=True, scale=True)
            h = activation(h)

        return h

    return network_fn


@register("cnn")
def cnn(**conv_kwargs):
    def network_fn(X):
        return nature_cnn(X, **conv_kwargs)
    return network_fn

@register("impala_cnn")
def impala_cnn(**conv_kwargs):
    def network_fn(X):
        return build_impala_cnn(X, **conv_kwargs)
    return network_fn

@register("cnn_small")
def cnn_small(**conv_kwargs):
    def network_fn(X):
        h = tf.cast(X, tf.float32) / 255.

        activ = tf.nn.relu
        h = activ(conv(h, 'c1', nf=8, rf=8, stride=4, init_scale=np.sqrt(2), **conv_kwargs))
        h = activ(conv(h, 'c2', nf=16, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
        h = conv_to_fc(h)
        h = activ(fc(h, 'fc1', nh=128, init_scale=np.sqrt(2)))
        return h
    return network_fn

@register("lstm")
def lstm(nlstm=128, layer_norm=False):
    """
    Builds LSTM (Long Short-Term Memory) network to be used in a policy.
    Note that the resulting function returns not only the output of the LSTM
    (i.e. hidden state of lstm for each step in the sequence), but also a dictionary
    with auxiliary tensors to be set as policy attributes.

    Specifically,
        S is a placeholder to feed current state (LSTM state has to be managed outside policy)
        M is a placeholder for the mask (used to mask out observations after the end of the episode, but can be used for other purposes too)
        initial_state is a numpy array containing initial lstm state (usually zeros)
        state is the output LSTM state (to be fed into S at the next call)


    An example of usage of lstm-based policy can be found here: common/tests/test_doc_examples.py/test_lstm_example

    Parameters:
    ----------

    nlstm: int          LSTM hidden state size

    layer_norm: bool    if True, layer-normalized version of LSTM is used

    Returns:
    -------

    function that builds LSTM with a given input tensor / placeholder
    """

    def network_fn(X, nenv=1):
        nbatch = X.shape[0]
        nsteps = nbatch // nenv

        h = tf.layers.flatten(X)

        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
        S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states

        xs = batch_to_seq(h, nenv, nsteps)
        ms = batch_to_seq(M, nenv, nsteps)

        if layer_norm:
            h5, snew = utils.lnlstm(xs, ms, S, scope='lnlstm', nh=nlstm)
        else:
            h5, snew = utils.lstm(xs, ms, S, scope='lstm', nh=nlstm)

        h = seq_to_batch(h5)
        initial_state = np.zeros(S.shape.as_list(), dtype=float)

        return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}

    return network_fn
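

# Layout sketch (illustrative, not part of the original module): how the flat
# batch, the mask M and the state S fit together, in NumPy. It assumes what
# baselines/a2c/utils.py does: batch_to_seq reshapes the batch env-major into
# [nenv, nsteps, ...], S holds the cell and hidden state side by side, and a
# mask of 1 zeroes both before the step. All _demo_* names are hypothetical.
import numpy as np


def _demo_batch_to_seq(flat, nenv, nsteps):
    # Row e * nsteps + t of the flat batch is step t of environment e.
    seq = np.reshape(flat, (nenv, nsteps, -1))
    return [seq[:, t] for t in range(nsteps)]


def _demo_mask_state(S, m):
    # Split S into (c, h) and reset the rows whose previous step ended an episode.
    c, h = np.split(S, 2, axis=1)
    keep = (1.0 - m)[:, None]
    return np.concatenate([c * keep, h * keep], axis=1)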


@register("cnn_lstm")
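def cnn_lstm(nlstm=128, layer_norm=False, conv_fn=nature_cnn, **conv_kwargs):
    """
    Same as lstm, except that observations first go through conv_fn (nature_cnn by
    default) instead of being flattened. The returned function yields the same
    auxiliary tensors (S, M, state, initial_state) as lstm.
    """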
    def network_fn(X, nenv=1):
        nbatch = X.shape[0]
        nsteps = nbatch // nenv

        h = conv_fn(X, **conv_kwargs)

        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
        S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states

        xs = batch_to_seq(h, nenv, nsteps)
        ms = batch_to_seq(M, nenv, nsteps)

        if layer_norm:
            h5, snew = utils.lnlstm(xs, ms, S, scope='lnlstm', nh=nlstm)
        else:
            h5, snew = utils.lstm(xs, ms, S, scope='lstm', nh=nlstm)

        h = seq_to_batch(h5)
        initial_state = np.zeros(S.shape.as_list(), dtype=float)

        return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}

    return network_fn

@register("impala_cnn_lstm")
def impala_cnn_lstm():
    return cnn_lstm(nlstm=256, conv_fn=build_impala_cnn)

@register("cnn_lnlstm")
def cnn_lnlstm(nlstm=128, **conv_kwargs):
    return cnn_lstm(nlstm, layer_norm=True, **conv_kwargs)


@register("conv_only")
def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
    '''
    convolutions-only net

    Parameters:
    ----------

    convs:      list of triples (filter_number, filter_size, stride) specifying parameters for each layer.

    Returns:
    -------

    function that takes tensorflow tensor as input and returns the output of the last convolutional layer

    '''

    def network_fn(X):
        out = tf.cast(X, tf.float32) / 255.
        with tf.variable_scope("convnet"):
            for num_outputs, kernel_size, stride in convs:
                out = tf.contrib.layers.convolution2d(out,
                                           num_outputs=num_outputs,
                                           kernel_size=kernel_size,
                                           stride=stride,
                                           activation_fn=tf.nn.relu,
                                           **conv_kwargs)

        return out
    return network_fn

def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
    rms = RunningMeanStd(shape=x.shape[1:])
    norm_x = tf.clip_by_value((x - rms.mean) / rms.std, min(clip_range), max(clip_range))
    return norm_x, rms


def get_network_builder(name):
    """
    Return the network builder registered under `name` (a callable is returned as-is).

    If you want to register your own network outside models.py, you just need:

    Usage Example:
    -------------
    from baselines.common.models import register
    @register("your_network_name")
    def your_network_define(**net_kwargs):
        ...
        return network_fn

    """
    if callable(name):
        return name
    elif name in mapping:
        return mapping[name]
    else:
        raise ValueError('Unknown network type: {}'.format(name))


================================================
FILE: baselines/common/mpi_adam.py
================================================
import baselines.common.tf_util as U
import tensorflow as tf
import numpy as np
try:
    from mpi4py import MPI
except ImportError:
    MPI = None


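class MpiAdam(object):
    """
    Adam on the flattened parameters of var_list. Gradients are summed across the MPI
    communicator before each step (and averaged if scale_grad_by_procs is True); with
    no communicator available this reduces to single-process Adam.
    """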
    def __init__(self, var_list, *, beta1=0.9, beta2=0.999, epsilon=1e-08, scale_grad_by_procs=True, comm=None):
        self.var_list = var_list
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.scale_grad_by_procs = scale_grad_by_procs
        size = sum(U.numel(v) for v in var_list)
        self.m = np.zeros(size, 'float32')
        self.v = np.zeros(size, 'float32')
        self.t = 0
        self.setfromflat = U.SetFromFlat(var_list)
        self.getflat = U.GetFlat(var_list)
        self.comm = MPI.COMM_WORLD if comm is None and MPI is not None else comm

    def update(self, localg, stepsize):
        if self.t % 100 == 0:
            self.check_synced()
        localg = localg.astype('float32')
        if self.comm is not None:
            globalg = np.zeros_like(localg)
            self.comm.Allreduce(localg, globalg, op=MPI.SUM)
            if self.scale_grad_by_procs:
                globalg /= self.comm.Get_size()
        else:
            globalg = np.copy(localg)

        self.t += 1
        a = stepsize * np.sqrt(1 - self.beta2**self.t)/(1 - self.beta1**self.t)
        self.m = self.beta1 * self.m + (1 - self.beta1) * globalg
        self.v = self.beta2 * self.v + (1 - self.beta2) * (globalg * globalg)
        step = (- a) * self.m / (np.sqrt(self.v) + self.epsilon)
        self.setfromflat(self.getflat() + step)

    def sync(self):
        if self.comm is None:
            return
        theta = self.getflat()
        self.comm.Bcast(theta, root=0)
        self.setfromflat(theta)

    def check_synced(self):
        if self.comm is None:
            return
        if self.comm.Get_rank() == 0: # this is root
            theta = self.getflat()
            self.comm.Bcast(theta, root=0)
        else:
            thetalocal = self.getflat()
            thetaroot = np.empty_like(thetalocal)
            self.comm.Bcast(thetaroot, root=0)
            assert (thetaroot == thetalocal).all(), (thetaroot, thetalocal)
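

# Single-process sketch (illustrative, not part of the original module): the
# arithmetic of MpiAdam.update with comm=None, on plain NumPy arrays instead of
# TensorFlow variables. The function name _demo_adam_step is hypothetical.
import numpy as np


def _demo_adam_step(theta, g, m, v, t, stepsize, beta1=0.9, beta2=0.999, epsilon=1e-08):
    t += 1
    # Bias correction is folded into the step size, as in update() above.
    a = stepsize * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g * g)
    theta = theta - a * m / (np.sqrt(v) + epsilon)
    return theta, m, v, t


# On the very first step the update has magnitude close to stepsize in every
# coordinate, whatever the gradient scale, because m / sqrt(v) is +-1 after
# bias correction.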

@U.in_session
def test_MpiAdam():
    np.random.seed(0)
    tf.set_random_seed(0)

    a = tf.Variable(np.random.randn(3).astype('float32'))
    b = tf.Variable(np.random.randn(2,5).astype('float32'))
    loss = tf.reduce_sum(tf.square(a)) + tf.reduce_sum(tf.sin(b))

    stepsize = 1e-2
    update_op = tf.train.AdamOptimizer(stepsize).minimize(loss)
    do_update = U.function([], loss, updates=[update_op])

    tf.get_default_session().run(tf.global_variables_initializer())
    losslist_ref = []
    for i in range(10):
        l = do_update()
        print(i, l)
        losslist_ref.append(l)



    tf.set_random_seed(0)
    tf.get_default_session().run(tf.global_variables_initializer())

    var_list = [a,
SYMBOL INDEX (1041 symbols across 131 files)

FILE: baselines/a2c/a2c.py
  class Model (line 19) | class Model(object):
    method __init__ (line 33) | def __init__(self, policy, env, nsteps,
  function learn (line 119) | def learn(

FILE: baselines/a2c/runner.py
  class Runner (line 5) | class Runner(AbstractEnvRunner):
    method __init__ (line 15) | def __init__(self, env, model, nsteps=5, gamma=0.99):
    method run (line 21) | def run(self):

FILE: baselines/a2c/utils.py
  function sample (line 6) | def sample(logits):
  function cat_entropy (line 10) | def cat_entropy(logits):
  function cat_entropy_softmax (line 17) | def cat_entropy_softmax(p0):
  function ortho_init (line 20) | def ortho_init(scale=1.0):
  function conv (line 37) | def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_...
  function fc (line 58) | def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
  function batch_to_seq (line 65) | def batch_to_seq(h, nbatch, nsteps, flat=False):
  function seq_to_batch (line 72) | def seq_to_batch(h, flat = False):
  function lstm (line 81) | def lstm(xs, ms, s, scope, nh, init_scale=1.0):
  function _ln (line 104) | def _ln(x, g, b, e=1e-5, axes=[1]):
  function lnlstm (line 110) | def lnlstm(xs, ms, s, scope, nh, init_scale=1.0):
  function conv_to_fc (line 142) | def conv_to_fc(x):
  function discount_with_dones (line 147) | def discount_with_dones(rewards, dones, gamma):
  function find_trainable_variables (line 155) | def find_trainable_variables(key):
  function make_path (line 158) | def make_path(f):
  function constant (line 161) | def constant(p):
  function linear (line 164) | def linear(p):
  function middle_drop (line 167) | def middle_drop(p):
  function double_linear_con (line 173) | def double_linear_con(p):
  function double_middle_drop (line 180) | def double_middle_drop(p):
  class Scheduler (line 197) | class Scheduler(object):
    method __init__ (line 199) | def __init__(self, v, nvalues, schedule):
    method value (line 205) | def value(self):
    method value_steps (line 210) | def value_steps(self, steps):
  class EpisodeStats (line 214) | class EpisodeStats:
    method __init__ (line 215) | def __init__(self, nsteps, nenvs):
    method feed (line 224) | def feed(self, rewards, masks):
    method mean_length (line 237) | def mean_length(self):
    method mean_reward (line 243) | def mean_reward(self):
  function get_by_index (line 251) | def get_by_index(x, idx):
  function check_shape (line 259) | def check_shape(ts,shapes):
  function avg_norm (line 265) | def avg_norm(t):
  function gradient_add (line 268) | def gradient_add(g1, g2, param):
  function q_explained_variance (line 278) | def q_explained_variance(qpred, q):

FILE: baselines/acer/acer.py
  function strip (line 21) | def strip(var, nenvs, nsteps, flat = False):
  function q_retrace (line 25) | def q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma):
  class Model (line 58) | class Model(object):
    method __init__ (line 59) | def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, ent_coef...
  class Acer (line 230) | class Acer():
    method __init__ (line 231) | def __init__(self, runner, model, buffer, log_interval):
    method call (line 240) | def call(self, on_policy):
  function learn (line 275) | def learn(network, env, seed=None, nsteps=20, total_timesteps=int(80e6),...

FILE: baselines/acer/buffer.py
  class Buffer (line 3) | class Buffer(object):
    method __init__ (line 5) | def __init__(self, env, nsteps, size=50000):
    method has_atleast (line 30) | def has_atleast(self, frames):
    method can_sample (line 35) | def can_sample(self):
    method decode (line 39) | def decode(self, enc_obs, dones):
    method put (line 47) | def put(self, enc_obs, actions, rewards, mus, dones, masks):
    method take (line 70) | def take(self, x, idx, envx):
    method get (line 77) | def get(self):
  function _stack_obs_ref (line 101) | def _stack_obs_ref(enc_obs, dones, nsteps):
  function _stack_obs (line 124) | def _stack_obs(enc_obs, dones, nsteps):
  function test_stack_obs (line 142) | def test_stack_obs():

FILE: baselines/acer/defaults.py
  function atari (line 1) | def atari():

FILE: baselines/acer/policies.py
  class AcerCnnPolicy (line 7) | class AcerCnnPolicy(object):
    method __init__ (line 9) | def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reu...
  class AcerLstmPolicy (line 45) | class AcerLstmPolicy(object):
    method __init__ (line 47) | def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reu...

FILE: baselines/acer/runner.py
  class Runner (line 7) | class Runner(AbstractEnvRunner):
    method __init__ (line 9) | def __init__(self, env, model, nsteps):
    method run (line 26) | def run(self):

FILE: baselines/acktr/acktr.py
  class Model (line 18) | class Model(object):
    method __init__ (line 20) | def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, ...
  function learn (line 95) | def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log...

FILE: baselines/acktr/defaults.py
  function mujoco (line 1) | def mujoco():

FILE: baselines/acktr/kfac.py
  class KfacOptimizer (line 13) | class KfacOptimizer():
    method __init__ (line 15) | def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfa...
    method getFactors (line 58) | def getFactors(self, g, varlist):
    method getStats (line 183) | def getStats(self, factors, varlist):
    method compute_and_apply_stats (line 285) | def compute_and_apply_stats(self, loss_sampled, var_list=None):
    method compute_stats (line 293) | def compute_stats(self, loss_sampled, var_list=None):
    method apply_stats (line 440) | def apply_stats(self, statsUpdates):
    method _apply_stats (line 476) | def _apply_stats(self, statsUpdates, accumulate=False, accumulateCoeff...
    method getStatsEigen (line 512) | def getStatsEigen(self, stats=None):
    method computeStatsEigen (line 538) | def computeStatsEigen(self):
    method applyStatsEigen (line 602) | def applyStatsEigen(self, eigen_list):
    method getKfacPrecondUpdates (line 618) | def getKfacPrecondUpdates(self, gradlist, varlist):
    method compute_gradients (line 803) | def compute_gradients(self, loss, var_list=None):
    method apply_gradients_kfac (line 811) | def apply_gradients_kfac(self, grads):
    method apply_gradients (line 897) | def apply_gradients(self, grads):
    method minimize (line 924) | def minimize(self, loss, loss_sampled, var_list=None):

FILE: baselines/acktr/kfac_utils.py
  function gmatmul (line 3) | def gmatmul(a, b, transpose_a=False, transpose_b=False, reduce_dim=None):
  function clipoutNeg (line 55) | def clipoutNeg(vec, threshold=1e-6):
  function detectMinVal (line 60) | def detectMinVal(input_mat, var, threshold=1e-6, name='', debug=False):
  function factorReshape (line 73) | def factorReshape(Q, e, grad, facIndx=0, ftype='act'):

FILE: baselines/acktr/utils.py
  function dense (line 3) | def dense(x, size, name, weight_init=None, bias_init=0, weight_loss_dict...
  function kl_div (line 21) | def kl_div(action_dist1, action_dist2, action_size):

FILE: baselines/bench/benchmarks.py
  function register_benchmark (line 13) | def register_benchmark(benchmark):
  function list_benchmarks (line 26) | def list_benchmarks():
  function get_benchmark (line 30) | def get_benchmark(benchmark_name):
  function get_task (line 37) | def get_task(benchmark, env_id):
  function find_task_for_env_id_in_any_benchmark (line 42) | def find_task_for_env_id_in_any_benchmark(env_id):

FILE: baselines/bench/monitor.py
  class Monitor (line 10) | class Monitor(Wrapper):
    method __init__ (line 14) | def __init__(self, env, filename, allow_early_resets=False, reset_keyw...
    method reset (line 35) | def reset(self, **kwargs):
    method reset_state (line 44) | def reset_state(self):
    method step (line 51) | def step(self, action):
    method update (line 58) | def update(self, ob, rew, done, info):
    method close (line 79) | def close(self):
    method get_total_steps (line 84) | def get_total_steps(self):
    method get_episode_rewards (line 87) | def get_episode_rewards(self):
    method get_episode_lengths (line 90) | def get_episode_lengths(self):
    method get_episode_times (line 93) | def get_episode_times(self):
  class LoadMonitorResultsError (line 96) | class LoadMonitorResultsError(Exception):
  class ResultsWriter (line 100) | class ResultsWriter(object):
    method __init__ (line 101) | def __init__(self, filename, header='', extra_keys=()):
    method write_row (line 117) | def write_row(self, epinfo):
  function get_monitor_files (line 123) | def get_monitor_files(dir):
  function load_results (line 126) | def load_results(dir):

FILE: baselines/bench/test_monitor.py
  function test_monitor (line 5) | def test_monitor():

FILE: baselines/common/atari_wrappers.py
  class NoopResetEnv (line 12) | class NoopResetEnv(gym.Wrapper):
    method __init__ (line 13) | def __init__(self, env, noop_max=30):
    method reset (line 23) | def reset(self, **kwargs):
    method step (line 38) | def step(self, ac):
  class FireResetEnv (line 41) | class FireResetEnv(gym.Wrapper):
    method __init__ (line 42) | def __init__(self, env):
    method reset (line 48) | def reset(self, **kwargs):
    method step (line 58) | def step(self, ac):
  class EpisodicLifeEnv (line 61) | class EpisodicLifeEnv(gym.Wrapper):
    method __init__ (line 62) | def __init__(self, env):
    method step (line 70) | def step(self, action):
    method reset (line 84) | def reset(self, **kwargs):
  class MaxAndSkipEnv (line 97) | class MaxAndSkipEnv(gym.Wrapper):
    method __init__ (line 98) | def __init__(self, env, skip=4):
    method step (line 105) | def step(self, action):
    method reset (line 122) | def reset(self, **kwargs):
  class ClipRewardEnv (line 125) | class ClipRewardEnv(gym.RewardWrapper):
    method __init__ (line 126) | def __init__(self, env):
    method reward (line 129) | def reward(self, reward):
  class WarpFrame (line 134) | class WarpFrame(gym.ObservationWrapper):
    method __init__ (line 135) | def __init__(self, env, width=84, height=84, grayscale=True, dict_spac...
    method observation (line 166) | def observation(self, obs):
  class FrameStack (line 188) | class FrameStack(gym.Wrapper):
    method __init__ (line 189) | def __init__(self, env, k):
    method reset (line 204) | def reset(self):
    method step (line 210) | def step(self, action):
    method _get_ob (line 215) | def _get_ob(self):
  class ScaledFloatFrame (line 219) | class ScaledFloatFrame(gym.ObservationWrapper):
    method __init__ (line 220) | def __init__(self, env):
    method observation (line 224) | def observation(self, observation):
  class LazyFrames (line 229) | class LazyFrames(object):
    method __init__ (line 230) | def __init__(self, frames):
    method _force (line 241) | def _force(self):
    method __array__ (line 247) | def __array__(self, dtype=None):
    method __len__ (line 253) | def __len__(self):
    method __getitem__ (line 256) | def __getitem__(self, i):
    method count (line 259) | def count(self):
    method frame (line 263) | def frame(self, i):
  function make_atari (line 266) | def make_atari(env_id, max_episode_steps=None):
  function wrap_deepmind (line 275) | def wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack...

FILE: baselines/common/cg.py
  function cg (line 2) | def cg(f_Ax, b, cg_iters=10, callback=None, verbose=False, residual_tol=...

FILE: baselines/common/cmd_util.py
  function make_vec_env (line 22) | def make_vec_env(env_id, env_type, num_env, seed,
  function make_env (line 62) | def make_env(env_id, env_type, mpi_rank=0, subrank=0, seed=None, reward_...
  function make_mujoco_env (line 108) | def make_mujoco_env(env_id, seed, reward_scale=1.0):
  function make_robotics_env (line 124) | def make_robotics_env(env_id, seed, rank=0):
  function arg_parser (line 137) | def arg_parser():
  function atari_arg_parser (line 144) | def atari_arg_parser():
  function mujoco_arg_parser (line 151) | def mujoco_arg_parser():
  function common_arg_parser (line 155) | def common_arg_parser():
  function robotics_arg_parser (line 176) | def robotics_arg_parser():
  function parse_unknown_args (line 187) | def parse_unknown_args(args):

FILE: baselines/common/console_util.py
  function fmt_row (line 12) | def fmt_row(width, row, header=False):
  function fmt_item (line 17) | def fmt_item(x, l):
  function colorize (line 42) | def colorize(string, color='green', bold=False, highlight=False):
  function print_cmd (line 50) | def print_cmd(cmd, dry=False):
  function get_git_commit (line 58) | def get_git_commit(cwd=None):
  function get_git_commit_message (line 61) | def get_git_commit_message(cwd=None):
  function ccap (line 64) | def ccap(cmd, dry=False, env=None, **kwargs):
  function timed (line 73) | def timed(msg):

FILE: baselines/common/dataset.py
  class Dataset (line 3) | class Dataset(object):
    method __init__ (line 4) | def __init__(self, data_map, deterministic=False, shuffle=True):
    method shuffle (line 12) | def shuffle(self):
    method next_batch (line 23) | def next_batch(self, batch_size):
    method iterate_once (line 36) | def iterate_once(self, batch_size):
    method subset (line 43) | def subset(self, num_elements, deterministic=True):
  function iterbatches (line 50) | def iterbatches(arrays, *, num_batches=None, batch_size=None, shuffle=Tr...

FILE: baselines/common/distributions.py
  class Pd (line 7) | class Pd(object):
    method flatparam (line 11) | def flatparam(self):
    method mode (line 13) | def mode(self):
    method neglogp (line 15) | def neglogp(self, x):
    method kl (line 18) | def kl(self, other):
    method entropy (line 20) | def entropy(self):
    method sample (line 22) | def sample(self):
    method logp (line 24) | def logp(self, x):
    method get_shape (line 26) | def get_shape(self):
    method shape (line 29) | def shape(self):
    method __getitem__ (line 31) | def __getitem__(self, idx):
  class PdType (line 34) | class PdType(object):
    method pdclass (line 38) | def pdclass(self):
    method pdfromflat (line 40) | def pdfromflat(self, flat):
    method pdfromlatent (line 42) | def pdfromlatent(self, latent_vector, init_scale, init_bias):
    method param_shape (line 44) | def param_shape(self):
    method sample_shape (line 46) | def sample_shape(self):
    method sample_dtype (line 48) | def sample_dtype(self):
    method param_placeholder (line 51) | def param_placeholder(self, prepend_shape, name=None):
    method sample_placeholder (line 53) | def sample_placeholder(self, prepend_shape, name=None):
    method __eq__ (line 56) | def __eq__(self, other):
  class CategoricalPdType (line 59) | class CategoricalPdType(PdType):
    method __init__ (line 60) | def __init__(self, ncat):
    method pdclass (line 62) | def pdclass(self):
    method pdfromlatent (line 64) | def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
    method param_shape (line 68) | def param_shape(self):
    method sample_shape (line 70) | def sample_shape(self):
    method sample_dtype (line 72) | def sample_dtype(self):
  class MultiCategoricalPdType (line 76) | class MultiCategoricalPdType(PdType):
    method __init__ (line 77) | def __init__(self, nvec):
    method pdclass (line 80) | def pdclass(self):
    method pdfromflat (line 82) | def pdfromflat(self, flat):
    method pdfromlatent (line 85) | def pdfromlatent(self, latent, init_scale=1.0, init_bias=0.0):
    method param_shape (line 89) | def param_shape(self):
    method sample_shape (line 91) | def sample_shape(self):
    method sample_dtype (line 93) | def sample_dtype(self):
  class DiagGaussianPdType (line 96) | class DiagGaussianPdType(PdType):
    method __init__ (line 97) | def __init__(self, size):
    method pdclass (line 99) | def pdclass(self):
    method pdfromlatent (line 102) | def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
    method param_shape (line 108) | def param_shape(self):
    method sample_shape (line 110) | def sample_shape(self):
    method sample_dtype (line 112) | def sample_dtype(self):
  class BernoulliPdType (line 115) | class BernoulliPdType(PdType):
    method __init__ (line 116) | def __init__(self, size):
    method pdclass (line 118) | def pdclass(self):
    method param_shape (line 120) | def param_shape(self):
    method sample_shape (line 122) | def sample_shape(self):
    method sample_dtype (line 124) | def sample_dtype(self):
    method pdfromlatent (line 126) | def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
  class CategoricalPd (line 153) | class CategoricalPd(Pd):
    method __init__ (line 154) | def __init__(self, logits):
    method flatparam (line 156) | def flatparam(self):
    method mode (line 158) | def mode(self):
    method mean (line 162) | def mean(self):
    method neglogp (line 164) | def neglogp(self, x):
    method kl (line 184) | def kl(self, other):
    method entropy (line 193) | def entropy(self):
    method sample (line 199) | def sample(self):
    method fromflat (line 203) | def fromflat(cls, flat):
  class MultiCategoricalPd (line 206) | class MultiCategoricalPd(Pd):
    method __init__ (line 207) | def __init__(self, nvec, flat):
    method flatparam (line 211) | def flatparam(self):
    method mode (line 213) | def mode(self):
    method neglogp (line 215) | def neglogp(self, x):
    method kl (line 217) | def kl(self, other):
    method entropy (line 219) | def entropy(self):
    method sample (line 221) | def sample(self):
    method fromflat (line 224) | def fromflat(cls, flat):
  class DiagGaussianPd (line 227) | class DiagGaussianPd(Pd):
    method __init__ (line 228) | def __init__(self, flat):
    method flatparam (line 234) | def flatparam(self):
    method mode (line 236) | def mode(self):
    method neglogp (line 238) | def neglogp(self, x):
    method kl (line 242) | def kl(self, other):
    method entropy (line 245) | def entropy(self):
    method sample (line 247) | def sample(self):
    method fromflat (line 250) | def fromflat(cls, flat):
  class BernoulliPd (line 254) | class BernoulliPd(Pd):
    method __init__ (line 255) | def __init__(self, logits):
    method flatparam (line 258) | def flatparam(self):
    method mean (line 261) | def mean(self):
    method mode (line 263) | def mode(self):
    method neglogp (line 265) | def neglogp(self, x):
    method kl (line 267) | def kl(self, other):
    method entropy (line 269) | def entropy(self):
    method sample (line 271) | def sample(self):
    method fromflat (line 275) | def fromflat(cls, flat):
  function make_pdtype (line 278) | def make_pdtype(ac_space):
  function shape_el (line 292) | def shape_el(v, i):
  function test_probtypes (line 300) | def test_probtypes():
  function validate_probtype (line 321) | def validate_probtype(probtype, pdparam):
  function _matching_fc (line 351) | def _matching_fc(tensor, name, size, init_scale, init_bias):

FILE: baselines/common/input.py
  function observation_placeholder (line 5) | def observation_placeholder(ob_space, batch_size=None, name='Ob'):
  function observation_input (line 34) | def observation_input(ob_space, batch_size=None, name='Ob'):
  function encode_observation (line 43) | def encode_observation(ob_space, placeholder):

FILE: baselines/common/math_util.py
  function discount (line 5) | def discount(x, gamma):
  function explained_variance (line 25) | def explained_variance(ypred,y):
  function explained_variance_2d (line 40) | def explained_variance_2d(ypred, y):
  function ncc (line 47) | def ncc(ypred, y):
  function flatten_arrays (line 50) | def flatten_arrays(arrs):
  function unflatten_vector (line 53) | def unflatten_vector(vec, shapes):
  function discount_with_boundaries (line 63) | def discount_with_boundaries(X, New, gamma):
  function test_discount_with_boundaries (line 75) | def test_discount_with_boundaries():

FILE: baselines/common/misc_util.py
  function zipsame (line 10) | def zipsame(*seqs):
  class EzPickle (line 16) | class EzPickle(object):
    method __init__ (line 36) | def __init__(self, *args, **kwargs):
    method __getstate__ (line 40) | def __getstate__(self):
    method __setstate__ (line 43) | def __setstate__(self, d):
  function set_global_seeds (line 48) | def set_global_seeds(i):
  function pretty_eta (line 65) | def pretty_eta(seconds_left):
  class RunningAvg (line 107) | class RunningAvg(object):
    method __init__ (line 108) | def __init__(self, gamma, init_value=None):
    method update (line 123) | def update(self, new_val):
    method __float__ (line 136) | def __float__(self):
  function boolean_flag (line 140) | def boolean_flag(parser, name, default=False, help=None):
  function get_wrapper_by_name (line 159) | def get_wrapper_by_name(env, classname):
  function relatively_safe_pickle_dump (line 185) | def relatively_safe_pickle_dump(obj, path, compression=False):
  function pickle_load (line 221) | def pickle_load(path, compression=False):

FILE: baselines/common/models.py
  function register (line 9) | def register(name):
  function nature_cnn (line 15) | def nature_cnn(unscaled_images, **conv_kwargs):
  function build_impala_cnn (line 28) | def build_impala_cnn(unscaled_images, depths=[16,32,32], **conv_kwargs):
  function mlp (line 75) | def mlp(num_layers=2, num_hidden=64, activation=tf.tanh, layer_norm=False):
  function cnn (line 107) | def cnn(**conv_kwargs):
  function impala_cnn (line 113) | def impala_cnn(**conv_kwargs):
  function cnn_small (line 119) | def cnn_small(**conv_kwargs):
  function lstm (line 132) | def lstm(nlstm=128, layer_norm=False):
  function cnn_lstm (line 187) | def cnn_lstm(nlstm=128, layer_norm=False, conv_fn=nature_cnn, **conv_kwa...
  function impala_cnn_lstm (line 213) | def impala_cnn_lstm():
  function cnn_lnlstm (line 217) | def cnn_lnlstm(nlstm=128, **conv_kwargs):
  function conv_only (line 222) | def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
  function _normalize_clip_observation (line 251) | def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
  function get_network_builder (line 257) | def get_network_builder(name):

FILE: baselines/common/mpi_adam.py
  class MpiAdam (line 10) | class MpiAdam(object):
    method __init__ (line 11) | def __init__(self, var_list, *, beta1=0.9, beta2=0.999, epsilon=1e-08,...
    method update (line 25) | def update(self, localg, stepsize):
    method sync (line 44) | def sync(self):
    method check_synced (line 51) | def check_synced(self):
  function test_MpiAdam (line 64) | def test_MpiAdam():

FILE: baselines/common/mpi_adam_optimizer.py
  class MpiAdamOptimizer (line 11) | class MpiAdamOptimizer(tf.train.AdamOptimizer):
    method __init__ (line 13) | def __init__(self, comm, grad_clip=None, mpi_rank_weight=1, **kwargs):
    method compute_gradients (line 18) | def compute_gradients(self, loss, var_list, **kwargs):
  function check_synced (line 53) | def check_synced(localval, comm=None):
  function test_nonfreeze (line 71) | def test_nonfreeze():

FILE: baselines/common/mpi_fork.py
  function mpi_fork (line 3) | def mpi_fork(n, bind_to_core=False):

FILE: baselines/common/mpi_moments.py
  function mpi_mean (line 6) | def mpi_mean(x, axis=0, comm=None, keepdims=False):
  function mpi_moments (line 20) | def mpi_moments(x, axis=0, comm=None, keepdims=False):
  function test_runningmeanstd (line 35) | def test_runningmeanstd():
  function _helper_runningmeanstd (line 41) | def _helper_runningmeanstd():

FILE: baselines/common/mpi_running_mean_std.py
  class RunningMeanStd (line 8) | class RunningMeanStd(object):
    method __init__ (line 10) | def __init__(self, epsilon=1e-2, shape=()):
    method update (line 41) | def update(self, x):
  function test_runningmeanstd (line 51) | def test_runningmeanstd():
  function test_dist (line 70) | def test_dist():

FILE: baselines/common/mpi_util.py
  function sync_from_root (line 15) | def sync_from_root(sess, variables, comm=None):
  function gpu_count (line 28) | def gpu_count():
  function setup_mpi_gpus (line 37) | def setup_mpi_gpus():
  function get_local_rank_size (line 49) | def get_local_rank_size(comm):
  function share_file (line 69) | def share_file(comm, path):
  function dict_gather (line 87) | def dict_gather(comm, d, op='mean', assert_all_have_data=True):
  function mpi_weighted_mean (line 110) | def mpi_weighted_mean(comm, local_name2valcount):

FILE: baselines/common/plot_util.py
  function smooth (line 11) | def smooth(y, radius, mode='two_sided', valid_only=False):
  function one_sided_ema (line 39) | def one_sided_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=...
  function symmetric_ema (line 111) | def symmetric_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=...
  function load_results (line 152) | def load_results(root_dir_or_dirs, enable_progress=True, enable_monitor=...
  function default_xy_fn (line 227) | def default_xy_fn(r):
  function default_split_fn (line 232) | def default_split_fn(r):
  function plot_results (line 240) | def plot_results(
  function regression_analysis (line 407) | def regression_analysis(df):
  function test_smooth (line 416) | def test_smooth():

FILE: baselines/common/policies.py
  class PolicyWithValue (line 13) | class PolicyWithValue(object):
    method __init__ (line 18) | def __init__(self, env, observations, latent, estimate_q=False, vf_lat...
    method _evaluate (line 66) | def _evaluate(self, variables, observation, **extra_feed):
    method step (line 77) | def step(self, observation, **extra_feed):
    method value (line 98) | def value(self, ob, *args, **kwargs):
    method save (line 115) | def save(self, save_path):
    method load (line 118) | def load(self, load_path):
  function build_policy (line 121) | def build_policy(env, policy_network, value_network=None,  normalize_obs...
  function _normalize_clip_observation (line 182) | def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):

FILE: baselines/common/retro_wrappers.py
  class StochasticFrameSkip (line 10) | class StochasticFrameSkip(gym.Wrapper):
    method __init__ (line 11) | def __init__(self, env, n, stickprob):
    method reset (line 19) | def reset(self, **kwargs):
    method step (line 23) | def step(self, ac):
    method seed (line 45) | def seed(self, s):
  class PartialFrameStack (line 48) | class PartialFrameStack(gym.Wrapper):
    method __init__ (line 49) | def __init__(self, env, k, channel=1):
    method reset (line 63) | def reset(self):
    method step (line 70) | def step(self, ac):
    method _get_ob (line 75) | def _get_ob(self):
  class Downsample (line 80) | class Downsample(gym.ObservationWrapper):
    method __init__ (line 81) | def __init__(self, env, ratio):
    method observation (line 91) | def observation(self, frame):
  class Rgb2gray (line 98) | class Rgb2gray(gym.ObservationWrapper):
    method __init__ (line 99) | def __init__(self, env):
    method observation (line 108) | def observation(self, frame):
  class MovieRecord (line 113) | class MovieRecord(gym.Wrapper):
    method __init__ (line 114) | def __init__(self, env, savedir, k):
    method reset (line 119) | def reset(self):
  class AppendTimeout (line 128) | class AppendTimeout(gym.Wrapper):
    method __init__ (line 129) | def __init__(self, env):
    method step (line 154) | def step(self, ac):
    method reset (line 159) | def reset(self):
    method _process (line 163) | def _process(self, ob):
  class StartDoingRandomActionsWrapper (line 170) | class StartDoingRandomActionsWrapper(gym.Wrapper):
    method __init__ (line 174) | def __init__(self, env, max_random_steps, on_startup=True, every_episo...
    method some_random_steps (line 183) | def some_random_steps(self):
    method reset (line 191) | def reset(self):
    method step (line 194) | def step(self, a):
  function make_retro (line 202) | def make_retro(*, game, state=None, max_episode_steps=4500, **kwargs):
  function wrap_deepmind_retro (line 212) | def wrap_deepmind_retro(env, scale=True, frame_stack=4):
  class SonicDiscretizer (line 224) | class SonicDiscretizer(gym.ActionWrapper):
    method __init__ (line 229) | def __init__(self, env):
    method action (line 242) | def action(self, a): # pylint: disable=W0221
  class RewardScaler (line 245) | class RewardScaler(gym.RewardWrapper):
    method __init__ (line 251) | def __init__(self, env, scale=0.01):
    method reward (line 255) | def reward(self, reward):
  class AllowBacktracking (line 258) | class AllowBacktracking(gym.Wrapper):
    method __init__ (line 265) | def __init__(self, env):
    method reset (line 270) | def reset(self, **kwargs): # pylint: disable=E0202
    method step (line 275) | def step(self, action): # pylint: disable=E0202

FILE: baselines/common/runners.py
  class AbstractEnvRunner (line 4) | class AbstractEnvRunner(ABC):
    method __init__ (line 5) | def __init__(self, *, env, model, nsteps):
    method run (line 17) | def run(self):

FILE: baselines/common/running_mean_std.py
  class RunningMeanStd (line 5) | class RunningMeanStd(object):
    method __init__ (line 7) | def __init__(self, epsilon=1e-4, shape=()):
    method update (line 12) | def update(self, x):
    method update_from_moments (line 18) | def update_from_moments(self, batch_mean, batch_var, batch_count):
  function update_mean_var_count_from_moments (line 22) | def update_mean_var_count_from_moments(mean, var, count, batch_mean, bat...
  class TfRunningMeanStd (line 36) | class TfRunningMeanStd(object):
    method __init__ (line 42) | def __init__(self, epsilon=1e-4, shape=(), scope=''):
    method _set_mean_var_count (line 65) | def _set_mean_var_count(self):
    method update (line 68) | def update(self, x):
  function test_runningmeanstd (line 85) | def test_runningmeanstd():
  function test_tf_runningmeanstd (line 102) | def test_tf_runningmeanstd():
  function profile_tf_runningmeanstd (line 120) | def profile_tf_runningmeanstd():

FILE: baselines/common/schedules.py
  class Schedule (line 12) | class Schedule(object):
    method value (line 13) | def value(self, t):
  class ConstantSchedule (line 18) | class ConstantSchedule(object):
    method __init__ (line 19) | def __init__(self, value):
    method value (line 29) | def value(self, t):
  function linear_interpolation (line 34) | def linear_interpolation(l, r, alpha):
  class PiecewiseSchedule (line 38) | class PiecewiseSchedule(object):
    method __init__ (line 39) | def __init__(self, endpoints, interpolation=linear_interpolation, outs...
    method value (line 64) | def value(self, t):
  class LinearSchedule (line 76) | class LinearSchedule(object):
    method __init__ (line 77) | def __init__(self, schedule_timesteps, final_p, initial_p=1.0):
    method value (line 96) | def value(self, t):

FILE: baselines/common/segment_tree.py
  class SegmentTree (line 4) | class SegmentTree(object):
    method __init__ (line 5) | def __init__(self, capacity, operation, neutral_element):
    method _reduce_helper (line 36) | def _reduce_helper(self, start, end, node, node_start, node_end):
    method reduce (line 51) | def reduce(self, start=0, end=None):
    method __setitem__ (line 76) | def __setitem__(self, idx, val):
    method __getitem__ (line 88) | def __getitem__(self, idx):
  class SumSegmentTree (line 93) | class SumSegmentTree(SegmentTree):
    method __init__ (line 94) | def __init__(self, capacity):
    method sum (line 101) | def sum(self, start=0, end=None):
    method find_prefixsum_idx (line 105) | def find_prefixsum_idx(self, prefixsum):
  class MinSegmentTree (line 134) | class MinSegmentTree(SegmentTree):
    method __init__ (line 135) | def __init__(self, capacity):
    method min (line 142) | def min(self, start=0, end=None):

FILE: baselines/common/test_mpi_util.py
  function test_mpi_weighted_mean (line 10) | def test_mpi_weighted_mean():

FILE: baselines/common/tests/envs/fixed_sequence_env.py
  class FixedSequenceEnv (line 6) | class FixedSequenceEnv(Env):
    method __init__ (line 7) | def __init__(
    method reset (line 21) | def reset(self):
    method step (line 25) | def step(self, actions):
    method seed (line 34) | def seed(self, seed=None):
    method _choose_next_state (line 37) | def _choose_next_state(self):
    method _get_reward (line 40) | def _get_reward(self, actions):

FILE: baselines/common/tests/envs/identity_env.py
  class IdentityEnv (line 7) | class IdentityEnv(Env):
    method __init__ (line 8) | def __init__(
    method reset (line 22) | def reset(self):
    method step (line 30) | def step(self, actions):
    method seed (line 39) | def seed(self, seed=None):
    method _get_reward (line 43) | def _get_reward(self, state, actions):
  class DiscreteIdentityEnv (line 47) | class DiscreteIdentityEnv(IdentityEnv):
    method __init__ (line 48) | def __init__(
    method _get_reward (line 59) | def _get_reward(self, state, actions):
  class MultiDiscreteIdentityEnv (line 62) | class MultiDiscreteIdentityEnv(IdentityEnv):
    method __init__ (line 63) | def __init__(
    method _get_reward (line 73) | def _get_reward(self, state, actions):
  class BoxIdentityEnv (line 77) | class BoxIdentityEnv(IdentityEnv):
    method __init__ (line 78) | def __init__(
    method _get_reward (line 87) | def _get_reward(self, state, actions):

FILE: baselines/common/tests/envs/identity_env_test.py
  function test_discrete_nodelay (line 4) | def test_discrete_nodelay():
  function test_discrete_delay1 (line 20) | def test_discrete_delay1():

FILE: baselines/common/tests/envs/mnist_env.py
  class MnistEnv (line 9) | class MnistEnv(Env):
    method __init__ (line 10) | def __init__(
    method reset (line 35) | def reset(self):
    method step (line 41) | def step(self, actions):
    method seed (line 51) | def seed(self, seed=None):
    method train_mode (line 54) | def train_mode(self):
    method test_mode (line 57) | def test_mode(self):
    method _choose_next_state (line 60) | def _choose_next_state(self):
    method _get_reward (line 68) | def _get_reward(self, actions):

FILE: baselines/common/tests/test_cartpole.py
  function test_cartpole (line 26) | def test_cartpole(alg):

FILE: baselines/common/tests/test_doc_examples.py
  function test_lstm_example (line 14) | def test_lstm_example():

FILE: baselines/common/tests/test_env_after_learn.py
  function test_env_after_learn (line 12) | def test_env_after_learn(algo):

FILE: baselines/common/tests/test_fetchreach.py
  function test_fetchreach (line 21) | def test_fetchreach(alg):

FILE: baselines/common/tests/test_fixed_sequence.py
  function test_fixed_sequence (line 29) | def test_fixed_sequence(alg, rnn):

FILE: baselines/common/tests/test_identity.py
  function test_discrete_identity (line 30) | def test_discrete_identity(alg):
  function test_multidiscrete_identity (line 45) | def test_multidiscrete_identity(alg):
  function test_continuous_identity (line 60) | def test_continuous_identity(alg):

FILE: baselines/common/tests/test_mnist.py
  function test_mnist (line 33) | def test_mnist(alg):

FILE: baselines/common/tests/test_plot_util.py
  function test_plot_util (line 6) | def test_plot_util():

FILE: baselines/common/tests/test_schedules.py
  function test_piecewise_schedule (line 6) | def test_piecewise_schedule():
  function test_constant_schedule (line 23) | def test_constant_schedule():

FILE: baselines/common/tests/test_segment_tree.py
  function test_tree_set (line 6) | def test_tree_set():
  function test_tree_set_overlap (line 20) | def test_tree_set_overlap():
  function test_prefixsum_idx (line 33) | def test_prefixsum_idx():
  function test_prefixsum_idx2 (line 47) | def test_prefixsum_idx2():
  function test_max_interval_tree (line 63) | def test_max_interval_tree():

FILE: baselines/common/tests/test_serialization.py
  function test_serialization (line 35) | def test_serialization(learn_fn, network_fn):
  function test_coexistence (line 87) | def test_coexistence(learn_fn, network_fn):
  function _serialize_variables (line 121) | def _serialize_variables():
  function _get_action_stats (line 128) | def _get_action_stats(model, ob):

FILE: baselines/common/tests/test_tf_util.py
  function test_function (line 10) | def test_function():
  function test_multikwargs (line 26) | def test_multikwargs():

FILE: baselines/common/tests/test_with_mpi.py
  function with_mpi (line 14) | def with_mpi(nproc=2, timeout=30, skip_if_no_mpi=True):

FILE: baselines/common/tests/util.py
  function simple_test (line 14) | def simple_test(env_fn, learn_fn, min_reward_fraction, n_trials=N_TRIALS):
  function reward_per_episode_test (line 41) | def reward_per_episode_test(env_fn, learn_fn, min_avg_reward, n_trials=N...
  function rollout (line 53) | def rollout(env, model, n_trials):
  function smoketest (line 81) | def smoketest(argstr, **kwargs):

FILE: baselines/common/tf_util.py
  function switch (line 9) | def switch(condition, then_expression, else_expression):
  function lrelu (line 30) | def lrelu(x, leak=0.2):
  function huber_loss (line 39) | def huber_loss(x, delta=1.0):
  function get_session (line 51) | def get_session(config=None):
  function make_session (line 58) | def make_session(config=None, num_cpu=None, make_default=False, graph=No...
  function single_threaded_session (line 74) | def single_threaded_session():
  function in_session (line 78) | def in_session(f):
  function initialize (line 87) | def initialize():
  function normc_initializer (line 97) | def normc_initializer(std=1.0, axis=0):
  function conv2d (line 104) | def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad=...
  function function (line 137) | def function(inputs, outputs, updates=None, givens=None):
  class _Function (line 182) | class _Function(object):
    method __init__ (line 183) | def __init__(self, inputs, outputs, updates, givens):
    method _feed_input (line 194) | def _feed_input(self, feed_dict, inpt, value):
    method __call__ (line 200) | def __call__(self, *args, **kwargs):
  function var_shape (line 218) | def var_shape(x):
  function numel (line 224) | def numel(x):
  function intprod (line 227) | def intprod(x):
  function flatgrad (line 230) | def flatgrad(loss, var_list, clip_norm=None):
  class SetFromFlat (line 239) | class SetFromFlat(object):
    method __init__ (line 240) | def __init__(self, var_list, dtype=tf.float32):
    method __call__ (line 254) | def __call__(self, theta):
  class GetFlat (line 257) | class GetFlat(object):
    method __init__ (line 258) | def __init__(self, var_list):
    method __call__ (line 261) | def __call__(self):
  function flattenallbut0 (line 264) | def flattenallbut0(x):
  function get_placeholder (line 273) | def get_placeholder(name, dtype, shape):
  function get_placeholder_cached (line 285) | def get_placeholder_cached(name):
  function display_var_info (line 294) | def display_var_info(vars):
  function get_available_gpus (line 308) | def get_available_gpus(session_config=None):
  function load_state (line 325) | def load_state(fname, sess=None):
  function save_state (line 332) | def save_state(fname, sess=None):
  function save_variables (line 345) | def save_variables(save_path, variables=None, sess=None):
  function load_variables (line 357) | def load_variables(load_path, variables=None, sess=None):
  function adjust_shape (line 377) | def adjust_shape(placeholder, data):
  function _check_shape (line 404) | def _check_shape(placeholder_shape, data_shape):
  function _squeeze_shape (line 419) | def _squeeze_shape(shape):
  function launch_tensorboard_in_background (line 426) | def launch_tensorboard_in_background(log_dir):

FILE: baselines/common/tile_images.py
  function tile_images (line 3) | def tile_images(img_nhwc):

FILE: baselines/common/vec_env/dummy_vec_env.py
  class DummyVecEnv (line 5) | class DummyVecEnv(VecEnv):
    method __init__ (line 12) | def __init__(self, env_fns):
    method step_async (line 31) | def step_async(self, actions):
    method step_wait (line 45) | def step_wait(self):
    method reset (line 58) | def reset(self):
    method _save_obs (line 64) | def _save_obs(self, e, obs):
    method _obs_from_buf (line 71) | def _obs_from_buf(self):
    method get_images (line 74) | def get_images(self):
    method render (line 77) | def render(self, mode='human'):

FILE: baselines/common/vec_env/shmem_vec_env.py
  class ShmemVecEnv (line 20) | class ShmemVecEnv(VecEnv):
    method __init__ (line 25) | def __init__(self, env_fns, spaces=None, context='spawn'):
    method reset (line 61) | def reset(self):
    method step_async (line 69) | def step_async(self, actions):
    method step_wait (line 75) | def step_wait(self):
    method close_extras (line 81) | def close_extras(self):
    method get_images (line 92) | def get_images(self, mode='human'):
    method _decode_obses (line 97) | def _decode_obses(self, obs):
  function _subproc_worker (line 107) | def _subproc_worker(pipe, parent_pipe, env_fn_wrapper, obs_bufs, obs_sha...

FILE: baselines/common/vec_env/subproc_vec_env.py
  function worker (line 7) | def worker(remote, parent_remote, env_fn_wrappers):
  class SubprocVecEnv (line 39) | class SubprocVecEnv(VecEnv):
    method __init__ (line 44) | def __init__(self, env_fns, spaces=None, context='spawn', in_series=1):
    method step_async (line 75) | def step_async(self, actions):
    method step_wait (line 82) | def step_wait(self):
    method reset (line 90) | def reset(self):
    method close_extras (line 98) | def close_extras(self):
    method get_images (line 108) | def get_images(self):
    method _assert_not_closed (line 116) | def _assert_not_closed(self):
    method __del__ (line 119) | def __del__(self):
  function _flatten_obs (line 123) | def _flatten_obs(obs):
  function _flatten_list (line 133) | def _flatten_list(l):

FILE: baselines/common/vec_env/test_vec_env.py
  function assert_venvs_equal (line 14) | def assert_venvs_equal(venv1, venv2, num_steps):
  function test_vec_env (line 49) | def test_vec_env(klass, dtype):  # pylint: disable=R0914
  function test_sync_sampling (line 72) | def test_sync_sampling(dtype, num_envs_in_series):
  function test_sync_sampling_sanity (line 94) | def test_sync_sampling_sanity(dtype, num_envs_in_series):
  class SimpleEnv (line 114) | class SimpleEnv(gym.Env):
    method __init__ (line 120) | def __init__(self, seed, shape, dtype):
    method step (line 133) | def step(self, action):
    method reset (line 140) | def reset(self):
    method render (line 145) | def render(self, mode=None):
  function test_mpi_with_subprocvecenv (line 151) | def test_mpi_with_subprocvecenv():

FILE: baselines/common/vec_env/test_video_recorder.py
  function test_video_recorder (line 20) | def test_video_recorder(klass, num_envs, video_length, video_interval):

FILE: baselines/common/vec_env/util.py
  function copy_obs_dict (line 11) | def copy_obs_dict(obs):
  function dict_to_obs (line 18) | def dict_to_obs(obs_dict):
  function obs_space_info (line 28) | def obs_space_info(obs_space):
  function obs_to_dict (line 56) | def obs_to_dict(obs):

FILE: baselines/common/vec_env/vec_env.py
  class AlreadySteppingError (line 7) | class AlreadySteppingError(Exception):
    method __init__ (line 13) | def __init__(self):
  class NotSteppingError (line 18) | class NotSteppingError(Exception):
    method __init__ (line 24) | def __init__(self):
  class VecEnv (line 29) | class VecEnv(ABC):
    method __init__ (line 43) | def __init__(self, num_envs, observation_space, action_space):
    method reset (line 49) | def reset(self):
    method step_async (line 61) | def step_async(self, actions):
    method step_wait (line 73) | def step_wait(self):
    method close_extras (line 86) | def close_extras(self):
    method close (line 93) | def close(self):
    method step (line 101) | def step(self, actions):
    method render (line 110) | def render(self, mode='human'):
    method get_images (line 121) | def get_images(self):
    method unwrapped (line 128) | def unwrapped(self):
    method get_viewer (line 134) | def get_viewer(self):
  class VecEnvWrapper (line 140) | class VecEnvWrapper(VecEnv):
    method __init__ (line 146) | def __init__(self, venv, observation_space=None, action_space=None):
    method step_async (line 152) | def step_async(self, actions):
    method reset (line 156) | def reset(self):
    method step_wait (line 160) | def step_wait(self):
    method close (line 163) | def close(self):
    method render (line 166) | def render(self, mode='human'):
    method get_images (line 169) | def get_images(self):
    method __getattr__ (line 172) | def __getattr__(self, name):
  class VecEnvObservationWrapper (line 177) | class VecEnvObservationWrapper(VecEnvWrapper):
    method process (line 179) | def process(self, obs):
    method reset (line 182) | def reset(self):
    method step_wait (line 186) | def step_wait(self):
  class CloudpickleWrapper (line 190) | class CloudpickleWrapper(object):
    method __init__ (line 195) | def __init__(self, x):
    method __getstate__ (line 198) | def __getstate__(self):
    method __setstate__ (line 202) | def __setstate__(self, ob):
  function clear_mpi_env_vars (line 208) | def clear_mpi_env_vars():

FILE: baselines/common/vec_env/vec_frame_stack.py
  class VecFrameStack (line 6) | class VecFrameStack(VecEnvWrapper):
    method __init__ (line 7) | def __init__(self, venv, nstack):
    method step_wait (line 17) | def step_wait(self):
    method reset (line 26) | def reset(self):

FILE: baselines/common/vec_env/vec_monitor.py
  class VecMonitor (line 7) | class VecMonitor(VecEnvWrapper):
    method __init__ (line 8) | def __init__(self, venv, filename=None, keep_buf=0, info_keywords=()):
    method reset (line 25) | def reset(self):
    method step_wait (line 31) | def step_wait(self):

FILE: baselines/common/vec_env/vec_normalize.py
  class VecNormalize (line 4) | class VecNormalize(VecEnvWrapper):
    method __init__ (line 10) | def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., g...
    method step_wait (line 26) | def step_wait(self):
    method _obfilt (line 36) | def _obfilt(self, obs):
    method reset (line 44) | def reset(self):

FILE: baselines/common/vec_env/vec_remove_dict_obs.py
  class VecExtractDictObs (line 3) | class VecExtractDictObs(VecEnvObservationWrapper):
    method __init__ (line 4) | def __init__(self, venv, key):
    method process (line 9) | def process(self, obs):

FILE: baselines/common/vec_env/vec_video_recorder.py
  class VecVideoRecorder (line 7) | class VecVideoRecorder(VecEnvWrapper):
    method __init__ (line 12) | def __init__(self, venv, directory, record_video_trigger, video_length...
    method reset (line 39) | def reset(self):
    method start_video_recorder (line 46) | def start_video_recorder(self):
    method _video_enabled (line 60) | def _video_enabled(self):
    method step_wait (line 63) | def step_wait(self):
    method close_video_recorder (line 78) | def close_video_recorder(self):
    method close (line 84) | def close(self):
    method __del__ (line 88) | def __del__(self):

FILE: baselines/common/wrappers.py
  class TimeLimit (line 3) | class TimeLimit(gym.Wrapper):
    method __init__ (line 4) | def __init__(self, env, max_episode_steps=None):
    method step (line 9) | def step(self, ac):
    method reset (line 17) | def reset(self, **kwargs):
  class ClipActionsWrapper (line 21) | class ClipActionsWrapper(gym.Wrapper):
    method step (line 22) | def step(self, action):
    method reset (line 28) | def reset(self, **kwargs):

FILE: baselines/ddpg/ddpg.py
  function learn (line 21) | def learn(network, env,

FILE: baselines/ddpg/ddpg_learner.py
  function normalize (line 17) | def normalize(x, stats):
  function denormalize (line 23) | def denormalize(x, stats):
  function reduce_std (line 28) | def reduce_std(x, axis=None, keepdims=False):
  function reduce_var (line 31) | def reduce_var(x, axis=None, keepdims=False):
  function get_target_updates (line 36) | def get_target_updates(vars, target_vars, tau):
  function get_perturbed_actor_updates (line 50) | def get_perturbed_actor_updates(actor, perturbed_actor, param_noise_stdd...
  class DDPG (line 66) | class DDPG(object):
    method __init__ (line 67) | def __init__(self, actor, critic, memory, observation_shape, action_sh...
    method setup_target_network_updates (line 149) | def setup_target_network_updates(self):
    method setup_param_noise (line 155) | def setup_param_noise(self, normalized_obs0):
    method setup_actor_optimizer (line 172) | def setup_actor_optimizer(self):
    method setup_critic_optimizer (line 183) | def setup_critic_optimizer(self):
    method setup_popart (line 205) | def setup_popart(self):
    method setup_stats (line 223) | def setup_stats(self):
    method step (line 259) | def step(self, obs, apply_noise=True, compute_Q=True):
    method store_transition (line 280) | def store_transition(self, obs0, action, reward, obs1, terminal1):
    method train (line 289) | def train(self):
    method initialize (line 333) | def initialize(self, sess):
    method update_target_net (line 340) | def update_target_net(self):
    method get_stats (line 343) | def get_stats(self):
    method adapt_param_noise (line 362) | def adapt_param_noise(self):
    method reset (line 389) | def reset(self):

FILE: baselines/ddpg/memory.py
  class RingBuffer (line 4) | class RingBuffer(object):
    method __init__ (line 5) | def __init__(self, maxlen, shape, dtype='float32'):
    method __len__ (line 11) | def __len__(self):
    method __getitem__ (line 14) | def __getitem__(self, idx):
    method get_batch (line 19) | def get_batch(self, idxs):
    method append (line 22) | def append(self, v):
  function array_min2d (line 35) | def array_min2d(x):
  class Memory (line 42) | class Memory(object):
    method __init__ (line 43) | def __init__(self, limit, action_shape, observation_shape):
    method sample (line 52) | def sample(self, batch_size):
    method append (line 71) | def append(self, obs0, action, reward, obs1, terminal1, training=True):
    method nb_entries (line 82) | def nb_entries(self):

FILE: baselines/ddpg/models.py
  class Model (line 5) | class Model(object):
    method __init__ (line 6) | def __init__(self, name, network='mlp', **network_kwargs):
    method vars (line 11) | def vars(self):
    method trainable_vars (line 15) | def trainable_vars(self):
    method perturbable_vars (line 19) | def perturbable_vars(self):
  class Actor (line 23) | class Actor(Model):
    method __init__ (line 24) | def __init__(self, nb_actions, name='actor', network='mlp', **network_...
    method __call__ (line 28) | def __call__(self, obs, reuse=False):
  class Critic (line 36) | class Critic(Model):
    method __init__ (line 37) | def __init__(self, name='critic', network='mlp', **network_kwargs):
    method __call__ (line 41) | def __call__(self, obs, action, reuse=False):
    method output_vars (line 49) | def output_vars(self):

FILE: baselines/ddpg/noise.py
  class AdaptiveParamNoiseSpec (line 4) | class AdaptiveParamNoiseSpec(object):
    method __init__ (line 5) | def __init__(self, initial_stddev=0.1, desired_action_stddev=0.1, adop...
    method adapt (line 12) | def adapt(self, distance):
    method get_stats (line 20) | def get_stats(self):
    method __repr__ (line 26) | def __repr__(self):
  class ActionNoise (line 31) | class ActionNoise(object):
    method reset (line 32) | def reset(self):
  class NormalActionNoise (line 36) | class NormalActionNoise(ActionNoise):
    method __init__ (line 37) | def __init__(self, mu, sigma):
    method __call__ (line 41) | def __call__(self):
    method __repr__ (line 44) | def __repr__(self):
  class OrnsteinUhlenbeckActionNoise (line 49) | class OrnsteinUhlenbeckActionNoise(ActionNoise):
    method __init__ (line 50) | def __init__(self, mu, sigma, theta=.15, dt=1e-2, x0=None):
    method __call__ (line 58) | def __call__(self):
    method reset (line 63) | def reset(self):
    method __repr__ (line 66) | def __repr__(self):

FILE: baselines/ddpg/test_smoke.py
  function _run (line 2) | def _run(argstr):
  function test_popart (line 5) | def test_popart():
  function test_noise_normal (line 8) | def test_noise_normal():
  function test_noise_ou (line 11) | def test_noise_ou():
  function test_noise_adaptive (line 14) | def test_noise_adaptive():

FILE: baselines/deepq/__init__.py
  function wrap_atari_dqn (line 6) | def wrap_atari_dqn(env):

FILE: baselines/deepq/build_graph.py
  function scope_vars (line 100) | def scope_vars(scope, trainable_only=False):
  function scope_name (line 121) | def scope_name():
  function absolute_scope_name (line 126) | def absolute_scope_name(relative_scope_name):
  function default_param_noise_filter (line 131) | def default_param_noise_filter(var):
  function build_act (line 146) | def build_act(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None):
  function build_act_with_param_noise (line 202) | def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="...
  function build_train (line 317) | def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_c...

FILE: baselines/deepq/deepq.py
  class ActWrapper (line 23) | class ActWrapper(object):
    method __init__ (line 24) | def __init__(self, act, act_params):
    method load_act (line 30) | def load_act(path):
    method __call__ (line 46) | def __call__(self, *args, **kwargs):
    method step (line 49) | def step(self, observation, **kwargs):
    method save_act (line 55) | def save_act(self, path=None):
    method save (line 74) | def save(self, path):
  function load_act (line 78) | def load_act(path):
  function learn (line 95) | def learn(env,

FILE: baselines/deepq/defaults.py
  function atari (line 1) | def atari():
  function retro (line 19) | def retro():

FILE: baselines/deepq/experiments/custom_cartpole.py
  function model (line 16) | def model(inpt, num_actions, scope, reuse=False):

FILE: baselines/deepq/experiments/enjoy_cartpole.py
  function main (line 6) | def main():

FILE: baselines/deepq/experiments/enjoy_mountaincar.py
  function main (line 7) | def main():

FILE: baselines/deepq/experiments/enjoy_pong.py
  function main (line 5) | def main():

FILE: baselines/deepq/experiments/train_cartpole.py
  function callback (line 6) | def callback(lcl, _glb):
  function main (line 12) | def main():

FILE: baselines/deepq/experiments/train_mountaincar.py
  function main (line 7) | def main():

FILE: baselines/deepq/experiments/train_pong.py
  function main (line 7) | def main():

FILE: baselines/deepq/models.py
  function build_q_func (line 5) | def build_q_func(network, hiddens=[256], dueling=True, layer_norm=False,...

FILE: baselines/deepq/replay_buffer.py
  class ReplayBuffer (line 7) | class ReplayBuffer(object):
    method __init__ (line 8) | def __init__(self, size):
    method __len__ (line 21) | def __len__(self):
    method add (line 24) | def add(self, obs_t, action, reward, obs_tp1, done):
    method _encode_sample (line 33) | def _encode_sample(self, idxes):
    method sample (line 45) | def sample(self, batch_size):
  class PrioritizedReplayBuffer (line 71) | class PrioritizedReplayBuffer(ReplayBuffer):
    method __init__ (line 72) | def __init__(self, size, alpha):
    method add (line 100) | def add(self, *args, **kwargs):
    method _sample_proportional (line 107) | def _sample_proportional(self, batch_size):
    method sample (line 117) | def sample(self, batch_size, beta):
    method update_priorities (line 169) | def update_priorities(self, idxes, priorities):

FILE: baselines/deepq/utils.py
  class TfInput (line 9) | class TfInput(object):
    method __init__ (line 10) | def __init__(self, name="(unnamed)"):
    method get (line 17) | def get(self):
    method make_feed_dict (line 23) | def make_feed_dict(self, data):
  class PlaceholderTfInput (line 28) | class PlaceholderTfInput(TfInput):
    method __init__ (line 29) | def __init__(self, placeholder):
    method get (line 34) | def get(self):
    method make_feed_dict (line 37) | def make_feed_dict(self, data):
  class ObservationInput (line 41) | class ObservationInput(PlaceholderTfInput):
    method __init__ (line 42) | def __init__(self, observation_space, name=None):
    method get (line 56) | def get(self):

FILE: baselines/gail/adversary.py
  function logsigmoid (line 11) | def logsigmoid(a):
  function logit_bernoulli_entropy (line 16) | def logit_bernoulli_entropy(logits):
  class TransitionClassifier (line 20) | class TransitionClassifier(object):
    method __init__ (line 21) | def __init__(self, env, hidden_size, entcoeff=0.001, lr_rate=1e-3, sco...
    method build_ph (line 56) | def build_ph(self):
    method build_graph (line 62) | def build_graph(self, obs_ph, acs_ph, reuse=False):
    method get_trainable_variables (line 76) | def get_trainable_variables(self):
    method get_reward (line 79) | def get_reward(self, obs, acs):

FILE: baselines/gail/behavior_clone.py
  function argsparser (line 24) | def argsparser():
  function learn (line 42) | def learn(env, policy_func, dataset, optim_batch_size=128, max_iters=1e4,
  function get_task_name (line 80) | def get_task_name(args):
  function main (line 88) | def main(args):

FILE: baselines/gail/dataset/mujoco_dset.py
  class Dset (line 12) | class Dset(object):
    method __init__ (line 13) | def __init__(self, inputs, labels, randomize):
    method init_pointer (line 21) | def init_pointer(self):
    method get_next_batch (line 29) | def get_next_batch(self, batch_size):
  class Mujoco_Dset (line 42) | class Mujoco_Dset(object):
    method __init__ (line 43) | def __init__(self, expert_path, train_fraction=0.7, traj_limitation=-1...
    method log_info (line 79) | def log_info(self):
    method get_next_batch (line 85) | def get_next_batch(self, batch_size, split=None):
    method plot (line 95) | def plot(self):
  function test (line 102) | def test(expert_path, traj_limitation, plot):

FILE: baselines/gail/gail-eval.py
  function load_dataset (line 28) | def load_dataset(expert_path):
  function argsparser (line 33) | def argsparser():
  function evaluate_env (line 43) | def evaluate_env(env_name, seed, policy_hidden_size, stochastic, reuse, ...
  function plot (line 92) | def plot(env_name, bc_log, gail_log, stochastic):
  function main (line 130) | def main(args):

FILE: baselines/gail/mlp_policy.py
  class MlpPolicy (line 15) | class MlpPolicy(object):
    method __init__ (line 18) | def __init__(self, name, reuse=False, *args, **kwargs):
    method _init (line 25) | def _init(self, ob_space, ac_space, hid_size, num_hid_layers, gaussian...
    method act (line 64) | def act(self, stochastic, ob):
    method get_variables (line 68) | def get_variables(self):
    method get_trainable_variables (line 71) | def get_trainable_variables(self):
    method get_initial_state (line 74) | def get_initial_state(self):

FILE: baselines/gail/run_mujoco.py
  function argsparser (line 23) | def argsparser():
  function get_task_name (line 58) | def get_task_name(args):
  function main (line 71) | def main(args):
  function train (line 121) | def train(env, seed, policy_fn, reward_giver, dataset, algo,
  function runner (line 157) | def runner(env, policy_func, load_model_path, timesteps_per_batch, numbe...
  function traj_1_generator (line 197) | def traj_1_generator(pi, env, horizon, stochastic):

FILE: baselines/gail/statistics.py
  class stats (line 11) | class stats():
    method __init__ (line 13) | def __init__(self, scalar_keys=[], histogram_keys=[]):
    method add_all_summary (line 34) | def add_all_summary(self, writer, values, iter):

FILE: baselines/gail/trpo_mpi.py
  function traj_segment_generator (line 23) | def traj_segment_generator(pi, env, reward_giver, horizon, stochastic):
  function add_vtarg_and_adv (line 91) | def add_vtarg_and_adv(seg, gamma, lam):
  function learn (line 105) | def learn(env, policy_func, reward_giver, expert_dataset, rank,
  function flatten_lists (line 353) | def flatten_lists(listoflists):

FILE: baselines/her/actor_critic.py
  class ActorCritic (line 5) | class ActorCritic:
    method __init__ (line 7) | def __init__(self, inputs_tf, dimo, dimg, dimu, max_u, o_stats, g_stat...

FILE: baselines/her/ddpg.py
  function dims_to_shapes (line 16) | def dims_to_shapes(input_dims):
  class DDPG (line 22) | class DDPG(object):
    method __init__ (line 24) | def __init__(self, input_dims, buffer_size, hidden, layers, network_cl...
    method _random_action (line 109) | def _random_action(self, n):
    method _preprocess_og (line 112) | def _preprocess_og(self, o, ag, g):
    method step (line 123) | def step(self, obs):
    method get_actions (line 128) | def get_actions(self, o, ag, g, noise_eps=0., random_eps=0., use_targe...
    method init_demo_buffer (line 160) | def init_demo_buffer(self, demoDataFile, update_stats=True): #function...
    method store_episode (line 217) | def store_episode(self, episode_batch, update_stats=True):
    method get_current_buffer_size (line 242) | def get_current_buffer_size(self):
    method _sync_optimizers (line 245) | def _sync_optimizers(self):
    method _grads (line 249) | def _grads(self):
    method _update (line 259) | def _update(self, Q_grad, pi_grad):
    method sample_batch (line 263) | def sample_batch(self):
    method stage_batch (line 284) | def stage_batch(self, batch=None):
    method train (line 290) | def train(self, stage=True):
    method _init_target_net (line 297) | def _init_target_net(self):
    method update_target_net (line 300) | def update_target_net(self):
    method clear_buffer (line 303) | def clear_buffer(self):
    method _vars (line 306) | def _vars(self, scope):
    method _global_vars (line 311) | def _global_vars(self, scope):
    method _create_network (line 315) | def _create_network(self, reuse=False):
    method logs (line 406) | def logs(self, prefix=''):
    method __getstate__ (line 418) | def __getstate__(self):
    method __setstate__ (line 430) | def __setstate__(self, state):
    method save (line 446) | def save(self, save_path):

FILE: baselines/her/experiment/config.py
  function cached_make_env (line 61) | def cached_make_env(make_env):
  function prepare_params (line 73) | def prepare_params(kwargs):
  function log_params (line 122) | def log_params(params, logger=logger):
  function configure_her (line 127) | def configure_her(params):
  function simple_goal_subtract (line 147) | def simple_goal_subtract(a, b):
  function configure_ddpg (line 152) | def configure_ddpg(dims, params, reuse=False, use_mpi=True, clip_return=...
  function configure_dims (line 186) | def configure_dims(params):

FILE: baselines/her/experiment/data_generation/fetch_data_generation.py
  function main (line 11) | def main():
  function goToGoal (line 30) | def goToGoal(env, lastObs):

FILE: baselines/her/experiment/play.py
  function main (line 17) | def main(policy_file, seed, n_test_rollouts, render):

FILE: baselines/her/experiment/plot.py
  function smooth_reward_curve (line 12) | def smooth_reward_curve(x, y):
  function load_results (line 21) | def load_results(file):
  function pad (line 40) | def pad(xs, value=np.nan):

FILE: baselines/her/her.py
  function mpi_average (line 14) | def mpi_average(value):
  function train (line 22) | def train(*, policy, rollout_worker, evaluator,
  function learn (line 87) | def learn(*, network, env, total_timesteps,
  function main (line 188) | def main(**kwargs):

FILE: baselines/her/her_sampler.py
  function make_sample_her_transitions (line 4) | def make_sample_her_transitions(replay_strategy, replay_k, reward_fun):

FILE: baselines/her/normalizer.py
  class Normalizer (line 10) | class Normalizer:
    method __init__ (line 11) | def __init__(self, size, eps=1e-2, default_clip_range=np.inf, sess=None):
    method update (line 64) | def update(self, v):
    method normalize (line 72) | def normalize(self, v, clip_range=None):
    method denormalize (line 79) | def denormalize(self, v):
    method _mpi_average (line 84) | def _mpi_average(self, x):
    method synchronize (line 90) | def synchronize(self, local_sum, local_sumsq, local_count, root=None):
    method recompute_stats (line 96) | def recompute_stats(self):
  class IdentityNormalizer (line 121) | class IdentityNormalizer:
    method __init__ (line 122) | def __init__(self, size, std=1.):
    method update (line 127) | def update(self, x):
    method normalize (line 130) | def normalize(self, x, clip_range=None):
    method denormalize (line 133) | def denormalize(self, x):
    method synchronize (line 136) | def synchronize(self):
    method recompute_stats (line 139) | def recompute_stats(self):

FILE: baselines/her/replay_buffer.py
  class ReplayBuffer (line 6) | class ReplayBuffer:
    method __init__ (line 7) | def __init__(self, buffer_shapes, size_in_transitions, T, sample_trans...
    method full (line 33) | def full(self):
    method sample (line 37) | def sample(self, batch_size):
    method store_episode (line 57) | def store_episode(self, episode_batch):
    method get_current_episode_size (line 73) | def get_current_episode_size(self):
    method get_current_size (line 77) | def get_current_size(self):
    method get_transitions_stored (line 81) | def get_transitions_stored(self):
    method clear_buffer (line 85) | def clear_buffer(self):
    method _get_storage_idx (line 89) | def _get_storage_idx(self, inc=None):

FILE: baselines/her/rollout.py
  class RolloutWorker (line 9) | class RolloutWorker:
    method __init__ (line 12) | def __init__(self, venv, policy, dims, logger, T, rollout_batch_size=1,
    method reset_all_rollouts (line 44) | def reset_all_rollouts(self):
    method generate_rollouts (line 50) | def generate_rollouts(self):
    method clear_history (line 138) | def clear_history(self):
    method current_success_rate (line 144) | def current_success_rate(self):
    method current_mean_Q (line 147) | def current_mean_Q(self):
    method save_policy (line 150) | def save_policy(self, path):
    method logs (line 156) | def logs(self, prefix='worker'):

FILE: baselines/her/util.py
  function store_args (line 14) | def store_args(method):
  function import_function (line 41) | def import_function(spec):
  function flatten_grads (line 50) | def flatten_grads(var_list, grads):
  function nn (line 57) | def nn(input, layers_sizes, reuse=None, flatten=False, name=""):
  function install_mpi_excepthook (line 75) | def install_mpi_excepthook():
  function mpi_fork (line 88) | def mpi_fork(n, extra_mpi_args=[]):
  function convert_episode_to_batch_major (line 114) | def convert_episode_to_batch_major(episode):
  function transitions_in_episode_batch (line 127) | def transitions_in_episode_batch(episode_batch):
  function reshape_for_broadcasting (line 134) | def reshape_for_broadcasting(source, target):

FILE: baselines/logger.py
  class KVWriter (line 19) | class KVWriter(object):
    method writekvs (line 20) | def writekvs(self, kvs):
  class SeqWriter (line 23) | class SeqWriter(object):
    method writeseq (line 24) | def writeseq(self, seq):
  class HumanOutputFormat (line 27) | class HumanOutputFormat(KVWriter, SeqWriter):
    method __init__ (line 28) | def __init__(self, filename_or_file):
    method writekvs (line 37) | def writekvs(self, kvs):
    method _truncate (line 71) | def _truncate(self, s):
    method writeseq (line 75) | def writeseq(self, seq):
    method close (line 84) | def close(self):
  class JSONOutputFormat (line 88) | class JSONOutputFormat(KVWriter):
    method __init__ (line 89) | def __init__(self, filename):
    method writekvs (line 92) | def writekvs(self, kvs):
    method close (line 99) | def close(self):
  class CSVOutputFormat (line 102) | class CSVOutputFormat(KVWriter):
    method __init__ (line 103) | def __init__(self, filename):
    method writekvs (line 108) | def writekvs(self, kvs):
    method close (line 135) | def close(self):
  class TensorBoardOutputFormat (line 139) | class TensorBoardOutputFormat(KVWriter):
    method __init__ (line 143) | def __init__(self, dir):
    method writekvs (line 158) | def writekvs(self, kvs):
    method close (line 169) | def close(self):
  function make_output_format (line 174) | def make_output_format(format, ev_dir, log_suffix=''):
  function logkv (line 193) | def logkv(key, val):
  function logkv_mean (line 201) | def logkv_mean(key, val):
  function logkvs (line 207) | def logkvs(d):
  function dumpkvs (line 214) | def dumpkvs():
  function getkvs (line 220) | def getkvs():
  function log (line 224) | def log(*args, level=INFO):
  function debug (line 230) | def debug(*args):
  function info (line 233) | def info(*args):
  function warn (line 236) | def warn(*args):
  function error (line 239) | def error(*args):
  function set_level (line 243) | def set_level(level):
  function set_comm (line 249) | def set_comm(comm):
  function get_dir (line 252) | def get_dir():
  function profile_kv (line 263) | def profile_kv(scopename):
  function profile (line 271) | def profile(n):
  function get_current (line 289) | def get_current():
  class Logger (line 296) | class Logger(object):
    method __init__ (line 301) | def __init__(self, dir, output_formats, comm=None):
    method logkv (line 311) | def logkv(self, key, val):
    method logkv_mean (line 314) | def logkv_mean(self, key, val):
    method dumpkvs (line 319) | def dumpkvs(self):
    method log (line 337) | def log(self, *args, level=INFO):
    method set_level (line 343) | def set_level(self, level):
    method set_comm (line 346) | def set_comm(self, comm):
    method get_dir (line 349) | def get_dir(self):
    method close (line 352) | def close(self):
    method _do_log (line 358) | def _do_log(self, args):
  function get_rank_without_mpi_import (line 363) | def get_rank_without_mpi_import():
  function configure (line 372) | def configure(dir=None, format_strs=None, comm=None, log_suffix=''):
  function _configure_default_logger (line 401) | def _configure_default_logger():
  function reset (line 405) | def reset():
  function scoped_configure (line 412) | def scoped_configure(dir=None, format_strs=None, comm=None):
  function _demo (line 423) | def _demo():
  function read_json (line 456) | def read_json(fname):
  function read_csv (line 464) | def read_csv(fname):
  function read_tb (line 468) | def read_tb(path):

FILE: baselines/ppo1/cnn_policy.py
  class CnnPolicy (line 6) | class CnnPolicy(object):
    method __init__ (line 8) | def __init__(self, name, ob_space, ac_space, kind='large'):
    method _init (line 13) | def _init(self, ob_space, ac_space, kind):
    method act (line 47) | def act(self, stochastic, ob):
    method get_variables (line 50) | def get_variables(self):
    method get_trainable_variables (line 52) | def get_trainable_variables(self):
    method get_initial_state (line 54) | def get_initial_state(self):

FILE: baselines/ppo1/mlp_policy.py
  class MlpPolicy (line 7) | class MlpPolicy(object):
    method __init__ (line 9) | def __init__(self, name, *args, **kwargs):
    method _init (line 14) | def _init(self, ob_space, ac_space, hid_size, num_hid_layers, gaussian...
    method act (line 52) | def act(self, stochastic, ob):
    method get_variables (line 55) | def get_variables(self):
    method get_trainable_variables (line 57) | def get_trainable_variables(self):
    method get_initial_state (line 59) | def get_initial_state(self):

FILE: baselines/ppo1/pposgd_simple.py
  function traj_segment_generator (line 11) | def traj_segment_generator(pi, env, horizon, stochastic):
  function add_vtarg_and_adv (line 64) | def add_vtarg_and_adv(seg, gamma, lam):
  function learn (line 80) | def learn(env, policy_fn, *,
  function flatten_lists (line 216) | def flatten_lists(listoflists):

FILE: baselines/ppo1/run_atari.py
  function train (line 11) | def train(env_id, num_timesteps, seed):
  function main (line 43) | def main():

FILE: baselines/ppo1/run_humanoid.py
  function train (line 9) | def train(num_timesteps, seed, model_path=None):
  class RewScale (line 40) | class RewScale(gym.RewardWrapper):
    method __init__ (line 41) | def __init__(self, env, scale):
    method reward (line 44) | def reward(self, r):
  function main (line 47) | def main():

FILE: baselines/ppo1/run_mujoco.py
  function train (line 7) | def train(env_id, num_timesteps, seed):
  function main (line 23) | def main():

FILE: baselines/ppo1/run_robotics.py
  function train (line 10) | def train(env_id, num_timesteps, seed):
  function main (line 34) | def main():

FILE: baselines/ppo2/defaults.py
  function mujoco (line 1) | def mujoco():
  function atari (line 15) | def atari():
  function retro (line 24) | def retro():

FILE: baselines/ppo2/microbatched_model.py
  class MicrobatchedModel (line 5) | class MicrobatchedModel(Model):
    method __init__ (line 10) | def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_t...
    method train (line 35) | def train(self, lr, cliprange, obs, returns, masks, actions, values, n...

FILE: baselines/ppo2/model.py
  class Model (line 14) | class Model(object):
    method __init__ (line 27) | def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_t...
    method train (line 133) | def train(self, lr, cliprange, obs, returns, masks, actions, values, n...

FILE: baselines/ppo2/ppo2.py
  function constfn (line 16) | def constfn(val):
  function learn (line 21) | def learn(*, network, env, total_timesteps, eval_env = None, seed=None, ...
  function safemean (line 220) | def safemean(xs):

FILE: baselines/ppo2/runner.py
  class Runner (line 4) | class Runner(AbstractEnvRunner):
    method __init__ (line 13) | def __init__(self, *, env, model, nsteps, gamma, lam):
    method run (line 20) | def run(self):
  function sf01 (line 69) | def sf01(arr):

FILE: baselines/ppo2/test_microbatches.py
  function test_microbatches (line 12) | def test_microbatches():

FILE: baselines/results_plotter.py
  function rolling_window (line 21) | def rolling_window(a, window):
  function window_func (line 26) | def window_func(x, y, window, func):
  function ts2xy (line 31) | def ts2xy(ts, xaxis, yaxis):
  function plot_curves (line 48) | def plot_curves(xy_list, xaxis, yaxis, title):
  function split_by_task (line 66) | def split_by_task(taskpath):
  function plot_results (line 69) | def plot_results(dirs, num_timesteps=10e6, xaxis=X_TIMESTEPS, yaxis=Y_RE...
  function main (line 79) | def main():

FILE: baselines/run.py
  function train (line 53) | def train(args, extra_args):
  function build_env (line 86) | def build_env(args):
  function get_env_type (line 121) | def get_env_type(args):
  function get_default_network (line 148) | def get_default_network(env_type):
  function get_alg_module (line 154) | def get_alg_module(alg, submodule=None):
  function get_learn_function (line 166) | def get_learn_function(alg):
  function get_learn_function_defaults (line 170) | def get_learn_function_defaults(alg, env_type):
  function parse_cmdline_kwargs (line 180) | def parse_cmdline_kwargs(args):
  function configure_logger (line 195) | def configure_logger(log_path, **kwargs):
  function main (line 202) | def main(args):

FILE: baselines/trpo_mpi/defaults.py
  function atari (line 4) | def atari():
  function mujoco (line 18) | def mujoco():

FILE: baselines/trpo_mpi/trpo_mpi.py
  function traj_segment_generator (line 20) | def traj_segment_generator(pi, env, horizon, stochastic):
  function add_vtarg_and_adv (line 76) | def add_vtarg_and_adv(seg, gamma, lam):
  function learn (line 89) | def learn(*,
  function flatten_lists (line 394) | def flatten_lists(listoflists):
  function get_variables (line 397) | def get_variables(scope):
  function get_trainable_variables (line 400) | def get_trainable_variables(scope):
  function get_vf_trainable_variables (line 403) | def get_vf_trainable_variables(scope):
  function get_pi_trainable_variables (line 406) | def get_pi_trainable_variables(scope):

Condensed preview of 171 files, each showing path, character count, and a content snippet (1,896K chars of full structured content).
[
  {
    "path": ".benchmark_pattern",
    "chars": 1,
    "preview": "\n"
  },
  {
    "path": ".gitignore",
    "chars": 283,
    "preview": "*.swp\n*.pyc\n*.pkl\n*.py~\n.pytest_cache\n.DS_Store\n.idea\n\n# Setuptools distribution and build folders.\n/dist/\n/build\nkeys/\n"
  },
  {
    "path": ".travis.yml",
    "chars": 243,
    "preview": "language: python\npython:\n    - \"3.6\"\n\nservices:\n    - docker\n\ninstall:\n    - pip install flake8\n    - docker build . -t "
  },
  {
    "path": "Dockerfile",
    "chars": 465,
    "preview": "FROM python:3.6\n\nRUN apt-get -y update && apt-get -y install ffmpeg\n# RUN apt-get -y update && apt-get -y install git wg"
  },
  {
    "path": "LICENSE",
    "chars": 1087,
    "preview": "The MIT License\n\nCopyright (c) 2017 OpenAI (http://openai.com)\n\nPermission is hereby granted, free of charge, to any per"
  },
  {
    "path": "README.md",
    "chars": 8617,
    "preview": "**Status:** Maintenance (expect bug fixes and minor updates)\n\n<img src=\"data/logo.jpg\" width=25% align=\"right\" /> [![Bui"
  },
  {
    "path": "baselines/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/a2c/README.md",
    "chars": 833,
    "preview": "# A2C\n\n- Original paper: https://arxiv.org/abs/1602.01783\n- Baselines blog post: https://blog.openai.com/baselines-acktr"
  },
  {
    "path": "baselines/a2c/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/a2c/a2c.py",
    "chars": 9451,
    "preview": "import time\nimport functools\nimport tensorflow as tf\n\nfrom baselines import logger\n\nfrom baselines.common import set_glo"
  },
  {
    "path": "baselines/a2c/runner.py",
    "chars": 3241,
    "preview": "import numpy as np\nfrom baselines.a2c.utils import discount_with_dones\nfrom baselines.common.runners import AbstractEnvR"
  },
  {
    "path": "baselines/a2c/utils.py",
    "chars": 9348,
    "preview": "import os\nimport numpy as np\nimport tensorflow as tf\nfrom collections import deque\n\ndef sample(logits):\n    noise = tf.r"
  },
  {
    "path": "baselines/acer/README.md",
    "chars": 301,
    "preview": "# ACER\n\n- Original paper: https://arxiv.org/abs/1611.01224\n- `python -m baselines.run --alg=acer --env=PongNoFrameskip-v"
  },
  {
    "path": "baselines/acer/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/acer/acer.py",
    "chars": 18596,
    "preview": "import time\nimport functools\nimport numpy as np\nimport tensorflow as tf\nfrom baselines import logger\n\nfrom baselines.com"
  },
  {
    "path": "baselines/acer/buffer.py",
    "chars": 5881,
    "preview": "import numpy as np\n\nclass Buffer(object):\n    # gets obs, actions, rewards, mu's, (states, masks), dones\n    def __init_"
  },
  {
    "path": "baselines/acer/defaults.py",
    "chars": 66,
    "preview": "def atari():\n    return dict(\n        lrschedule='constant'\n    )\n"
  },
  {
    "path": "baselines/acer/policies.py",
    "chars": 2807,
    "preview": "import numpy as np\nimport tensorflow as tf\nfrom baselines.common.policies import nature_cnn\nfrom baselines.a2c.utils imp"
  },
  {
    "path": "baselines/acer/runner.py",
    "chars": 2689,
    "preview": "import numpy as np\nfrom baselines.common.runners import AbstractEnvRunner\nfrom baselines.common.vec_env.vec_frame_stack "
  },
  {
    "path": "baselines/acktr/README.md",
    "chars": 893,
    "preview": "# ACKTR\n\n- Original paper: https://arxiv.org/abs/1708.05144\n- Baselines blog post: https://blog.openai.com/baselines-ack"
  },
  {
    "path": "baselines/acktr/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/acktr/acktr.py",
    "chars": 7037,
    "preview": "import os.path as osp\nimport time\nimport functools\nimport tensorflow as tf\nfrom baselines import logger\n\nfrom baselines."
  },
  {
    "path": "baselines/acktr/defaults.py",
    "chars": 87,
    "preview": "def mujoco():\n    return dict(\n        nsteps=2500,\n        value_network='copy'\n    )\n"
  },
  {
    "path": "baselines/acktr/kfac.py",
    "chars": 45679,
    "preview": "import tensorflow as tf\nimport numpy as np\nimport re\n\n # flake8: noqa F403, F405\nfrom baselines.acktr.kfac_utils import "
  },
  {
    "path": "baselines/acktr/kfac_utils.py",
    "chars": 3389,
    "preview": "import tensorflow as tf\n\ndef gmatmul(a, b, transpose_a=False, transpose_b=False, reduce_dim=None):\n    assert reduce_dim"
  },
  {
    "path": "baselines/acktr/utils.py",
    "chars": 1322,
    "preview": "import tensorflow as tf\n\ndef dense(x, size, name, weight_init=None, bias_init=0, weight_loss_dict=None, reuse=None):\n   "
  },
  {
    "path": "baselines/bench/__init__.py",
    "chars": 99,
    "preview": "# flake8: noqa F403\nfrom baselines.bench.benchmarks import *\nfrom baselines.bench.monitor import *\n"
  },
  {
    "path": "baselines/bench/benchmarks.py",
    "chars": 6102,
    "preview": "import re\nimport os\nSCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))\n\n_atari7 = ['BeamRider', 'Breakout', 'Enduro"
  },
  {
    "path": "baselines/bench/monitor.py",
    "chars": 5741,
    "preview": "__all__ = ['Monitor', 'get_monitor_files', 'load_results']\n\nfrom gym.core import Wrapper\nimport time\nfrom glob import gl"
  },
  {
    "path": "baselines/bench/test_monitor.py",
    "chars": 861,
    "preview": "from .monitor import Monitor\nimport gym\nimport json\n\ndef test_monitor():\n    import pandas\n    import os\n    import uuid"
  },
  {
    "path": "baselines/common/__init__.py",
    "chars": 191,
    "preview": "# flake8: noqa F403\nfrom baselines.common.console_util import *\nfrom baselines.common.dataset import Dataset\nfrom baseli"
  },
  {
    "path": "baselines/common/atari_wrappers.py",
    "chars": 9686,
    "preview": "import numpy as np\nimport os\nos.environ.setdefault('PATH', '')\nfrom collections import deque\nimport gym\nfrom gym import "
  },
  {
    "path": "baselines/common/cg.py",
    "chars": 897,
    "preview": "import numpy as np\ndef cg(f_Ax, b, cg_iters=10, callback=None, verbose=False, residual_tol=1e-10):\n    \"\"\"\n    Demmel p "
  },
  {
    "path": "baselines/common/cmd_util.py",
    "chars": 7922,
    "preview": "\"\"\"\nHelpers for scripts like run_atari.py.\n\"\"\"\n\nimport os\ntry:\n    from mpi4py import MPI\nexcept ImportError:\n    MPI = "
  },
  {
    "path": "baselines/common/console_util.py",
    "chars": 2179,
    "preview": "from __future__ import print_function\nfrom contextlib import contextmanager\nimport numpy as np\nimport time\nimport shlex\n"
  },
  {
    "path": "baselines/common/dataset.py",
    "chars": 2132,
    "preview": "import numpy as np\n\nclass Dataset(object):\n    def __init__(self, data_map, deterministic=False, shuffle=True):\n        "
  },
  {
    "path": "baselines/common/distributions.py",
    "chars": 13595,
    "preview": "import tensorflow as tf\nimport numpy as np\nimport baselines.common.tf_util as U\nfrom baselines.a2c.utils import fc\nfrom "
  },
  {
    "path": "baselines/common/input.py",
    "chars": 2071,
    "preview": "import numpy as np\nimport tensorflow as tf\nfrom gym.spaces import Discrete, Box, MultiDiscrete\n\ndef observation_placehol"
  },
  {
    "path": "baselines/common/math_util.py",
    "chars": 2094,
    "preview": "import numpy as np\nimport scipy.signal\n\n\ndef discount(x, gamma):\n    \"\"\"\n    computes discounted sums along 0th dimensio"
  },
  {
    "path": "baselines/common/misc_util.py",
    "chars": 7166,
    "preview": "import gym\nimport numpy as np\nimport os\nimport pickle\nimport random\nimport tempfile\nimport zipfile\n\n\ndef zipsame(*seqs):"
  },
  {
    "path": "baselines/common/models.py",
    "chars": 8557,
    "preview": "import numpy as np\nimport tensorflow as tf\nfrom baselines.a2c import utils\nfrom baselines.a2c.utils import conv, fc, con"
  },
  {
    "path": "baselines/common/mpi_adam.py",
    "chars": 3296,
    "preview": "import baselines.common.tf_util as U\nimport tensorflow as tf\nimport numpy as np\ntry:\n    from mpi4py import MPI\nexcept I"
  },
  {
    "path": "baselines/common/mpi_adam_optimizer.py",
    "chars": 3976,
    "preview": "import numpy as np\nimport tensorflow as tf\nfrom baselines.common import tf_util as U\nfrom baselines.common.tests.test_wi"
  },
  {
    "path": "baselines/common/mpi_fork.py",
    "chars": 667,
    "preview": "import os, subprocess, sys\n\ndef mpi_fork(n, bind_to_core=False):\n    \"\"\"Re-launches the current script with workers\n    "
  },
  {
    "path": "baselines/common/mpi_moments.py",
    "chars": 2018,
    "preview": "from mpi4py import MPI\nimport numpy as np\nfrom baselines.common import zipsame\n\n\ndef mpi_mean(x, axis=0, comm=None, keep"
  },
  {
    "path": "baselines/common/mpi_running_mean_std.py",
    "chars": 3706,
    "preview": "try:\n    from mpi4py import MPI\nexcept ImportError:\n    MPI = None\n\nimport tensorflow as tf, baselines.common.tf_util as"
  },
  {
    "path": "baselines/common/mpi_util.py",
    "chars": 4259,
    "preview": "from collections import defaultdict\nimport os, numpy as np\nimport platform\nimport shutil\nimport subprocess\nimport warnin"
  },
  {
    "path": "baselines/common/plot_util.py",
    "chars": 18930,
    "preview": "import matplotlib.pyplot as plt\nimport os.path as osp\nimport json\nimport os\nimport numpy as np\nimport pandas\nfrom collec"
  },
  {
    "path": "baselines/common/policies.py",
    "chars": 6652,
    "preview": "import tensorflow as tf\nfrom baselines.common import tf_util\nfrom baselines.a2c.utils import fc\nfrom baselines.common.di"
  },
  {
    "path": "baselines/common/retro_wrappers.py",
    "chars": 9752,
    "preview": "from collections import deque\nimport cv2\ncv2.ocl.setUseOpenCL(False)\nfrom .atari_wrappers import WarpFrame, ClipRewardEn"
  },
  {
    "path": "baselines/common/runners.py",
    "chars": 670,
    "preview": "import numpy as np\nfrom abc import ABC, abstractmethod\n\nclass AbstractEnvRunner(ABC):\n    def __init__(self, *, env, mod"
  },
  {
    "path": "baselines/common/running_mean_std.py",
    "chars": 6081,
    "preview": "import tensorflow as tf\nimport numpy as np\nfrom baselines.common.tf_util import get_session\n\nclass RunningMeanStd(object"
  },
  {
    "path": "baselines/common/schedules.py",
    "chars": 3702,
    "preview": "\"\"\"This file is used for specifying various schedules that evolve over\ntime throughout the execution of the algorithm, s"
  },
  {
    "path": "baselines/common/segment_tree.py",
    "chars": 4899,
    "preview": "import operator\n\n\nclass SegmentTree(object):\n    def __init__(self, capacity, operation, neutral_element):\n        \"\"\"Bu"
  },
  {
    "path": "baselines/common/test_mpi_util.py",
    "chars": 986,
    "preview": "from baselines.common import mpi_util\nfrom baselines import logger\nfrom baselines.common.tests.test_with_mpi import with"
  },
  {
    "path": "baselines/common/tests/__init__.py",
    "chars": 89,
    "preview": "import os, pytest\nmark_slow = pytest.mark.skipif(not os.getenv('RUNSLOW'), reason='slow')"
  },
  {
    "path": "baselines/common/tests/envs/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/common/tests/envs/fixed_sequence_env.py",
    "chars": 1054,
    "preview": "import numpy as np\nfrom gym import Env\nfrom gym.spaces import Discrete\n\n\nclass FixedSequenceEnv(Env):\n    def __init__(\n"
  },
  {
    "path": "baselines/common/tests/envs/identity_env.py",
    "chars": 2444,
    "preview": "import numpy as np\nfrom abc import abstractmethod\nfrom gym import Env\nfrom gym.spaces import MultiDiscrete, Discrete, Bo"
  },
  {
    "path": "baselines/common/tests/envs/identity_env_test.py",
    "chars": 1034,
    "preview": "from baselines.common.tests.envs.identity_env import DiscreteIdentityEnv\n\n\ndef test_discrete_nodelay():\n    nsteps = 100"
  },
  {
    "path": "baselines/common/tests/envs/mnist_env.py",
    "chars": 2110,
    "preview": "import os.path as osp\nimport numpy as np\nimport tempfile\nfrom gym import Env\nfrom gym.spaces import Discrete, Box\n\n\n\ncla"
  },
  {
    "path": "baselines/common/tests/test_cartpole.py",
    "chars": 1098,
    "preview": "import pytest\nimport gym\n\nfrom baselines.run import get_learn_function\nfrom baselines.common.tests.util import reward_pe"
  },
  {
    "path": "baselines/common/tests/test_doc_examples.py",
    "chars": 1351,
    "preview": "import pytest\ntry:\n    import mujoco_py\n    _mujoco_present = True\nexcept BaseException:\n    mujoco_py = None\n    _mujoc"
  },
  {
    "path": "baselines/common/tests/test_env_after_learn.py",
    "chars": 865,
    "preview": "import pytest\nimport gym\nimport tensorflow as tf\n\nfrom baselines.common.vec_env.subproc_vec_env import SubprocVecEnv\nfro"
  },
  {
    "path": "baselines/common/tests/test_fetchreach.py",
    "chars": 860,
    "preview": "import pytest\nimport gym\n\nfrom baselines.run import get_learn_function\nfrom baselines.common.tests.util import reward_pe"
  },
  {
    "path": "baselines/common/tests/test_fixed_sequence.py",
    "chars": 1389,
    "preview": "import pytest\nfrom baselines.common.tests.envs.fixed_sequence_env import FixedSequenceEnv\n\nfrom baselines.common.tests.u"
  },
  {
    "path": "baselines/common/tests/test_identity.py",
    "chars": 2304,
    "preview": "import pytest\nfrom baselines.common.tests.envs.identity_env import DiscreteIdentityEnv, BoxIdentityEnv, MultiDiscreteIde"
  },
  {
    "path": "baselines/common/tests/test_mnist.py",
    "chars": 1515,
    "preview": "import pytest\n\n# from baselines.acer import acer_simple as acer\nfrom baselines.common.tests.envs.mnist_env import MnistE"
  },
  {
    "path": "baselines/common/tests/test_plot_util.py",
    "chars": 717,
    "preview": "# smoke tests of plot_util\nfrom baselines.common import plot_util as pu\nfrom baselines.common.tests.util import smoketes"
  },
  {
    "path": "baselines/common/tests/test_schedules.py",
    "chars": 823,
    "preview": "import numpy as np\n\nfrom baselines.common.schedules import ConstantSchedule, PiecewiseSchedule\n\n\ndef test_piecewise_sche"
  },
  {
    "path": "baselines/common/tests/test_segment_tree.py",
    "chars": 2691,
    "preview": "import numpy as np\n\nfrom baselines.common.segment_tree import SumSegmentTree, MinSegmentTree\n\n\ndef test_tree_set():\n    "
  },
  {
    "path": "baselines/common/tests/test_serialization.py",
    "chars": 4273,
    "preview": "import os\nimport gym\nimport tempfile\nimport pytest\nimport tensorflow as tf\nimport numpy as np\n\nfrom baselines.common.tes"
  },
  {
    "path": "baselines/common/tests/test_tf_util.py",
    "chars": 1072,
    "preview": "# tests for tf_util\nimport tensorflow as tf\nfrom baselines.common.tf_util import (\n    function,\n    initialize,\n    sin"
  },
  {
    "path": "baselines/common/tests/test_with_mpi.py",
    "chars": 997,
    "preview": "import os\nimport sys\nimport subprocess\nimport cloudpickle\nimport base64\nimport pytest\nfrom functools import wraps\n\ntry:\n"
  },
  {
    "path": "baselines/common/tests/util.py",
    "chars": 3181,
    "preview": "import tensorflow as tf\nimport numpy as np\nfrom baselines.common.vec_env.dummy_vec_env import DummyVecEnv\n\nN_TRIALS = 10"
  },
  {
    "path": "baselines/common/tf_util.py",
    "chars": 16969,
    "preview": "import numpy as np\nimport tensorflow as tf  # pylint: ignore-module\nimport copy\nimport os\nimport functools\nimport collec"
  },
  {
    "path": "baselines/common/tile_images.py",
    "chars": 763,
    "preview": "import numpy as np\n\ndef tile_images(img_nhwc):\n    \"\"\"\n    Tile N images into one big PxQ image\n    (P,Q) are chosen to "
  },
  {
    "path": "baselines/common/vec_env/__init__.py",
    "chars": 668,
    "preview": "from .vec_env import AlreadySteppingError, NotSteppingError, VecEnv, VecEnvWrapper, VecEnvObservationWrapper, Cloudpickl"
  },
  {
    "path": "baselines/common/vec_env/dummy_vec_env.py",
    "chars": 2923,
    "preview": "import numpy as np\nfrom .vec_env import VecEnv\nfrom .util import copy_obs_dict, dict_to_obs, obs_space_info\n\nclass Dummy"
  },
  {
    "path": "baselines/common/vec_env/shmem_vec_env.py",
    "chars": 5178,
    "preview": "\"\"\"\nAn interface for asynchronous vectorized environments.\n\"\"\"\n\nimport multiprocessing as mp\nimport numpy as np\nfrom .ve"
  },
  {
    "path": "baselines/common/vec_env/subproc_vec_env.py",
    "chars": 5069,
    "preview": "import multiprocessing as mp\n\nimport numpy as np\nfrom .vec_env import VecEnv, CloudpickleWrapper, clear_mpi_env_vars\n\n\nd"
  },
  {
    "path": "baselines/common/vec_env/test_vec_env.py",
    "chars": 5162,
    "preview": "\"\"\"\nTests for asynchronous vectorized environments.\n\"\"\"\n\nimport gym\nimport numpy as np\nimport pytest\nfrom .dummy_vec_env"
  },
  {
    "path": "baselines/common/vec_env/test_video_recorder.py",
    "chars": 1467,
    "preview": "\"\"\"\nTests for asynchronous vectorized environments.\n\"\"\"\n\nimport gym\nimport pytest\nimport os\nimport glob\nimport tempfile\n"
  },
  {
    "path": "baselines/common/vec_env/util.py",
    "chars": 1513,
    "preview": "\"\"\"\nHelpers for dealing with vectorized environments.\n\"\"\"\n\nfrom collections import OrderedDict\n\nimport gym\nimport numpy "
  },
  {
    "path": "baselines/common/vec_env/vec_env.py",
    "chars": 6195,
    "preview": "import contextlib\nimport os\nfrom abc import ABC, abstractmethod\n\nfrom baselines.common.tile_images import tile_images\n\nc"
  },
  {
    "path": "baselines/common/vec_env/vec_frame_stack.py",
    "chars": 1150,
    "preview": "from .vec_env import VecEnvWrapper\nimport numpy as np\nfrom gym import spaces\n\n\nclass VecFrameStack(VecEnvWrapper):\n    d"
  },
  {
    "path": "baselines/common/vec_env/vec_monitor.py",
    "chars": 1971,
    "preview": "from . import VecEnvWrapper\nfrom baselines.bench.monitor import ResultsWriter\nimport numpy as np\nimport time\nfrom collec"
  },
  {
    "path": "baselines/common/vec_env/vec_normalize.py",
    "chars": 1854,
    "preview": "from . import VecEnvWrapper\nimport numpy as np\n\nclass VecNormalize(VecEnvWrapper):\n    \"\"\"\n    A vectorized wrapper that"
  },
  {
    "path": "baselines/common/vec_env/vec_remove_dict_obs.py",
    "chars": 321,
    "preview": "from .vec_env import VecEnvObservationWrapper\n\nclass VecExtractDictObs(VecEnvObservationWrapper):\n    def __init__(self,"
  },
  {
    "path": "baselines/common/vec_env/vec_video_recorder.py",
    "chars": 2746,
    "preview": "import os\nfrom baselines import logger\nfrom baselines.common.vec_env import VecEnvWrapper\nfrom gym.wrappers.monitoring i"
  },
  {
    "path": "baselines/common/wrappers.py",
    "chars": 946,
    "preview": "import gym\n\nclass TimeLimit(gym.Wrapper):\n    def __init__(self, env, max_episode_steps=None):\n        super(TimeLimit, "
  },
  {
    "path": "baselines/ddpg/README.md",
    "chars": 330,
    "preview": "# DDPG\n\n- Original paper: https://arxiv.org/abs/1509.02971\n- Baselines post: https://blog.openai.com/better-exploration-"
  },
  {
    "path": "baselines/ddpg/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/ddpg/ddpg.py",
    "chars": 11283,
    "preview": "import os\nimport time\nfrom collections import deque\nimport pickle\n\nfrom baselines.ddpg.ddpg_learner import DDPG\nfrom bas"
  },
  {
    "path": "baselines/ddpg/ddpg_learner.py",
    "chars": 17731,
    "preview": "from copy import copy\nfrom functools import reduce\n\nimport numpy as np\nimport tensorflow as tf\nimport tensorflow.contrib"
  },
  {
    "path": "baselines/ddpg/memory.py",
    "chars": 2708,
    "preview": "import numpy as np\n\n\nclass RingBuffer(object):\n    def __init__(self, maxlen, shape, dtype='float32'):\n        self.maxl"
  },
  {
    "path": "baselines/ddpg/models.py",
    "chars": 1941,
    "preview": "import tensorflow as tf\nfrom baselines.common.models import get_network_builder\n\n\nclass Model(object):\n    def __init__("
  },
  {
    "path": "baselines/ddpg/noise.py",
    "chars": 2162,
    "preview": "import numpy as np\n\n\nclass AdaptiveParamNoiseSpec(object):\n    def __init__(self, initial_stddev=0.1, desired_action_std"
  },
  {
    "path": "baselines/ddpg/test_smoke.py",
    "chars": 413,
    "preview": "from baselines.common.tests.util import smoketest\ndef _run(argstr):\n    smoketest('--alg=ddpg --env=Pendulum-v0 --num_ti"
  },
  {
    "path": "baselines/deepq/README.md",
    "chars": 1604,
    "preview": "## If you are curious.\n\n##### Train a Cartpole agent and watch it play once it converges!\n\nHere's a list of commands to "
  },
  {
    "path": "baselines/deepq/__init__.py",
    "chars": 409,
    "preview": "from baselines.deepq import models  # noqa\nfrom baselines.deepq.build_graph import build_act, build_train  # noqa\nfrom b"
  },
  {
    "path": "baselines/deepq/build_graph.py",
    "chars": 20635,
    "preview": "\"\"\"Deep Q learning graph\n\nThe functions in this file can are used to create the following functions:\n\n======= act ======"
  },
  {
    "path": "baselines/deepq/deepq.py",
    "chars": 13125,
    "preview": "import os\nimport tempfile\n\nimport tensorflow as tf\nimport zipfile\nimport cloudpickle\nimport numpy as np\n\nimport baseline"
  },
  {
    "path": "baselines/deepq/defaults.py",
    "chars": 480,
    "preview": "def atari():\n    return dict(\n        network='conv_only',\n        lr=1e-4,\n        buffer_size=10000,\n        explorati"
  },
  {
    "path": "baselines/deepq/experiments/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/deepq/experiments/custom_cartpole.py",
    "chars": 3358,
    "preview": "import gym\nimport itertools\nimport numpy as np\nimport tensorflow as tf\nimport tensorflow.contrib.layers as layers\n\nimpor"
  },
  {
    "path": "baselines/deepq/experiments/enjoy_cartpole.py",
    "chars": 486,
    "preview": "import gym\n\nfrom baselines import deepq\n\n\ndef main():\n    env = gym.make(\"CartPole-v0\")\n    act = deepq.learn(env, netwo"
  },
  {
    "path": "baselines/deepq/experiments/enjoy_mountaincar.py",
    "chars": 600,
    "preview": "import gym\n\nfrom baselines import deepq\nfrom baselines.common import models\n\n\ndef main():\n    env = gym.make(\"MountainCa"
  },
  {
    "path": "baselines/deepq/experiments/enjoy_pong.py",
    "chars": 625,
    "preview": "import gym\nfrom baselines import deepq\n\n\ndef main():\n    env = gym.make(\"PongNoFrameskip-v4\")\n    env = deepq.wrap_atari"
  },
  {
    "path": "baselines/deepq/experiments/train_cartpole.py",
    "chars": 646,
    "preview": "import gym\n\nfrom baselines import deepq\n\n\ndef callback(lcl, _glb):\n    # stop training if reward exceeds 199\n    is_solv"
  },
  {
    "path": "baselines/deepq/experiments/train_mountaincar.py",
    "chars": 616,
    "preview": "import gym\n\nfrom baselines import deepq\nfrom baselines.common import models\n\n\ndef main():\n    env = gym.make(\"MountainCa"
  },
  {
    "path": "baselines/deepq/experiments/train_pong.py",
    "chars": 817,
    "preview": "from baselines import deepq\nfrom baselines import bench\nfrom baselines import logger\nfrom baselines.common.atari_wrapper"
  },
  {
    "path": "baselines/deepq/models.py",
    "chars": 2194,
    "preview": "import tensorflow as tf\nimport tensorflow.contrib.layers as layers\n\n\ndef build_q_func(network, hiddens=[256], dueling=Tr"
  },
  {
    "path": "baselines/deepq/replay_buffer.py",
    "chars": 6475,
    "preview": "import numpy as np\nimport random\n\nfrom baselines.common.segment_tree import SumSegmentTree, MinSegmentTree\n\n\nclass Repla"
  },
  {
    "path": "baselines/deepq/utils.py",
    "chars": 1885,
    "preview": "from baselines.common.input import observation_input\nfrom baselines.common.tf_util import adjust_shape\n\n# =============="
  },
  {
    "path": "baselines/gail/README.md",
    "chars": 1094,
    "preview": "# Generative Adversarial Imitation Learning (GAIL)\n\n- Original paper: https://arxiv.org/abs/1606.03476\n\nFor results benc"
  },
  {
    "path": "baselines/gail/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/gail/adversary.py",
    "chars": 4674,
    "preview": "'''\nReference: https://github.com/openai/imitation\nI follow the architecture from the official repository\n'''\nimport ten"
  },
  {
    "path": "baselines/gail/behavior_clone.py",
    "chars": 5195,
    "preview": "'''\nThe code is used to train BC imitator, or pretrained GAIL imitator\n'''\n\nimport argparse\nimport tempfile\nimport os.pa"
  },
  {
    "path": "baselines/gail/dataset/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/gail/dataset/mujoco_dset.py",
    "chars": 4448,
    "preview": "'''\nData structure of the input .npz:\nthe data is save in python dictionary format with keys: 'acs', 'ep_rets', 'rews', "
  },
  {
    "path": "baselines/gail/gail-eval.py",
    "chars": 5886,
    "preview": "'''\nThis code is used to evalaute the imitators trained with different number of trajectories\nand plot the results in th"
  },
  {
    "path": "baselines/gail/mlp_policy.py",
    "chars": 2930,
    "preview": "'''\nfrom baselines/ppo1/mlp_policy.py and add simple modification\n(1) add reuse argument\n(2) cache the `stochastic` plac"
  },
  {
    "path": "baselines/gail/result/gail-result.md",
    "chars": 2575,
    "preview": "# Results of GAIL/BC on Mujoco\n\nHere's the extensive experimental results of applying GAIL/BC on Mujoco environments, in"
  },
  {
    "path": "baselines/gail/run_mujoco.py",
    "chars": 9366,
    "preview": "'''\nDisclaimer: this code is highly based on trpo_mpi at @openai/baselines and @openai/imitation\n'''\n\nimport argparse\nim"
  },
  {
    "path": "baselines/gail/statistics.py",
    "chars": 1802,
    "preview": "'''\nThis code is highly based on https://github.com/carpedm20/deep-rl-tensorflow/blob/master/agents/statistic.py\n'''\n\nim"
  },
  {
    "path": "baselines/gail/trpo_mpi.py",
    "chars": 14662,
    "preview": "'''\nDisclaimer: The trpo part highly rely on trpo_mpi at @openai/baselines\n'''\n\nimport time\nimport os\nfrom contextlib im"
  },
  {
    "path": "baselines/her/README.md",
    "chars": 5129,
    "preview": "# Hindsight Experience Replay\nFor details on Hindsight Experience Replay (HER), please read the [paper](https://arxiv.or"
  },
  {
    "path": "baselines/her/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/her/actor_critic.py",
    "chars": 1996,
    "preview": "import tensorflow as tf\nfrom baselines.her.util import store_args, nn\n\n\nclass ActorCritic:\n    @store_args\n    def __ini"
  },
  {
    "path": "baselines/her/ddpg.py",
    "chars": 21980,
    "preview": "from collections import OrderedDict\n\nimport numpy as np\nimport tensorflow as tf\nfrom tensorflow.contrib.staging import S"
  },
  {
    "path": "baselines/her/experiment/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/her/experiment/config.py",
    "chars": 7705,
    "preview": "import os\nimport numpy as np\nimport gym\n\nfrom baselines import logger\nfrom baselines.her.ddpg import DDPG\nfrom baselines"
  },
  {
    "path": "baselines/her/experiment/data_generation/fetch_data_generation.py",
    "chars": 3603,
    "preview": "import gym\nimport numpy as np\n\n\n\"\"\"Data generation for the case of a single block pick and place in Fetch Env\"\"\"\n\naction"
  },
  {
    "path": "baselines/her/experiment/play.py",
    "chars": 1775,
    "preview": "# DEPRECATED, use --play flag to baselines.run instead\nimport click\nimport numpy as np\nimport pickle\n\nfrom baselines imp"
  },
  {
    "path": "baselines/her/experiment/plot.py",
    "chars": 3611,
    "preview": "# DEPRECATED, use baselines.common.plot_util instead\n\nimport os\nimport matplotlib.pyplot as plt\nimport numpy as np\nimpor"
  },
  {
    "path": "baselines/her/her.py",
    "chars": 7498,
    "preview": "import os\n\nimport click\nimport numpy as np\nimport json\nfrom mpi4py import MPI\n\nfrom baselines import logger\nfrom baselin"
  },
  {
    "path": "baselines/her/her_sampler.py",
    "chars": 2822,
    "preview": "import numpy as np\n\n\ndef make_sample_her_transitions(replay_strategy, replay_k, reward_fun):\n    \"\"\"Creates a sample fun"
  },
  {
    "path": "baselines/her/normalizer.py",
    "chars": 5304,
    "preview": "import threading\n\nimport numpy as np\nfrom mpi4py import MPI\nimport tensorflow as tf\n\nfrom baselines.her.util import resh"
  },
  {
    "path": "baselines/her/replay_buffer.py",
    "chars": 3669,
    "preview": "import threading\n\nimport numpy as np\n\n\nclass ReplayBuffer:\n    def __init__(self, buffer_shapes, size_in_transitions, T,"
  },
  {
    "path": "baselines/her/rollout.py",
    "chars": 6782,
    "preview": "from collections import deque\n\nimport numpy as np\nimport pickle\n\nfrom baselines.her.util import convert_episode_to_batch"
  },
  {
    "path": "baselines/her/util.py",
    "chars": 4038,
    "preview": "import os\nimport subprocess\nimport sys\nimport importlib\nimport inspect\nimport functools\n\nimport tensorflow as tf\nimport "
  },
  {
    "path": "baselines/logger.py",
    "chars": 14802,
    "preview": "import os\nimport sys\nimport shutil\nimport os.path as osp\nimport json\nimport time\nimport datetime\nimport tempfile\nfrom co"
  },
  {
    "path": "baselines/ppo1/README.md",
    "chars": 629,
    "preview": "# PPOSGD\n\n- Original paper: https://arxiv.org/abs/1707.06347\n- Baselines blog post: https://blog.openai.com/openai-basel"
  },
  {
    "path": "baselines/ppo1/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/ppo1/cnn_policy.py",
    "chars": 2417,
    "preview": "import baselines.common.tf_util as U\nimport tensorflow as tf\nimport gym\nfrom baselines.common.distributions import make_"
  },
  {
    "path": "baselines/ppo1/mlp_policy.py",
    "chars": 2842,
    "preview": "from baselines.common.mpi_running_mean_std import RunningMeanStd\nimport baselines.common.tf_util as U\nimport tensorflow "
  },
  {
    "path": "baselines/ppo1/pposgd_simple.py",
    "chars": 9432,
    "preview": "from baselines.common import Dataset, explained_variance, fmt_row, zipsame\nfrom baselines import logger\nimport baselines"
  },
  {
    "path": "baselines/ppo1/run_atari.py",
    "chars": 1583,
    "preview": "#!/usr/bin/env python3\n\nfrom mpi4py import MPI\nfrom baselines.common import set_global_seeds\nfrom baselines import bench"
  },
  {
    "path": "baselines/ppo1/run_humanoid.py",
    "chars": 2434,
    "preview": "#!/usr/bin/env python3\nimport os\nfrom baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser\nfrom baselines"
  },
  {
    "path": "baselines/ppo1/run_mujoco.py",
    "chars": 1025,
    "preview": "#!/usr/bin/env python3\n\nfrom baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser\nfrom baselines.common i"
  },
  {
    "path": "baselines/ppo1/run_robotics.py",
    "chars": 1293,
    "preview": "#!/usr/bin/env python3\n\nfrom mpi4py import MPI\nfrom baselines.common import set_global_seeds\nfrom baselines import logge"
  },
  {
    "path": "baselines/ppo2/README.md",
    "chars": 504,
    "preview": "# PPO2\n\n- Original paper: https://arxiv.org/abs/1707.06347\n- Baselines blog post: https://blog.openai.com/openai-baselin"
  },
  {
    "path": "baselines/ppo2/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/ppo2/defaults.py",
    "chars": 518,
    "preview": "def mujoco():\n    return dict(\n        nsteps=2048,\n        nminibatches=32,\n        lam=0.95,\n        gamma=0.99,\n     "
  },
  {
    "path": "baselines/ppo2/microbatched_model.py",
    "chars": 3241,
    "preview": "import tensorflow as tf\nimport numpy as np\nfrom baselines.ppo2.model import Model\n\nclass MicrobatchedModel(Model):\n    \""
  },
  {
    "path": "baselines/ppo2/model.py",
    "chars": 6054,
    "preview": "import tensorflow as tf\nimport functools\n\nfrom baselines.common.tf_util import get_session, save_variables, load_variabl"
  },
  {
    "path": "baselines/ppo2/ppo2.py",
    "chars": 10229,
    "preview": "import os\nimport time\nimport numpy as np\nimport os.path as osp\nfrom baselines import logger\nfrom collections import dequ"
  },
  {
    "path": "baselines/ppo2/runner.py",
    "chars": 3194,
    "preview": "import numpy as np\nfrom baselines.common.runners import AbstractEnvRunner\n\nclass Runner(AbstractEnvRunner):\n    \"\"\"\n    "
  },
  {
    "path": "baselines/ppo2/test_microbatches.py",
    "chars": 1152,
    "preview": "import gym\nimport tensorflow as tf\nimport numpy as np\nfrom functools import partial\n\nfrom baselines.common.vec_env.dummy"
  },
  {
    "path": "baselines/results_plotter.py",
    "chars": 3455,
    "preview": "import numpy as np\nimport matplotlib\nmatplotlib.use('TkAgg') # Can change to 'Agg' for non-interactive mode\n\nimport matp"
  },
  {
    "path": "baselines/run.py",
    "chars": 7388,
    "preview": "import sys\nimport re\nimport multiprocessing\nimport os.path as osp\nimport gym\nfrom collections import defaultdict\nimport "
  },
  {
    "path": "baselines/trpo_mpi/README.md",
    "chars": 532,
    "preview": "# trpo_mpi\n\n- Original paper: https://arxiv.org/abs/1502.05477\n- Baselines blog post https://blog.openai.com/openai-base"
  },
  {
    "path": "baselines/trpo_mpi/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "baselines/trpo_mpi/defaults.py",
    "chars": 638,
    "preview": "from baselines.common.models import mlp, cnn_small\n\n\ndef atari():\n    return dict(\n        network = cnn_small(),\n      "
  },
  {
    "path": "baselines/trpo_mpi/trpo_mpi.py",
    "chars": 15098,
    "preview": "from baselines.common import explained_variance, zipsame, dataset\nfrom baselines import logger\nimport baselines.common.t"
  },
  {
    "path": "benchmarks_atari10M.htm",
    "chars": 435960,
    "preview": "<html lang=\"en\">\n<head>\n    <meta charset=\"utf-8\">\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale="
  },
  {
    "path": "benchmarks_mujoco1M.htm",
    "chars": 156672,
    "preview": "<html lang=\"en\">\n<head>\n    <meta charset=\"utf-8\">\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale="
  },
  {
    "path": "docs/viz/viz.ipynb",
    "chars": 580297,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"colab_type\": \"text\",\n    \"id\": \"Ynb-laSwmpac\"\n   },\n"
  },
  {
    "path": "setup.cfg",
    "chars": 93,
    "preview": "[flake8]\nselect = F,E999,W291,W293\nexclude = \n    .git,\n    __pycache__,\n    baselines/ppo1,\n"
  },
  {
    "path": "setup.py",
    "chars": 1670,
    "preview": "import re\nfrom setuptools import setup, find_packages\nimport sys\n\nif sys.version_info.major != 3:\n    print('This Python"
  }
]

About this extraction

This page contains the full source code of the openai/baselines GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 171 files (1.7 MB), approximately 822.4k tokens, and a symbol index with 1041 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
