Repository: DanielTakeshi/rl_algorithms
Branch: master
Commit: 7a43fa485137
Files: 89
Total size: 2.9 MB

Directory structure:
gitextract_kv5056s4/

├── .gitignore
├── LICENSE
├── README.md
├── bc/
│   ├── README.md
│   ├── bash_scripts/
│   │   ├── demo.bash
│   │   ├── gen_exp_data.sh
│   │   └── runbc_allmujoco.sh
│   ├── bc.py
│   ├── experts/
│   │   ├── Ant-v1.pkl
│   │   ├── HalfCheetah-v1.pkl
│   │   ├── Hopper-v1.pkl
│   │   ├── Humanoid-v1.pkl
│   │   ├── Reacher-v1.pkl
│   │   └── Walker2d-v1.pkl
│   ├── load_policy.py
│   ├── plot_bc.py
│   ├── random_logs/
│   │   └── gen_exp_data.text
│   ├── run_expert.py
│   └── tf_util.py
├── ddpg/
│   ├── README.md
│   ├── ddpg.py
│   ├── main.py
│   └── replay_buffer.py
├── dqn/
│   ├── README.md
│   ├── atari_wrappers.py
│   ├── dqn.py
│   ├── dqn_utils.py
│   ├── logs_pkls/
│   │   ├── BeamRider_s001.pkl
│   │   ├── BeamRider_s002.pkl
│   │   ├── Breakout_s001.pkl
│   │   ├── Breakout_s002.pkl
│   │   ├── Enduro_s001.pkl
│   │   ├── Enduro_s002.pkl
│   │   ├── Pong_s001.pkl
│   │   └── Pong_s002.pkl
│   ├── logs_text/
│   │   ├── BeamRider_s001.text
│   │   ├── BeamRider_s002.text
│   │   ├── Breakout_s001.text
│   │   ├── Breakout_s002.text
│   │   ├── Enduro_s001.text
│   │   ├── Enduro_s002.text
│   │   ├── Pong_s001.text
│   │   └── Pong_s002.text
│   ├── plot_dqn.py
│   ├── run_dqn_atari.py
│   └── run_dqn_ram.py
├── es/
│   ├── README.md
│   ├── bash_scripts/
│   │   └── InvertedPendulum-v1.sh
│   ├── es.py
│   ├── logz.py
│   ├── main.py
│   ├── optimizers.py
│   ├── plot.py
│   ├── test.py
│   ├── toy_es.py
│   └── utils.py
├── g_learning/
│   ├── G-Learning.py
│   ├── README.md
│   └── __init__.py
├── lib/
│   ├── __init__.py
│   ├── envs/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── blackjack.py
│   │   ├── cliff_walking.py
│   │   ├── gridworld.py
│   │   ├── two_room_domain.py
│   │   └── windy_gridworld.py
│   └── plotting.py
├── q_learning/
│   ├── Q-Learning.py
│   ├── README.md
│   └── __init__.py
├── trpo/
│   ├── README.md
│   ├── fxn_approx.py
│   ├── main.py
│   ├── trpo.py
│   └── utils_trpo.py
├── utils/
│   ├── __init__.py
│   ├── logz.py
│   ├── policies.py
│   ├── utils_pg.py
│   └── value_functions.py
└── vpg/
    ├── README.md
    ├── bash_scripts/
    │   ├── CartPole-v0.sh
    │   ├── Pendulum-v0.sh
    │   ├── halfcheetah.sh
    │   ├── hopper.sh
    │   └── walker.sh
    ├── main.py
    └── plot_learning_curves.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
__pycache__
*.pyc
*.swp
*.swo
.DS_Store

vpg/outputs/*/*/a.diff

bc/data/*
bc/expert_data/*


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2017 Daniel Seita

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
Note: this repository has not been updated in years and I don't have plans to do so. I recommend using rlpyt for future RL research.

# Reinforcement Learning Algorithms

I will use this repository to implement various reinforcement learning
algorithms (also imitation learning), because I'm sick of reading about them but
not really understanding them.  Hence, hopefully this repository will help me
understand them better. I will also implement various supporting code as needed,
such as for simple custom scenarios like GridWorld. Or I can use OpenAI gym.
Click on the links to get to the appropriate algorithms. Each sub-directory will
have its own READMEs with results there, along with usage instructions.

Here are the algorithms currently implemented or in progress:

- [Q-Learning, tabular version](https://github.com/DanielTakeshi/rl_algorithms/tree/master/q_learning) (should be correct)
- [G-Learning](https://github.com/DanielTakeshi/rl_algorithms/tree/master/g_learning) (in progress ...)
- [Behavioral Cloning](https://github.com/DanielTakeshi/rl_algorithms/tree/master/bc) (should be correct)
- [Natural Evolution Strategies](https://github.com/DanielTakeshi/rl_algorithms/tree/master/es) (should be correct)
- [Deep-Q Networks](https://github.com/DanielTakeshi/rl_algorithms/tree/master/dqn) (should be correct)
- [Vanilla Policy Gradients](https://github.com/DanielTakeshi/rl_algorithms/tree/master/vpg) (should be correct)
- [Deep Deterministic Policy Gradients](https://github.com/DanielTakeshi/rl_algorithms/tree/master/ddpg) (in progress ...)
- [Trust Region Policy Optimization](https://github.com/DanielTakeshi/rl_algorithms/tree/master/trpo) (in progress ...)

Note: "Vanilla Policy Gradients" refers to the REINFORCE algorithm, also known
as Monte Carlo Policy Gradient. Sometimes it's called an actor-critic method
and other times it's not. Even if it's considered an actor-critic method, the
usual way we think of actor-critic involves a TD update rather than waiting
until the end of an episode to get returns.
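
To make the distinction concrete, here is a minimal sketch (not code from this repository) contrasting the Monte Carlo returns that REINFORCE waits for with the one-step TD targets that the usual actor-critic update uses. The function names, the discount `gamma`, and the critic estimates in `values` are all illustrative assumptions.

```python
def monte_carlo_returns(rewards, gamma):
    """Discounted return from every time step; needs the full episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_targets(rewards, values, gamma):
    """One-step targets r_t + gamma * V(s_{t+1}); V is taken as 0 after the last step."""
    next_values = list(values[1:]) + [0.0]
    return [r + gamma * v for r, v in zip(rewards, next_values)]


rewards = [1.0, 1.0, 1.0]
values = [0.0, 2.0, 0.5]  # hypothetical critic estimates V(s_t)
mc = monte_carlo_returns(rewards, gamma=0.5)
td = td_targets(rewards, values, gamma=0.5)
```

The Monte Carlo version can only be computed once the episode is over, while each TD target needs just the next state's value estimate.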


# Requirements

Right now the code is designed for Python 2.7, but it *should* be compatible
with Python 3.5+. The one possible exception is the bash scripts, which call
`python` and may not pick up the Python version you intend.

In short:

- Python 2.7.x
- Tensorflow 1.2.0


# GPU and TensorFlow 

(Update 06/16/17, these are out of date ... just install with pip and preferably
virtualenv. It's so much easier.)

I installed TensorFlow 1.0.1 from source.  For the configuration script, I used
CUDA 8.0, cuDNN 5.1.5, and compute capability 6.1.

Compiling from source means I can get faster CPU instructions. This requires
`bazel` plus extra compiler options. I used:

```
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
```

This resulted in a ton of warning messages, but I ended up with:

```
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 884.276s, Critical Path: 672.19s
```

and things seem to be working. Then run the command:

```
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
```

This produces a wheel, which we then install with pip. Be careful about which
pip you use: the one from Anaconda or the one from the default Python. I use
Anaconda. And make sure you're not in either the `tensorflow` or the `bazel`
directories!

Track the GPU usage with `nvidia-smi`. Unfortunately, that only reports a
single snapshot, but we can instead run:

```
while true; do nvidia-smi --query-gpu=utilization.gpu --format=csv >> gpu_utilization.log; sleep 10; done;
```

Or something like that. It will record the output every 10 seconds and dump it
into the log file. Ideally, GPU usage should be as high as possible (100% or
close to it).
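
As a rough sketch of what to do with that log afterwards, the helper below averages the readings. It assumes each sample appears on its own line in the form `37 %` and that any other line (such as the CSV header) can be skipped; `mean_gpu_utilization` is a hypothetical name, not part of this repository.

```python
def mean_gpu_utilization(log_text):
    """Average the percentage readings in the log text, or None if there are none."""
    readings = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.endswith('%'):
            continue  # header or blank line
        try:
            readings.append(float(line[:-1].strip()))
        except ValueError:
            continue  # ends with '%' but is not a number
    if not readings:
        return None
    return sum(readings) / len(readings)


# Made-up sample in the assumed format: a header line, then a reading.
sample = "utilization.gpu [%]\n90 %\nutilization.gpu [%]\n100 %\n"
average = mean_gpu_utilization(sample)
```

To use it on a real run, read `gpu_utilization.log` into a string and pass that string to the function.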

# References

I have read a number of reinforcement learning papers to help me out. A list of
papers, with summaries for a few of them, is [in my paper notes
repository](https://github.com/DanielTakeshi/Paper_Notes).


================================================
FILE: bc/README.md
================================================
# Behavioral Cloning

## Main Idea

This runs Behavioral Cloning (BC) on MuJoCo environments, with settings inspired
by the NIPS 2016 [GAIL paper][1]. Specifically:

- The expert is TRPO and provided from Jonathan Ho (see below).

- The dataset maps inputs (states=s) to labels (actions=a). Both are
  continuous, so I minimize the mean L2 loss across minibatches. The data is
  split into 70% training, 30% validation.

- Expert data size is measured in terms of the number of expert rollouts we
  collected. Note that, like the GAIL paper, I *subsample*, so the actual amount
  of (s,a) pairs available for BC is much smaller.

- The neural network (from TensorFlow) is fully connected and has two hidden
  layers of 100 units each with hyperbolic tangent non-linearities. It's trained
  with Adam, with a step size of 1e-3, and has a batch size of 128.

- For now, I plot validation set performance (i.e. loss) without really using
  it. If I needed to be formal and had to pick a given iteration for which to
  choose my BC expert (because different iterations mean different weights) I'd
  choose the one with best validation set performance. I also plot training set
  performance just for kicks.
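
The data handling described above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not the code in `bc.py`: the helper names are hypothetical, the arrays are synthetic, and the expert data is assumed to be padded to shape `(num_traj, max_steps, dim)` with the true lengths in `steps`.

```python
import numpy as np


def subsample_and_split(obs, act, steps, subsamp_freq, train_frac, rng):
    """Subsample each trajectory at a random offset, pool, then split train/valid."""
    obs_l, act_l = [], []
    for i in range(obs.shape[0]):
        start = rng.randint(0, subsamp_freq)  # random offset in [0, subsamp_freq)
        # Slicing stops at steps[i], so padded entries are dropped.
        obs_l.append(obs[i, start:steps[i]:subsamp_freq])
        act_l.append(act[i, start:steps[i]:subsamp_freq])
    obs_all = np.concatenate(obs_l, axis=0)
    act_all = np.concatenate(act_l, axis=0)
    inds = rng.permutation(obs_all.shape[0])
    num_train = int(train_frac * len(inds))
    tr, va = inds[:num_train], inds[num_train:]
    return obs_all[tr], act_all[tr], obs_all[va], act_all[va]


def mean_l2_loss(pred, target):
    """Squared L2 error per example, averaged over the minibatch."""
    return np.mean(np.sum((pred - target) ** 2, axis=1))


rng = np.random.RandomState(0)
obs = rng.randn(4, 100, 11)  # synthetic stand-in for expert observations
act = rng.randn(4, 100, 3)   # synthetic stand-in for expert actions
steps = [100, 100, 80, 100]  # third trajectory ended early
obs_tr, act_tr, obs_va, act_va = subsample_and_split(obs, act, steps, 20, 0.7, rng)
```

With a subsampling frequency of 20, four rollouts of at most 100 steps leave only 19 (s,a) pairs in this toy example, which shows how small the BC dataset becomes after subsampling.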


## Running the Code

To run BC, there are several steps:

- (If needed) generate expert data. Run

  ```
  ./bash_scripts/gen_exp_data.sh
  ```

  which will run and save expert trajectories in numpy arrays. They're not saved
  in the repository (ask if you want my version). By default, the number of
  trajectories is saved into the file name and matches the values in the GAIL
  paper (see Table 1). No subsampling is done at this stage.
  
- See the bash scripts for examples of running BC. For these, I used one script
  to run everything.

- To plot the results, it's simple: `python plot_bc.py`. No command line arguments!
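
Once a run finishes, its log directory holds `iters.npy`, `val_loss.npy`, and a `snapshots/` folder of weight vectors written by `bc.py`. Below is a hedged sketch of picking the snapshot with the lowest validation loss. `best_snapshot_name` is a hypothetical helper that is not in the repository, and the file-name pattern is inferred from how `bc.py` zero-pads the iteration number.

```python
def best_snapshot_name(iters, val_loss, train_iters):
    """Name of the snapshot file whose evaluation point had the lowest validation loss."""
    best = min(range(len(val_loss)), key=lambda k: val_loss[k])
    width = len(str(train_iters))  # same zero-padding width that bc.py uses
    return 'weights_' + str(iters[best]).zfill(width) + '.npy'


# Toy values standing in for the contents of iters.npy and val_loss.npy.
name = best_snapshot_name([0, 100, 200], [0.9, 0.2, 0.4], train_iters=20001)
```

In practice the two lists would come from loading `iters.npy` and `val_loss.npy` in the run's log directory.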


If you're interested:

- The expert performance that I'm seeing is roughly similar to what's reported
  in the GAIL paper, with the exception of the Walker environment, but I may be
  running a newer version from Jonathan Ho. See the output in
  `random_logs/gen_exp_data.text` for details.

- The `bash_scripts` directory also contains a file called `demo.bash`, which
  you can use to visualize expert trajectories, just for fun.


# Results

Many subplots have three curves, each with a one-standard-deviation error
region. The reason is that for a given BC run, at each "evaluation point" (e.g.
every 50 training minibatches) I run some "test-time" rollouts to see
performance, but I also wanted to test with different initializations, hence the
different random seeds.
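
The error regions can be sketched as follows, assuming the returns for one run are stored as an array of shape `(num_eval_points, num_test_rollouts)`. `return_band` is a hypothetical helper name and the numbers are made up.

```python
import numpy as np


def return_band(returns):
    """Mean return per evaluation point, plus the one-standard-deviation band."""
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean(axis=1)
    std = returns.std(axis=1)  # population standard deviation, NumPy's default
    return mean, mean - std, mean + std


# Two evaluation points with two test rollouts each (made-up numbers).
mean, lower, upper = return_band([[1.0, 3.0], [10.0, 10.0]])
```

The `lower` and `upper` arrays can then be passed to matplotlib's `fill_between` with a small `alpha` to shade the region around the mean curve.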

Ant-v1, HalfCheetah-v1, Hopper-v1, and Walker2d-v1 use 4, 11, 18, and 25 expert
rollouts since that follows the GAIL paper. Reacher-v1 uses 4, 11, and 18.
Humanoid-v1 uses 80, 160, and 240 expert trajectories.
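
For reference, a small sketch of the expert data files these settings imply. The rollout counts are taken from `bash_scripts/gen_exp_data.sh` and the path pattern is inferred from how `bc.py` loads its data. `ROLLOUTS` and `expert_data_path` are illustrative names that do not exist in the repository.

```python
# Rollout counts per environment, as listed in gen_exp_data.sh.
ROLLOUTS = {
    'Ant-v1': [4, 11, 18, 25],
    'HalfCheetah-v1': [4, 11, 18, 25],
    'Hopper-v1': [4, 11, 18, 25],
    'Walker2d-v1': [4, 11, 18, 25],
    'Reacher-v1': [4, 11, 18],
    'Humanoid-v1': [80, 160, 240],
}


def expert_data_path(envname, num_rollouts):
    """Path of the expert data file, with the rollout count zero-padded to 3 digits."""
    return 'expert_data/' + envname + '_' + str(num_rollouts).zfill(3) + '.npy'


paths = [expert_data_path(e, n) for e in sorted(ROLLOUTS) for n in ROLLOUTS[e]]
```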

Observations:

- BC does well in Ant-v1 and HalfCheetah-v1. 

- Hopper-v1 seems to be difficult, surprisingly. It has a relatively small state
  space compared to Ant (11 vs 111).

- Walker2d-v1 seems to be in between: BC doesn't get going until 25 rollouts.

- How am I doing so poorly on Humanoid-v1?

## Ant-v1

![ant](figures/Ant-v1.png?raw=true)

## HalfCheetah-v1

![halfcheetah](figures/HalfCheetah-v1.png?raw=true)

## Hopper-v1

![hopper](figures/Hopper-v1.png?raw=true)

## Walker2d-v1

![walker2d](figures/Walker2d-v1.png?raw=true)

## Reacher-v1

![reacher](figures/Reacher-v1.png?raw=true)

## Humanoid-v1

![humanoid](figures/Humanoid-v1.png?raw=true)


# Original Notes from Berkeley

This started from UC Berkeley's Deep Reinforcement Learning class. Here's their
information:

> Dependencies: TensorFlow, MuJoCo version 1.31, OpenAI Gym
> 
> The only file that you need to look at is `run_expert.py`, which is code to
> load up an expert policy, run a specified number of roll-outs, and save out
> data.
> 
> In `experts/`, the provided expert policies are:
> * Ant-v1.pkl
> * HalfCheetah-v1.pkl
> * Hopper-v1.pkl
> * Humanoid-v1.pkl
> * Reacher-v1.pkl
> * Walker2d-v1.pkl
> 
> The name of the pickle file corresponds to the name of the gym environment.

[1]:https://arxiv.org/abs/1606.03476


================================================
FILE: bc/bash_scripts/demo.bash
================================================
#!/bin/bash
set -eux
for e in Hopper-v1 Ant-v1 HalfCheetah-v1 Humanoid-v1 Reacher-v1 Walker2d-v1
do
    python run_expert.py experts/$e.pkl $e --render --num_rollouts=1
done


================================================
FILE: bc/bash_scripts/gen_exp_data.sh
================================================
python run_expert.py experts/Reacher-v1.pkl Reacher-v1 --save --num_rollouts 4
python run_expert.py experts/Reacher-v1.pkl Reacher-v1 --save --num_rollouts 11
python run_expert.py experts/Reacher-v1.pkl Reacher-v1 --save --num_rollouts 18

python run_expert.py experts/HalfCheetah-v1.pkl HalfCheetah-v1 --save --num_rollouts 4
python run_expert.py experts/HalfCheetah-v1.pkl HalfCheetah-v1 --save --num_rollouts 11
python run_expert.py experts/HalfCheetah-v1.pkl HalfCheetah-v1 --save --num_rollouts 18
python run_expert.py experts/HalfCheetah-v1.pkl HalfCheetah-v1 --save --num_rollouts 25

python run_expert.py experts/Hopper-v1.pkl Hopper-v1 --save --num_rollouts 4
python run_expert.py experts/Hopper-v1.pkl Hopper-v1 --save --num_rollouts 11
python run_expert.py experts/Hopper-v1.pkl Hopper-v1 --save --num_rollouts 18
python run_expert.py experts/Hopper-v1.pkl Hopper-v1 --save --num_rollouts 25

python run_expert.py experts/Walker2d-v1.pkl Walker2d-v1 --save --num_rollouts 4
python run_expert.py experts/Walker2d-v1.pkl Walker2d-v1 --save --num_rollouts 11
python run_expert.py experts/Walker2d-v1.pkl Walker2d-v1 --save --num_rollouts 18
python run_expert.py experts/Walker2d-v1.pkl Walker2d-v1 --save --num_rollouts 25

python run_expert.py experts/Ant-v1.pkl Ant-v1 --save --num_rollouts 4
python run_expert.py experts/Ant-v1.pkl Ant-v1 --save --num_rollouts 11
python run_expert.py experts/Ant-v1.pkl Ant-v1 --save --num_rollouts 18
python run_expert.py experts/Ant-v1.pkl Ant-v1 --save --num_rollouts 25

python run_expert.py experts/Humanoid-v1.pkl Humanoid-v1 --save --num_rollouts 80
python run_expert.py experts/Humanoid-v1.pkl Humanoid-v1 --save --num_rollouts 160
python run_expert.py experts/Humanoid-v1.pkl Humanoid-v1 --save --num_rollouts 240


================================================
FILE: bc/bash_scripts/runbc_allmujoco.sh
================================================
#!/bin/bash
set -eux
for e in 4 11 18 25; do
    for s in 0 1 2; do
        python bc.py Ant-v1         $e --test_rollouts 50 --eval_freq 100 --train_iters 20001 --seed $s --subsamp_freq 20
        python bc.py HalfCheetah-v1 $e --test_rollouts 50 --eval_freq 100 --train_iters 20001 --seed $s --subsamp_freq 20
        python bc.py Hopper-v1      $e --test_rollouts 50 --eval_freq 100 --train_iters 20001 --seed $s --subsamp_freq 20
        python bc.py Walker2d-v1    $e --test_rollouts 50 --eval_freq 100 --train_iters 20001 --seed $s --subsamp_freq 20 
    done
done
for e in 4 11 18; do
    for s in 0 1 2; do
        python bc.py Reacher-v1     $e --test_rollouts 50 --eval_freq 100 --train_iters 20001 --seed $s --subsamp_freq 1
    done
done
for e in 80 160 240; do
    for s in 0 1 2; do
        python bc.py Humanoid-v1    $e --test_rollouts 50 --eval_freq 100 --train_iters 20001 --seed $s --subsamp_freq 20
    done
done


================================================
FILE: bc/bc.py
================================================
"""
(c) June 2017 by Daniel Seita

Behavioral cloning (continuous actions only).  For results, see the README(s)
nearby.

    TODO right now we assume we'll get our minibatches of data with `get_batch`
    but this is inefficient if we decide to scale up and avoid subsampling the
    data, where it would be better to have a list which supplies fixed,
    pre-computed minibatches. I should fix this later.

    TODO handle l2 regularization? Though I have found that this doesn't have as
    good an effect as I thought it would ...

    TODO maybe save the weights with best validation set performance, along with
    weights every 500 iterations or so? Then we can visualize it.
"""

import argparse
import gym
import matplotlib.pyplot as plt
import numpy as np
import os
import pickle
import sys
import tensorflow as tf
import tensorflow.contrib.layers as layers
import tf_util
plt.style.use('seaborn-darkgrid')
np.set_printoptions(edgeitems=100, linewidth=100, suppress=True)


def get_tf_session():
    """ Returning a session. Set options here if desired. """
    tf.reset_default_graph()
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
    # Use a single config so the thread limits and the GPU options both apply.
    tf_config = tf.ConfigProto(inter_op_parallelism_threads=1,
                               intra_op_parallelism_threads=1,
                               gpu_options=gpu_options)
    session = tf.Session(config=tf_config)

    def get_available_gpus():
        from tensorflow.python.client import device_lib
        local_device_protos = device_lib.list_local_devices()
        return [x.physical_device_desc for x in local_device_protos if x.device_type == 'GPU']

    print("AVAILABLE GPUS: ", get_available_gpus())
    return session


def load_dataset(args):
    """ Loads the dataset for BC and return training and validation splits,
    separating the observations and actions, along with observation and action
    shapes.

    This is also where we should handle the case of varying-length expert
    trajectories, even though these should be rare. (I kept those there so that
    the leading dimension is the number of trajectories, in case we want to use
    that information somehow, but right now we just mix among trajectories.) In
    addition, it might be useful to subsample the data.
    """
    # Load the numpy file and parse it.
    str_roll = str(args.num_rollouts).zfill(3)
    # The file stores a pickled dict, which newer NumPy only loads on request.
    expert_data = np.load('expert_data/'+args.envname+'_'+str_roll+'.npy',
                          allow_pickle=True)
    expert_obs = expert_data[()]['observations']
    expert_act = expert_data[()]['actions']
    expert_ret = expert_data[()]['returns']
    expert_stp = expert_data[()]['steps']
    N = expert_obs.shape[0]
    assert N == expert_act.shape[0] == len(expert_ret) == len(expert_stp)
    obs_shape = expert_obs.shape[2]
    act_shape = expert_act.shape[2]
    print("\nobs_shape = {}\nact_shape = {}".format(obs_shape, act_shape))
    print("subsampling freq = {}".format(args.subsamp_freq))
    print("expert_steps = {}".format(expert_stp))
    print("expert_returns = {}".format(expert_ret))
    print("mean(expert_returns) = {}".format(np.mean(expert_ret))) # remember!
    print("(raw) expert_obs.shape = {}".format(expert_obs.shape))
    print("(raw) expert_act.shape = {}".format(expert_act.shape))

    # Choose a different starting point to subsample for each trajectory.
    start_indices = np.random.randint(0, args.subsamp_freq, N)
    
    # Subsample expert data, remove actions which were only for padding.
    expert_obs_l = []
    expert_act_l = []
    for i in range(N):
        expert_obs_l.append(
            expert_obs[i, start_indices[i]:expert_stp[i]:args.subsamp_freq, :]
        )
        expert_act_l.append(
            expert_act[i, start_indices[i]:expert_stp[i]:args.subsamp_freq, :]
        )

    # Concatenate everything together.
    expert_obs = np.concatenate(expert_obs_l, axis=0)
    expert_act = np.concatenate(expert_act_l, axis=0)
    print("(subsampled/reshaped) expert_obs.shape = {}".format(expert_obs.shape))
    print("(subsampled/reshaped) expert_act.shape = {}".format(expert_act.shape))
    assert expert_obs.shape[0] == expert_act.shape[0]

    # Finally, form training and validation splits.
    num_examples = expert_obs.shape[0]
    num_train = int(args.train_frac * num_examples)
    shuffled_inds = np.random.permutation(num_examples)
    train_inds, valid_inds = shuffled_inds[:num_train], shuffled_inds[num_train:]
    expert_obs_tr  = expert_obs[train_inds]
    expert_act_tr  = expert_act[train_inds]
    expert_obs_val = expert_obs[valid_inds]
    expert_act_val = expert_act[valid_inds]
    print("\n(train) expert_obs.shape = {}".format(expert_obs_tr.shape))
    print("(train) expert_act.shape = {}".format(expert_act_tr.shape))
    print("(valid) expert_obs.shape = {}".format(expert_obs_val.shape))
    print("(valid) expert_act.shape = {}\n".format(expert_act_val.shape))

    return (expert_obs_tr, expert_act_tr, expert_obs_val, expert_act_val, \
            obs_shape, act_shape)


def policy_model(data_in, action_dim):
    """ Create a neural network representing the BC policy. It will be trained
    using standard supervised learning techniques.
    
    Parameters
    ----------
    data_in: [Tensor]
        The input (a placeholder) to the network, with leading dimension
        representing the batch size.
    action_dim: [int]
        Number of actions, each of which (at least for MuJoCo) is
        continuous-valued.

    Returns
    ------- 
    out [Tensor]
        The output tensor which represents the predicted (or desired, if
        testing) action to take for the agent.
    """
    with tf.variable_scope("BCNetwork", reuse=False):
        out = data_in
        out = layers.fully_connected(out, num_outputs=100,
                weights_initializer=layers.xavier_initializer(uniform=True),
                activation_fn=tf.nn.tanh)
        out = layers.fully_connected(out, num_outputs=100,
                weights_initializer=layers.xavier_initializer(uniform=True),
                activation_fn=tf.nn.tanh)
        out = layers.fully_connected(out, num_outputs=action_dim,
                weights_initializer=layers.xavier_initializer(uniform=True),
                activation_fn=None)
        return out


def get_batch(expert_obs, expert_act, batch_size):
    """ 
    Obtain a minibatch of samples. Note that this is relatively inefficient, and
    if dealing with very large datasets without subsampling, use a list of
    samples instead. 
    """
    indices = np.arange(expert_obs.shape[0])
    np.random.shuffle(indices)
    xs = expert_obs[indices[:batch_size]]
    ys = expert_act[indices[:batch_size]]
    return xs, ys


def run_bc(session, args, log_dir):
    """ Runs behavioral cloning on some stored data.

    It roughly mirrors the experimental setup of [Ho & Ermon, NIPS 2016]. They
    trained using ADAM (batch size 128) on 70% of the data and trained until
    validation error on the held-out set of 30% no longer decreases. They also
    substantially subsampled their data.

    Parameters
    ----------
    session: [TF Session]
        The TensorFlow session we're using.
    args: [Arguments Namespace]
        Namespace holding the user's command line arguments.
    log_dir: [string]
        Where we save files to. FYI, it doesn't include the ending slash.
    """
    env = gym.make(args.envname)
    (expert_obs_tr, expert_act_tr, expert_obs_val, expert_act_val, obs_shape, \
            act_shape) = load_dataset(args)

    # Build the data and network. For now, no casting (see DQN code).
    x = tf.placeholder(tf.float32, shape=[None,obs_shape])
    y = tf.placeholder(tf.float32, shape=[None,act_shape])
    policy_fn = policy_model(data_in=x, action_dim=act_shape)

    # Save weights as a single vector to make saving/loading easy.
    weights_bc = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='BCNetwork')
    weight_vector = tf.concat([tf.reshape(w, [-1]) for w in weights_bc], axis=0)

    # Construct the loss function and training information.
    l2_loss = tf.reduce_mean(
        tf.reduce_sum((policy_fn-y)*(policy_fn-y), axis=[1])
    )
    train_step = tf.train.AdamOptimizer(args.lrate).minimize(l2_loss)

    all_tr_loss = []
    all_val_loss = []
    all_iters = [] # Makes plotting easier since these are the x-coords.
    all_returns = [] # Will turn into an array of arrays later.
    session.run(tf.global_variables_initializer())

    for i in range(args.train_iters):
        b_xs, b_ys = get_batch(expert_obs_tr, expert_act_tr, args.batch_size)
        _,tr_loss = session.run([train_step, l2_loss], feed_dict={x:b_xs, y:b_ys})

        if (i % args.eval_freq == 0):
            # Only save/evaluate stuff every `args.eval_freq` iterations.
            val_loss = session.run(l2_loss, feed_dict={x:expert_obs_val, y:expert_act_val})
            returns = run_bc_test(args, session, policy_fn, x, env)
            print("iter={}   tr_loss={:.5f}   val_loss={:.5f}".format(
                str(i).zfill(4), tr_loss, val_loss))
            print("mean(returns): {}\nstd(returns): {}\n".format(
                    np.mean(returns), np.std(returns)))
            all_iters.append(i)
            all_tr_loss.append(tr_loss)
            all_val_loss.append(val_loss)
            all_returns.append(returns)

            # Save snapshot of the current weights. We can pick out the best one
            # by seeing the minimizing index in `all_val_loss`.
            itr = str(i).zfill(len(str(abs(args.train_iters))))
            weights_numpy = session.run(weight_vector) 
            np.save(log_dir+'/snapshots/weights_'+itr, weights_numpy)

    # Store the results as numpy arrays so we can easily plot later.
    np.save(log_dir +"/iters", np.array(all_iters))
    np.save(log_dir +"/tr_loss", np.array(all_tr_loss))
    np.save(log_dir +"/val_loss", np.array(all_val_loss))
    np.save(log_dir +"/returns", np.array(all_returns))


def run_bc_test(args, session, policy_fn, x, env):
    """ Run the agent in the world! 
    
    Returns
    -------
    returns [list]
        A list of returns, one for each of the `args.test_rollouts` rollouts.
    """
    actions = []
    observations = []
    returns = []
    max_steps = env.spec.timestep_limit

    for rr in range(args.test_rollouts):
        obs = env.reset()
        done = False
        totalr = 0
        steps = 0
        while not done:
            # Take steps by expanding observation (to get shapes to match).
            exp_obs = np.expand_dims(obs, axis=0)
            action = np.squeeze(session.run(policy_fn, feed_dict={x:exp_obs}))
            obs, r, done, _ = env.step(action)
            totalr += r
            steps += 1
            if args.render: env.render()
            if steps >= max_steps: break
        returns.append(totalr)

    return returns


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('envname', type=str)
    parser.add_argument('num_rollouts', type=str)
    parser.add_argument('--batch_size', type=int, default=128)
    parser.add_argument('--eval_freq', type=int, default=100)
    parser.add_argument('--lrate', type=float, default=0.001)
    parser.add_argument('--regu', type=float, default=0.0) # don't use now
    parser.add_argument('--seed', type=int, default=0)
    parser.add_argument('--subsamp_freq', type=int, default=20)
    parser.add_argument('--test_rollouts', type=int, default=50) # GAIL paper used 50
    parser.add_argument('--train_frac', type=float, default=0.7)
    parser.add_argument('--train_iters', type=int, default=5001) # GAIL paper used 20001
    parser.add_argument('--render', action='store_true') # don't use now
    args = parser.parse_args()
    print("\nUsing the following arguments: {}".format(args))

    # Handle some logic with the log file and save the args there.
    log_dir = "logs/"+args.envname+"/numroll_"+args.num_rollouts+"_seed_"+str(args.seed)
    print("log_dir: {}\n".format(log_dir))
    assert not os.path.exists(log_dir), "Error: log_dir already exists!"
    os.makedirs(log_dir)
    os.makedirs(log_dir+'/snapshots/')
    with open(log_dir+'/args.pkl','wb') as f: # binary mode, needed on Python 3
        pickle.dump(args, f)

    # Create a session, handle random seeds (well, partly...) and run.
    session = get_tf_session()
    np.random.seed(args.seed)
    tf.set_random_seed(args.seed)
    run_bc(session, args, log_dir)


================================================
FILE: bc/load_policy.py
================================================
import pickle, tensorflow as tf, tf_util, numpy as np

def load_policy(filename):
    with open(filename, 'rb') as f:
        data = pickle.loads(f.read())

    # assert len(data.keys()) == 2
    nonlin_type = data['nonlin_type']
    policy_type = [k for k in data.keys() if k != 'nonlin_type'][0]

    assert policy_type == 'GaussianPolicy', 'Policy type {} not supported'.format(policy_type)
    policy_params = data[policy_type]

    assert set(policy_params.keys()) == {'logstdevs_1_Da', 'hidden', 'obsnorm', 'out'}

    # Keep track of input and output dims (i.e. observation and action dims) for the user

    def build_policy(obs_bo):
        def read_layer(l):
            assert list(l.keys()) == ['AffineLayer']
            assert sorted(l['AffineLayer'].keys()) == ['W', 'b']
            return l['AffineLayer']['W'].astype(np.float32), l['AffineLayer']['b'].astype(np.float32)

        def apply_nonlin(x):
            if nonlin_type == 'lrelu':
                return tf_util.lrelu(x, leak=.01) # openai/imitation nn.py:233
            elif nonlin_type == 'tanh':
                return tf.tanh(x)
            else:
                raise NotImplementedError(nonlin_type)

        # Build the policy. First, observation normalization.
        assert list(policy_params['obsnorm'].keys()) == ['Standardizer']
        obsnorm_mean = policy_params['obsnorm']['Standardizer']['mean_1_D']
        obsnorm_meansq = policy_params['obsnorm']['Standardizer']['meansq_1_D']
        obsnorm_stdev = np.sqrt(np.maximum(0, obsnorm_meansq - np.square(obsnorm_mean)))
        print('obs', obsnorm_mean.shape, obsnorm_stdev.shape)
        normedobs_bo = (obs_bo - obsnorm_mean) / (obsnorm_stdev + 1e-6) # 1e-6 constant from Standardizer class in nn.py:409 in openai/imitation

        curr_activations_bd = normedobs_bo

        # Hidden layers next
        assert list(policy_params['hidden'].keys()) == ['FeedforwardNet']
        layer_params = policy_params['hidden']['FeedforwardNet']
        for layer_name in sorted(layer_params.keys()):
            l = layer_params[layer_name]
            W, b = read_layer(l)
            curr_activations_bd = apply_nonlin(tf.matmul(curr_activations_bd, W) + b)

        # Output layer
        W, b = read_layer(policy_params['out'])
        output_bo = tf.matmul(curr_activations_bd, W) + b
        return output_bo

    obs_bo = tf.placeholder(tf.float32, [None, None])
    a_ba = build_policy(obs_bo)
    policy_fn = tf_util.function([obs_bo], a_ba)
    return policy_fn

================================================
FILE: bc/plot_bc.py
================================================
"""
(c) April 2017 by Daniel Seita

Code for plotting behavioral cloning. No need to use command line arguments,
just run `python plot_bc.py`. Easy! Right now it generates two figures per
environment, one with validation set losses and the other with returns. The
latter is probably more interesting.
"""

import argparse
import gym
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
import os
import pickle
import sys
np.set_printoptions(edgeitems=100, linewidth=100, suppress=True)

# Some matplotlib settings.
plt.style.use('seaborn-darkgrid')
error_region_alpha = 0.25
LOGDIR = 'logs/'
FIGDIR = 'figures/'
title_size = 22
tick_size = 17
legend_size = 17
ysize = 18
xsize = 18
lw = 3
ms = 8
colors = ['red', 'blue', 'yellow', 'black']


def plot_bc_modern(edir):
    """ Plot the results for this particular environment. """
    subdirs = os.listdir(LOGDIR+edir)
    print("plotting subdirs {}".format(subdirs))

    # Make it easy to count how many of each numrollouts we have.
    R_TO_COUNT = {'4':0, '11':0, '18':0, '25':0}
    R_TO_IJ = {'4':(0,2), '11':(1,0), '18':(1,1), '25':(1,2)}

    fig,axarr = plt.subplots(2, 3, figsize=(24,15))
    axarr[0,2].set_title(edir+", Returns, 4 Rollouts", fontsize=title_size)
    axarr[1,0].set_title(edir+", Returns, 11 Rollouts", fontsize=title_size)
    axarr[1,1].set_title(edir+", Returns, 18 Rollouts", fontsize=title_size)
    axarr[1,2].set_title(edir+", Returns, 25 Rollouts", fontsize=title_size)

    # Don't forget to plot the expert performance! The expert data files hold
    # pickled dicts, and newer NumPy versions default to `allow_pickle=False`,
    # so we have to ask for pickles explicitly.
    exp04 = np.mean(np.load("expert_data/"+edir+"_004.npy", allow_pickle=True)[()]['returns'])
    exp11 = np.mean(np.load("expert_data/"+edir+"_011.npy", allow_pickle=True)[()]['returns'])
    exp18 = np.mean(np.load("expert_data/"+edir+"_018.npy", allow_pickle=True)[()]['returns'])
    axarr[0,2].axhline(y=exp04, color='brown', lw=lw, linestyle='--', label='expert')
    axarr[1,0].axhline(y=exp11, color='brown', lw=lw, linestyle='--', label='expert')
    axarr[1,1].axhline(y=exp18, color='brown', lw=lw, linestyle='--', label='expert')
    if 'Reacher' not in edir:
        exp25 = np.mean(np.load("expert_data/"+edir+"_025.npy", allow_pickle=True)[()]['returns'])
        axarr[1,2].axhline(y=exp25, color='brown', lw=lw, linestyle='--', label='expert')

    for dd in subdirs:
        ddsplit = dd.split("_") # `dd` is of the form `numroll_X_seed_Y`
        numroll, seed = ddsplit[1], ddsplit[3]
        xcoord   = np.load(LOGDIR+edir+"/"+dd+"/iters.npy")
        tr_loss  = np.load(LOGDIR+edir+"/"+dd+"/tr_loss.npy")
        val_loss = np.load(LOGDIR+edir+"/"+dd+"/val_loss.npy")
        returns  = np.load(LOGDIR+edir+"/"+dd+"/returns.npy")
        mean_ret = np.mean(returns, axis=1)
        std_ret  = np.std(returns, axis=1)

        # Look up this run's subplot and give it the next unused color there.
        ijcoord = R_TO_IJ[numroll]
        cc = colors[ R_TO_COUNT[numroll] ]
        R_TO_COUNT[numroll] += 1

        axarr[ijcoord].plot(xcoord, mean_ret, lw=lw, color=cc, label=dd)
        axarr[ijcoord].fill_between(xcoord, 
                mean_ret-std_ret,
                mean_ret+std_ret,
                alpha=error_region_alpha,
                facecolor=cc)

        # Cram the training and validation losses on these subplots.
        axarr[0,0].plot(xcoord, tr_loss, lw=lw, label=dd)
        axarr[0,1].plot(xcoord, val_loss, lw=lw, label=dd)

    boring_stuff(axarr, edir)
    plt.tight_layout()
    plt.savefig(FIGDIR+edir+".png")


def plot_bc_humanoid(edir):
    """ Plots humanoid. The argument here is kind of redundant... also, I guess
    we'll have to ignore one of the plots here since Humanoid will have 5
    subplots. Yeah, it's a bit awkward.
    """ 
    assert edir == "Humanoid-v1"
    subdirs = os.listdir(LOGDIR+edir)
    print("plotting subdirs {}".format(subdirs))

    # Make it easy to count how many of each numrollouts we have.
    R_TO_COUNT = {'80':0, '160':0, '240':0}
    R_TO_IJ = {'80':(1,0), '160':(1,1), '240':(1,2)}

    fig,axarr = plt.subplots(2, 3, figsize=(24,15))
    axarr[0,2].set_title("Empty Plot", fontsize=title_size)
    axarr[1,0].set_title(edir+", Returns, 80 Rollouts", fontsize=title_size)
    axarr[1,1].set_title(edir+", Returns, 160 Rollouts", fontsize=title_size)
    axarr[1,2].set_title(edir+", Returns, 240 Rollouts", fontsize=title_size)

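    # Plot expert performance (um, this takes a while...). The expert data
    # files hold pickled dicts, and newer NumPy versions default to
    # `allow_pickle=False`, so we have to ask for pickles explicitly.
    exp080 = np.mean(np.load("expert_data/"+edir+"_080.npy", allow_pickle=True)[()]['returns'])
    exp160 = np.mean(np.load("expert_data/"+edir+"_160.npy", allow_pickle=True)[()]['returns'])
    exp240 = np.mean(np.load("expert_data/"+edir+"_240.npy", allow_pickle=True)[()]['returns'])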
    axarr[1,0].axhline(y=exp080, color='brown', lw=lw, linestyle='--', label='expert')
    axarr[1,1].axhline(y=exp160, color='brown', lw=lw, linestyle='--', label='expert')
    axarr[1,2].axhline(y=exp240, color='brown', lw=lw, linestyle='--', label='expert')

    for dd in subdirs:
        ddsplit = dd.split("_") # `dd` is of the form `numroll_X_seed_Y`
        numroll, seed = ddsplit[1], ddsplit[3]
        xcoord   = np.load(LOGDIR+edir+"/"+dd+"/iters.npy")
        tr_loss  = np.load(LOGDIR+edir+"/"+dd+"/tr_loss.npy")
        val_loss = np.load(LOGDIR+edir+"/"+dd+"/val_loss.npy")
        returns  = np.load(LOGDIR+edir+"/"+dd+"/returns.npy")
        mean_ret = np.mean(returns, axis=1)
        std_ret  = np.std(returns, axis=1)

        # Look up this run's subplot and give it the next unused color there.
        ijcoord = R_TO_IJ[numroll]
        cc = colors[ R_TO_COUNT[numroll] ]
        R_TO_COUNT[numroll] += 1

        axarr[ijcoord].plot(xcoord, mean_ret, lw=lw, color=cc, label=dd)
        axarr[ijcoord].fill_between(xcoord, 
                mean_ret-std_ret,
                mean_ret+std_ret,
                alpha=error_region_alpha,
                facecolor=cc)

        # Cram the training and validation losses on these subplots.
        axarr[0,0].plot(xcoord, tr_loss, lw=lw, label=dd)
        axarr[0,1].plot(xcoord, val_loss, lw=lw, label=dd)

    boring_stuff(axarr, edir)
    plt.tight_layout()
    plt.savefig(FIGDIR+edir+".png")


def boring_stuff(axarr, edir):
    """ Axis labels, tick sizes, titles, and legends for the 2x3 grid. """
    for i in range(2):
        for j in range(3):
            # Top-left and top-middle hold the losses; the rest hold returns.
            # This must be an if/elif/else chain, otherwise the (0,0) label
            # gets overwritten by the "Average Return" case.
            if i == 0 and j == 0:
                axarr[i,j].set_ylabel("Loss Training MBs", fontsize=ysize)
            elif i == 0 and j == 1:
                axarr[i,j].set_ylabel("Loss Validation Set", fontsize=ysize)
            else:
                axarr[i,j].set_ylabel("Average Return", fontsize=ysize)
            axarr[i,j].set_xlabel("Training Minibatches", fontsize=xsize)
            axarr[i,j].tick_params(axis='x', labelsize=tick_size)
            axarr[i,j].tick_params(axis='y', labelsize=tick_size)
            axarr[i,j].legend(loc="best", prop={'size':legend_size})
    axarr[0,0].set_title(edir+", Training Losses", fontsize=title_size)
    axarr[0,1].set_title(edir+", Validation Losses", fontsize=title_size)
    axarr[0,0].set_yscale('log')
    axarr[0,1].set_yscale('log')


def plot_bc(e):
    """ Split into cases. It makes things easier for me. """
    env_to_method = {'Ant-v1': plot_bc_modern, 
                     'HalfCheetah-v1': plot_bc_modern, 
                     'Hopper-v1': plot_bc_modern,
                     'Walker2d-v1': plot_bc_modern,
                     'Reacher-v1': plot_bc_modern,
                     'Humanoid-v1': plot_bc_humanoid}
    env_to_method[e](e)


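if __name__ == "__main__":
    # `plt.savefig` fails if the output directory is missing, so create it.
    if not os.path.exists(FIGDIR):
        os.makedirs(FIGDIR)
    env_dirs = [e for e in os.listdir(LOGDIR) if "text" not in e]
    print("Plotting with one figure per env_dirs = {}".format(env_dirs))
    for e in env_dirs:
        plot_bc(e)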


================================================
FILE: bc/random_logs/gen_exp_data.text
================================================
loading and building expert policy
('obs', (1, 11), (1, 11))
loaded and built
('roll/traj', 0)
('roll/traj', 1)
('roll/traj', 2)
('roll/traj', 3)
('steps', [50, 50, 50, 50])
('returns', [-5.0986563080722789, -2.3470820413705882, -3.4806488084021563, -5.3923959304937243])
('mean return', -4.0796957720846869)
('std of return', 1.2371610530104316)
obs.shape = (4, 50, 11)
act.shape = (4, 50, 2)
expert data has been saved.
loading and building expert policy
('obs', (1, 11), (1, 11))
loaded and built
('roll/traj', 0)
('roll/traj', 1)
('roll/traj', 2)
('roll/traj', 3)
('roll/traj', 4)
('roll/traj', 5)
('roll/traj', 6)
('roll/traj', 7)
('roll/traj', 8)
('roll/traj', 9)
('roll/traj', 10)
('steps', [50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50])
('returns', [-6.0676632781974975, -4.8437408266431676, -2.7646205850154639, -4.8883500259847468, -5.5523827036728743, -3.5066640100964652, -2.8579657682413164, -3.5256961737336221, -1.9851548296485364, -5.7610544616077641, -4.4387894971172894])
('mean return', -4.1992801963598856)
('std of return', 1.2933939982797207)
obs.shape = (11, 50, 11)
act.shape = (11, 50, 2)
expert data has been saved.
loading and building expert policy
('obs', (1, 11), (1, 11))
loaded and built
('roll/traj', 0)
('roll/traj', 1)
('roll/traj', 2)
('roll/traj', 3)
('roll/traj', 4)
('roll/traj', 5)
('roll/traj', 6)
('roll/traj', 7)
('roll/traj', 8)
('roll/traj', 9)
('roll/traj', 10)
('roll/traj', 11)
('roll/traj', 12)
('roll/traj', 13)
('roll/traj', 14)
('roll/traj', 15)
('roll/traj', 16)
('roll/traj', 17)
('steps', [50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50])
('returns', [-7.0167051921724175, -2.2971802003578921, -3.0596530187565367, -4.6337475094076872, -4.3389105427312291, -2.0122997124407345, -6.936080880634159, -3.8789676786284204, -3.0374973076118006, -5.8012000292179202, -5.282670215008328, -1.4478122702743865, -1.6729209302580386, -2.0049694364017943, -3.4791679686749157, -4.537885362741525, -6.7110718543266392, -2.6161470377728997])
('mean return', -3.9313826193009627)
('std of return', 1.7865687858709369)
obs.shape = (18, 50, 11)
act.shape = (18, 50, 2)
expert data has been saved.
loading and building expert policy
('obs', (1, 17), (1, 17))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000])
('returns', [4142.2707122049451, 4073.2926100930354, 4160.2641919997595, 4137.3904574971248])
('mean return', 4128.304492948716)
('std of return', 32.8836572283574)
obs.shape = (4, 1000, 17)
act.shape = (4, 1000, 6)
expert data has been saved.
loading and building expert policy
('obs', (1, 17), (1, 17))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [4163.926640414601, 4082.5260763172073, 4088.7890907004175, 4261.2822528361958, 4258.7193259277701, 4003.5642815073606, 4014.550963767033, 4236.2386061269835, 4110.2043680589713, 3979.6921007858841, 4082.8862028719918])
('mean return', 4116.5799917558552)
('std of return', 96.639402441900685)
obs.shape = (11, 1000, 17)
act.shape = (11, 1000, 6)
expert data has been saved.
loading and building expert policy
('obs', (1, 17), (1, 17))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [4086.4099200096416, 4131.2709022133558, 4093.6069841238532, 4224.1550004107994, 4040.8835600400148, 4205.8018352881491, 4121.3078008257926, 4059.7018278076703, 4068.9727601843761, 4158.2791893541871, 4133.9336557521283, 4248.6852156813147, 4117.3499451442194, 4105.1856269166738, 4118.7413080467604, 3999.8364477747714, 4070.2510751567406, 3998.4489992439239])
('mean return', 4110.156780776354)
('std of return', 67.104887652297236)
obs.shape = (18, 1000, 17)
act.shape = (18, 1000, 6)
expert data has been saved.
loading and building expert policy
('obs', (1, 17), (1, 17))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 18)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 19)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 20)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 21)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 22)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 23)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 24)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [4277.0649177373989, 4134.9545756917705, 4137.2293138303303, 4303.8329029398674, 4129.3451831595994, 4141.294405302373, 4113.7214986254858, 4160.1223028192289, 4097.0754966253426, 4027.5727894938377, 4076.0161230017616, 4066.6420436003768, 4110.8062524027519, 3910.0827144736527, 4301.7383387989348, 4050.0758406333521, 4244.5068451702218, 4211.9442440870207, 4198.5772637205318, 4113.6755909929007, 4058.1256463993295, 4174.9359832628352, 4110.7459681292185, 4143.0735610903803, 4032.5424608180847])
('mean return', 4133.0280905122636)
('std of return', 89.024141578474556)
obs.shape = (25, 1000, 17)
act.shape = (25, 1000, 6)
expert data has been saved.
loading and building expert policy
('obs', (1, 11), (1, 11))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000])
('returns', [3768.6945100631256, 3780.2323839436513, 3783.3528713417622, 3776.522165486092])
('mean return', 3777.200482708658)
('std of return', 5.4739381594168464)
obs.shape = (4, 1000, 11)
act.shape = (4, 1000, 3)
expert data has been saved.
loading and building expert policy
('obs', (1, 11), (1, 11))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [3776.8642751689367, 3781.2504298398662, 3780.2410426724359, 3775.3710552957577, 3779.3902980707512, 3773.8634184348703, 3770.8866723075585, 3780.4207844762936, 3781.523831445847, 3774.5681184694013, 3779.319841231219])
('mean return', 3777.6090697648124)
('std of return', 3.3513734023594219)
obs.shape = (11, 1000, 11)
act.shape = (11, 1000, 3)
expert data has been saved.
loading and building expert policy
('obs', (1, 11), (1, 11))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [3779.7050891800823, 3781.8261053421129, 3777.0348874624683, 3780.7564369429952, 3783.8023742794289, 3778.8540765735338, 3778.2838509060907, 3772.0179836318312, 3776.0436511846556, 3777.2671771759328, 3769.7932782467242, 3778.1612858500016, 3774.5197732402466, 3777.4526753302625, 3782.1303038320079, 3774.1603254054885, 3775.4472428566828, 3780.9678196906061])
('mean return', 3777.6791298406192)
('std of return', 3.5415301661767842)
obs.shape = (18, 1000, 11)
act.shape = (18, 1000, 3)
expert data has been saved.
loading and building expert policy
('obs', (1, 11), (1, 11))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 18)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 19)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 20)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 21)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 22)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 23)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 24)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [3778.8392263167784, 3779.5006762211756, 3775.2079612199959, 3775.2461797134188, 3786.7071619918361, 3774.4111189507835, 3781.201699112346, 3775.7413417770244, 3784.7917817859939, 3775.7521896648682, 3782.0005480505479, 3774.0474578427702, 3777.5763692084424, 3778.0439539881991, 3779.5541702189857, 3778.9263420042721, 3770.50841808619, 3770.0953559770523, 3776.9041194485176, 3778.243240343334, 3775.3263028992428, 3779.0318634723531, 3775.5486034283458, 3777.3872833124879, 3773.873717042653])
('mean return', 3777.3786832831042)
('std of return', 3.7442553033205863)
obs.shape = (25, 1000, 11)
act.shape = (25, 1000, 3)
expert data has been saved.
loading and building expert policy
('obs', (1, 17), (1, 17))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000])
('returns', [5547.48969243587, 5540.1468480510594, 5564.1321310009662, 5581.989894252667])
('mean return', 5558.4396414351404)
('std of return', 16.136499915608695)
obs.shape = (4, 1000, 17)
act.shape = (4, 1000, 6)
expert data has been saved.
loading and building expert policy
('obs', (1, 17), (1, 17))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [5502.1118374247135, 5578.4542887283278, 5512.2398724981667, 5480.4346126756691, 5449.5133369384566, 5502.7943665881003, 5480.9276296636272, 5569.665719912703, 5327.7553945147201, 5572.5422903562821, 5457.7194853879455])
('mean return', 5494.0144395171556)
('std of return', 67.950561745997945)
obs.shape = (11, 1000, 17)
act.shape = (11, 1000, 6)
expert data has been saved.
loading and building expert policy
('obs', (1, 17), (1, 17))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [5585.8651523527433, 5545.8725858595817, 5593.3858841511465, 5541.4536733079985, 5536.8433392198067, 5548.443185949196, 5498.6548050012489, 5475.8332938827907, 5549.4836474238609, 5459.2497362763906, 5459.0740379511835, 5528.9377877374009, 5556.5587139897398, 5604.7437566384324, 5514.978407911135, 5558.1672712935715, 5537.4998013342065, 5471.741539720425])
('mean return', 5531.4881455556042)
('std of return', 42.793853572241282)
obs.shape = (18, 1000, 17)
act.shape = (18, 1000, 6)
expert data has been saved.
loading and building expert policy
('obs', (1, 17), (1, 17))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 18)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 19)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 20)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 21)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 22)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 23)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 24)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [5578.8211167348663, 5426.9564890547936, 5495.0790952607813, 5543.7290426113759, 5592.7433886393574, 5575.0766246791163, 5600.9145321887609, 5453.7858257194703, 5537.535331383041, 5528.1267955137027, 5535.8466728732074, 5583.1063825847186, 5560.6501884998252, 5544.3170077259037, 5485.541927125917, 5485.1008470458091, 5497.1258420310323, 5569.4957143113088, 5564.0545535147003, 5531.2521617385428, 5543.0294088290084, 5527.3037651829809, 5541.4369527234321, 5575.584682356186, 5532.5198507543264])
('mean return', 5536.3653679632871)
('std of return', 42.16551288781001)
obs.shape = (25, 1000, 17)
act.shape = (25, 1000, 6)
expert data has been saved.
loading and building expert policy
('obs', (1, 111), (1, 111))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000])
('returns', [4681.2816282568092, 4787.1411375950993, 4962.6022230934659, 4718.8344067219441])
('mean return', 4787.4648489168294)
('std of return', 108.00256775663604)
obs.shape = (4, 1000, 111)
act.shape = (4, 1000, 8)
expert data has been saved.
loading and building expert policy
('obs', (1, 111), (1, 111))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [4970.2446233142127, 4791.8262870918743, 4990.8731983221924, 4874.367844853743, 4782.723353789389, 4945.6210534642532, 4865.8945243349517, 4671.9618293821068, 4867.5339908714595, 4704.239325228632, 4938.5563507509096])
('mean return', 4854.8947619457931)
('std of return', 101.36973550237347)
obs.shape = (11, 1000, 111)
act.shape = (11, 1000, 8)
expert data has been saved.
loading and building expert policy
('obs', (1, 111), (1, 111))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [4848.5062792733324, 4807.5472169657733, 5071.3916076107216, 4772.4505171213505, 4783.1801074056393, 4963.8292362281081, 4852.7933288258428, 4622.0828740587067, 4818.8940558568593, 4794.9563490573546, 4886.5164229950242, 4845.0225230289898, 4921.775360975862, 4793.3807631300815, 4778.7836096652518, 4896.3028634806333, 4806.808013490021, 4604.2827672429912])
('mean return', 4826.0279942451407)
('std of return', 105.15493988748189)
obs.shape = (18, 1000, 111)
act.shape = (18, 1000, 8)
expert data has been saved.
loading and building expert policy
('obs', (1, 111), (1, 111))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 18)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 19)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 20)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 21)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 22)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 23)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 24)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 848, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [4754.9337197919212, 4670.2130539203708, 4819.8898310669547, 4791.8539952710335, 4064.1610110578895, 4846.4238471685767, 4628.0066176373239, 4950.5587981938552, 4689.3336871789252, 4954.1424188340579, 4885.0033215638896, 4884.1785356975643, 4956.1598758246191, 4878.141685788979, 4668.1071285208754, 5037.0403488927232, 4775.9386244523239, 4720.482841645392, 4917.8331223202549, 4637.6262947721079, 4864.5559085830791, 4765.1690928682719, 4984.9211936248039, 4695.7990112358748, 4797.3908402291418])
('mean return', 4785.5145922456322)
('std of return', 185.68178387438161)
obs.shape = (25, 1000, 111)
act.shape = (25, 1000, 8)
expert data has been saved.
loading and building expert policy
('obs', (1, 376), (1, 376))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 18)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 19)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 20)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 21)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 22)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 23)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 24)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 25)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 26)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 27)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 28)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 29)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 30)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 31)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 32)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 33)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 34)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 35)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 36)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 37)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 38)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 39)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 40)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 41)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 42)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 43)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 44)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 45)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 46)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 47)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 48)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 49)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 50)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 51)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 52)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 53)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 54)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 55)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 56)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 57)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 58)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 59)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 60)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 61)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 62)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 63)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 64)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 65)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 66)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 67)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 68)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 69)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 70)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 71)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 72)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 73)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 74)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 75)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 76)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 77)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 78)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 79)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [10329.628439115793, 10398.40769333656, 10363.937395680477, 10427.772885501496, 10397.027566319715, 10480.217506670058, 10479.585647890968, 10424.793655049612, 10373.86341372582, 10451.954902520636, 10305.228173137471, 10387.087426785178, 10453.408417209403, 10459.609235675824, 10356.678362034181, 10398.510234478614, 10515.361236604076, 10436.751276740513, 10498.396235392102, 10370.772157990363, 10430.672213922286, 10426.891222992064, 10394.551522552107, 10445.06614354315, 10435.721466817618, 10386.238617037665, 10471.072341941242, 10402.245401001908, 10418.328439772547, 10379.46465504614, 10419.762363070555, 10392.181917234415, 10413.336573901424, 10402.971022743775, 10425.173517290992, 10397.79028895028, 10452.611322512847, 10386.092300526794, 10419.523884184186, 10301.980675007637, 10410.288969989273, 10385.581664314219, 10435.416096147484, 10429.411956568458, 10469.366651914606, 10498.987284536122, 10335.452597636722, 10361.801676501475, 10415.437900649116, 10359.093623262546, 10434.897488733488, 10489.523672114101, 10490.177397527659, 10388.55547638126, 10401.568538908834, 10439.716990475352, 10456.447774649503, 10436.715874399713, 10445.687900447212, 10459.767461022309, 10450.912812774666, 10400.022412165528, 10380.876547571568, 10404.604527280735, 10427.798994291225, 10465.62206122202, 10410.584622564747, 10379.356959054881, 10378.237741675022, 10351.699958008387, 10430.091076781926, 10447.880494902356, 10420.966302199247, 10376.269492971323, 10415.708895867358, 10387.617975152678, 10429.483388870894, 10468.32437980919, 10370.386539478415, 10366.425967282034])
('mean return', 10415.217948725152)
('std of return', 43.377480472278329)
obs.shape = (80, 1000, 376)
act.shape = (80, 1000, 17)
expert data has been saved.
loading and building expert policy
('obs', (1, 376), (1, 376))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 18)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 19)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 20)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 21)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 22)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 23)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 24)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 25)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 26)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 27)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 28)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 29)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 30)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 31)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 32)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 33)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 34)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 35)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 36)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 37)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 38)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 39)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 40)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 41)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 42)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 43)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 44)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 45)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 46)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 47)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 48)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 49)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 50)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 51)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 52)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 53)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 54)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 55)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 56)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 57)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 58)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 59)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 60)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 61)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 62)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 63)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 64)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 65)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 66)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 67)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 68)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 69)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 70)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 71)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 72)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 73)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 74)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 75)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 76)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 77)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 78)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 79)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 80)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 81)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 82)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 83)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 84)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 85)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 86)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 87)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 88)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 89)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 90)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 91)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 92)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 93)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 94)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 95)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 96)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 97)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 98)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 99)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 100)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 101)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 102)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 103)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 104)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 105)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 106)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 107)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 108)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 109)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 110)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 111)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 112)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 113)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 114)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 115)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 116)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 117)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 118)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 119)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 120)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 121)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 122)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 123)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 124)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 125)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 126)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 127)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 128)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 129)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 130)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 131)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 132)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 133)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 134)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 135)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 136)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 137)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 138)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 139)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 140)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 141)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 142)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 143)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 144)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 145)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 146)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 147)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 148)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 149)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 150)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 151)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 152)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 153)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 154)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 155)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 156)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 157)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 158)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 159)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [10348.039897018045, 10412.120600025346, 10414.600220855498, 10400.288598306544, 10425.754906339394, 10421.076194429483, 10400.632701710943, 10416.669621179697, 10276.638537743353, 10451.765342424274, 10325.61237757394, 10403.698634306571, 10440.485902785687, 10396.395646652431, 10381.090569354996, 10383.448626388783, 10409.214568325104, 10456.129785239242, 10396.076454384205, 10446.146525160382, 10444.714613069202, 10481.364722831169, 10482.614076759115, 10359.848966341915, 10448.104631147276, 10429.533199705202, 10349.853354558692, 10380.280019446729, 10423.117348283296, 10413.211546436249, 10346.124737317437, 10510.825034017083, 10486.313543964088, 10396.845288704506, 10401.494402286997, 10445.340512404808, 10434.403029474815, 10388.302267239604, 10412.754505634748, 10376.949443825182, 10393.018126891815, 10433.213791286351, 10421.152012774721, 10186.344624600346, 10344.49120008269, 10343.996541394808, 10413.359454476365, 10447.690701808311, 10328.198282106212, 10444.002274656699, 10379.84969917434, 10321.388161571269, 10433.82289639493, 10412.841410261621, 10371.889775892891, 10427.675977073088, 10390.497618321489, 10498.835163041722, 10394.654046401409, 10495.03278044463, 10427.264469624544, 10516.451735081086, 10440.677304396846, 10451.095179872322, 10339.01297731753, 10416.064377272531, 10385.277856504987, 10446.098255812693, 10477.363623988707, 10337.205649890591, 10433.560924742498, 10398.504043972673, 10444.008071255814, 10457.594456360755, 10486.141167680124, 10380.752306783172, 10341.330094617224, 10367.67561648005, 10440.95883467388, 10306.306715833394, 10445.656606537519, 10454.775640870996, 10290.568617039909, 10389.119039196186, 10325.744217813219, 10237.367014540105, 10261.826908243245, 10360.634578658341, 10471.424822712983, 10381.783356597498, 10433.17492549642, 10485.087354167277, 10354.714589113271, 10421.06611572363, 10424.774213241597, 10467.833384009717, 10371.979916305236, 10466.426261599292, 10409.449391913096, 
10380.835049441102, 10410.073548892695, 10367.193723856519, 10379.088885072279, 10467.696826160809, 10464.128752202789, 10386.74244888222, 10293.183884741829, 10483.476608244462, 10458.08024040571, 10460.411334963712, 10433.210048462861, 10441.604706421725, 10389.483742197081, 10431.16297873758, 10440.07447219877, 10396.946535618768, 10418.482020834617, 10322.581761528521, 10414.728150806106, 10460.013436823698, 10415.143800597814, 10470.79940001007, 10548.32062854944, 10470.238487157361, 10366.476299413807, 10344.317240008371, 10305.634106269641, 10389.172160090457, 10354.519992588741, 10428.600682833614, 10319.42771561299, 10435.137602835264, 10421.888519098182, 10299.99848375721, 10415.667749808155, 10398.732209077052, 10469.894975191943, 10410.759600796177, 10453.668437671595, 10341.595393798012, 10393.504827407303, 10425.195590290434, 10428.973097295844, 10400.902230025302, 10405.970511773929, 10291.347476732706, 10356.527889358571, 10450.119402821079, 10455.911873713219, 10407.519825571921, 10390.519102863, 10331.557411498277, 10315.628237308545, 10360.292187022018, 10448.939786697532, 10376.699852817559, 10449.694663498973, 10326.637755602909, 10388.850561283136, 10425.00257666954])
('mean return', 10403.972335063499)
('std of return', 56.536074597213009)
obs.shape = (160, 1000, 376)
act.shape = (160, 1000, 17)
expert data has been saved.
loading and building expert policy
('obs', (1, 376), (1, 376))
loaded and built
('roll/traj', 0)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 1)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 2)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 3)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 4)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 5)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 6)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 7)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 8)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 9)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 10)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 11)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 12)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 13)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 14)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 15)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 16)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 17)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 18)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 19)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 20)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 21)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 22)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 23)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 24)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 25)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 26)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 27)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 28)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 29)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 30)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 31)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 32)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 33)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 34)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 35)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 36)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 37)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 38)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 39)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 40)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 41)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 42)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 43)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 44)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 45)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 46)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 47)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 48)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 49)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 50)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 51)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 52)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 53)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 54)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 55)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 56)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 57)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 58)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 59)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 60)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 61)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 62)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 63)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 64)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 65)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 66)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 67)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 68)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 69)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 70)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 71)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 72)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 73)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 74)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 75)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 76)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 77)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 78)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 79)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 80)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 81)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 82)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 83)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 84)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 85)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 86)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 87)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 88)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 89)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 90)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 91)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 92)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 93)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 94)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 95)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 96)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 97)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 98)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 99)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 100)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 101)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 102)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 103)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 104)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 105)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 106)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 107)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 108)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 109)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 110)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 111)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 112)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 113)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 114)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 115)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 116)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 117)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 118)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 119)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 120)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 121)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 122)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 123)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 124)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 125)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 126)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 127)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 128)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 129)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 130)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 131)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 132)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 133)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 134)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 135)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 136)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 137)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 138)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 139)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 140)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 141)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 142)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 143)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 144)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 145)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 146)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 147)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 148)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 149)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 150)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 151)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 152)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 153)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 154)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 155)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 156)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 157)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 158)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 159)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 160)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 161)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 162)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 163)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 164)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 165)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 166)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 167)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 168)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 169)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 170)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 171)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 172)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 173)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 174)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 175)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 176)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 177)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 178)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 179)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 180)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 181)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 182)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 183)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 184)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 185)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 186)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 187)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 188)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 189)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 190)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 191)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 192)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 193)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 194)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 195)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 196)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 197)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 198)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 199)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 200)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 201)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 202)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 203)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 204)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 205)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 206)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 207)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 208)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 209)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 210)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 211)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 212)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 213)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 214)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 215)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 216)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 217)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 218)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 219)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 220)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 221)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 222)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 223)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 224)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 225)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 226)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 227)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 228)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 229)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 230)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 231)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 232)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 233)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 234)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 235)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 236)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 237)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 238)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('roll/traj', 239)
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
('steps', [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000])
('returns', [10373.442886251316, 10504.355777991737, 10432.946322763684, 10375.656078906675, 10388.623410275639, 10462.871400254879, 10447.049903637855, 10468.962907885809, 10386.398832025012, 10413.565351255793, 10414.86442966, 10387.104676430643, 10291.832182996131, 10371.844971701123, 10313.509324820825, 10389.558033043339, 10462.230554781279, 10397.30590295671, 10369.880983921739, 10340.822300346397, 10353.539620925292, 10389.980886577376, 10395.351222058045, 10329.003792956428, 10396.362315753431, 10510.120904021258, 10430.37977017577, 10393.3237347747, 10405.577042784973, 10437.507940315096, 10389.40984919135, 10441.886814922711, 10354.146524368323, 10420.851929535253, 10409.924515427872, 10374.921911984509, 10444.903227638719, 10352.652058183083, 10356.484338998533, 10342.600534301917, 10432.670817777791, 10492.151961155751, 10402.912114518849, 10419.429942828756, 10317.839698843383, 10382.492006727474, 10446.530503288828, 10374.570492862162, 10397.208863404074, 10266.706805183734, 10416.179858245298, 10434.742134793834, 10373.126922936233, 10406.674622503893, 10418.229651572323, 10402.270458181169, 10368.692180294076, 10409.498573028788, 10363.260951873779, 10353.812669988863, 10430.942618772633, 10398.42252303828, 10416.788493524968, 10472.738858972663, 10349.481925487231, 10399.001205951523, 10344.758339293989, 10217.136648804568, 10295.514656104158, 10490.128550579466, 10374.027586273838, 10399.312488312025, 10430.910978172271, 10377.93738789266, 10374.857693934313, 10402.432197219552, 10450.470080731284, 10454.213608206494, 10441.299822640272, 10413.161081259577, 10273.410406100049, 10371.828428763667, 10437.77529662195, 10488.507151865075, 10345.316789336943, 10306.638089431763, 10420.653871506474, 10469.759091979537, 10329.174792354786, 10456.360935675255, 10447.272201189575, 10490.20181057477, 10450.01344325538, 10501.668125669008, 10274.301563853547, 10387.431930814124, 10353.122390058918, 10386.159658583991, 10311.433527062274, 10356.19524367908, 
10439.575750982758, 10349.103374986189, 10430.670137365511, 10353.721451433979, 10379.745945867631, 10425.619261666705, 10376.787489417442, 10392.128358818254, 10290.753092985407, 10357.052935577196, 10380.63342076658, 10383.900730398709, 10321.091692639871, 10440.031320293632, 10277.768010621834, 10409.57389326697, 10209.579904225844, 10350.043036005216, 10453.173106207145, 10464.746220207116, 10358.489480404436, 10459.021927889535, 10425.247778259683, 10449.055511809527, 10396.232434623043, 10306.409812089978, 10423.704723896422, 10415.832776541676, 10466.721677830237, 10447.8303467038, 10393.301272103035, 10421.294879923726, 10411.417974321503, 10458.618584668042, 10419.387716003352, 10543.295254667795, 10384.609154498008, 10427.159882637794, 10360.761322763399, 10424.460520418381, 10435.096285852271, 10330.872645338357, 10347.01047740803, 10384.485490797815, 10455.289491820226, 10428.051036607136, 10437.291819751952, 10406.218134566681, 10410.740110757266, 10427.780958498059, 10377.37595987856, 10386.781010159781, 10405.597813276039, 10444.473158066465, 10357.521564866434, 10398.087978429083, 10449.471814554323, 10491.411052286205, 10394.943343575726, 10442.424813246422, 10351.247272789924, 10392.472367642747, 10459.498986757979, 10405.078737359368, 10431.369573219452, 10409.384507045541, 10429.920399449567, 10430.139483486206, 10400.166309104887, 10385.279828352872, 10323.38182495301, 10402.594885043121, 10395.234063557686, 10401.663069061864, 10354.731792675035, 10416.939236951337, 10390.76726513448, 10365.781944224063, 10399.47368367667, 10406.028495274742, 10303.548008764921, 10381.858554844968, 10391.171115721796, 10434.057022919505, 10526.400373624814, 10351.899858578001, 10331.53404736148, 10344.478510266285, 10360.782187448216, 10523.333948085099, 10391.527307762213, 10427.908910140995, 10427.087703234074, 10377.211022652822, 10422.78745307182, 10494.775008519529, 10441.8424524811, 10251.49765472824, 10445.902117181255, 10410.218595788947, 
10459.329571664388, 10414.940464319339, 10430.810372776283, 10381.978232198366, 10456.979742591213, 10418.619527943896, 10452.278191124849, 10262.092508010348, 10374.296966847185, 10421.733506427247, 10398.138584384178, 10414.298366028546, 10478.326591074978, 10448.339810533314, 10421.75194610956, 10391.227528206966, 10420.032987012893, 10353.462399375221, 10294.670999946606, 10362.879212005173, 10476.953305078934, 10251.161544427283, 10350.515514970182, 10406.155922769996, 10380.15566584465, 10408.873197396146, 10465.964412718173, 10399.430126710264, 10378.477651181529, 10510.420584358521, 10351.237789878212, 10357.110547450866, 10410.980673340327, 10363.275786019311, 10393.692737450561, 10416.361115548065, 10450.104331704217, 10430.483607858447, 10482.403526232087, 10461.183494042869])
('mean return', 10399.47124849325)
('std of return', 54.868826640962055)
obs.shape = (240, 1000, 376)
act.shape = (240, 1000, 17)
expert data has been saved.
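
Note: a minimal sketch, not part of the repository, of how the saved expert data might be read back. It assumes the dictionary layout that `bc/run_expert.py` writes (`observations`, `actions`, `returns`, `steps`) and that `np.save` stored the dict as a pickled object array. The helper names, the tiny array sizes, and the `Demo-v0_003` file name are hypothetical.

```python
import os
import tempfile

import numpy as np


def load_expert_data(path):
    """Load a dict written with np.save (stored as a pickled 0-d object array)."""
    return np.load(path, allow_pickle=True).item()


def valid_pairs(data):
    """Concatenate (obs, act) pairs, dropping the padded tail of each trajectory."""
    obs, act, steps = data['observations'], data['actions'], data['steps']
    obs_list = [obs[i, :n] for i, n in enumerate(steps)]
    act_list = [act[i, :n] for i, n in enumerate(steps)]
    return np.concatenate(obs_list), np.concatenate(act_list)


# Tiny stand-in data set (hypothetical sizes): 3 trajectories, 5 steps each,
# obs_dim=4, act_dim=2. The second trajectory ended after 3 real steps.
demo = {'observations': np.zeros((3, 5, 4)),
        'actions': np.zeros((3, 5, 2)),
        'returns': [1.0, 2.0, 3.0],
        'steps': [5, 3, 5]}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "Demo-v0_003")  # hypothetical file name
np.save(path, demo)                          # np.save appends ".npy"

loaded = load_expert_data(path + ".npy")
flat_obs, flat_act = valid_pairs(loaded)
print(flat_obs.shape, flat_act.shape)
```

In the run logged above every trajectory reached 1000 steps, so no padding would be dropped; the `steps` list only matters when an episode terminates early.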


================================================
FILE: bc/run_expert.py
================================================
"""
Code to load an expert policy and generate roll-out data for behavioral cloning.
Example usage:

    python run_expert.py experts/Humanoid-v1.pkl Humanoid-v1 --render \
            --num_rollouts 20

Author of this script and included expert policies: Jonathan Ho (hoj@openai.com)

(Daniel) I save arrays of trajectories of shape

    (numtrajs, numtimes, obs_dim)  # observations
    (numtrajs, numtimes, act_dim)  # actions, squeezed as needed

along with lists of returns and steps, each of length `numtrajs`.

However, this requires padding trajectories that terminate early (should be rare
with experts, but it can still happen). The padding repeats the last observation
and action until `max_steps` is reached. Thus, I also save the per-trajectory
`steps` list, which tells us where the real data in each trajectory stops.
"""

import pickle
import tensorflow as tf
import numpy as np
import tf_util
import gym
import load_policy


def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('expert_policy_file', type=str)
    parser.add_argument('envname', type=str)
    parser.add_argument('--render', action='store_true')
    parser.add_argument('--save', action='store_true')
    parser.add_argument('--max_timesteps', type=int)
    parser.add_argument('--num_rollouts', type=int, default=20,
                        help='Number of expert roll outs')
    args = parser.parse_args()

    print('loading and building expert policy')
    policy_fn = load_policy.load_policy(args.expert_policy_file)
    print('loaded and built')

    with tf.Session():
        tf_util.initialize()

        env = gym.make(args.envname)
        max_steps = args.max_timesteps or env.spec.timestep_limit

        all_observations = []
        all_actions = []
        all_steps = []
        all_returns = []

        for i in range(args.num_rollouts):
            print('roll/traj', i)
            obs = env.reset()
            done = False
            totalr = 0.
            steps = 0
            observations = []
            actions = []
            while not done:
                action = policy_fn(obs[None,:])
                observations.append(obs)
                actions.append(action)
                obs, r, done, _ = env.step(action)
                totalr += r
                steps += 1
                if args.render:
                    env.render()
                if steps % 100 == 0: print("%i/%i"%(steps, max_steps))
                if steps >= max_steps:
                    break
            all_returns.append(totalr)
            all_steps.append(steps)

            # Ensure that the observation and action lists have length max_steps.
            # To keep it simple, pad by repeating the last obs/action, since
            # they already have the correct dimensions.
            while steps < max_steps:
                observations.append(obs)
                actions.append(action)
                steps += 1
            assert len(observations) == max_steps, "{}".format(len(observations))
            assert len(actions) == max_steps, "{}".format(len(actions))
            all_observations.append(observations)
            all_actions.append(actions)

        # Squeezing since we know MuJoCo does some (1,D)-dim actions.
        expert_data = {'observations': np.array(all_observations),
                       'actions': np.squeeze(np.array(all_actions)),
                       'returns': all_returns,
                       'steps': all_steps}

        print('steps', all_steps)
        print('returns', all_returns)
        print('mean return', np.mean(all_returns))
        print('std of return', np.std(all_returns))
        print("obs.shape = {}".format(expert_data['observations'].shape))
        print("act.shape = {}".format(expert_data['actions'].shape))

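        if args.save:
            # np.save fails if the target directory is missing, so create it.
            import os
            if not os.path.isdir("expert_data"):
                os.makedirs("expert_data")
            str_roll = str(args.num_rollouts).zfill(3)
            # The dict is pickled into a .npy file (np.save appends the suffix).
            np.save("expert_data/" +args.envname+ "_" +str_roll, expert_data)
            print("expert data has been saved.")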


if __name__ == '__main__':
    main()


================================================
FILE: bc/tf_util.py
================================================
import numpy as np
import tensorflow as tf # pylint: ignore-module
#import builtins
import functools
import copy
import os
import collections

# ================================================================
# Import all names into common namespace
# ================================================================

clip = tf.clip_by_value

# Make consistent with numpy
# ----------------------------------------

def sum(x, axis=None, keepdims=False):
    return tf.reduce_sum(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def mean(x, axis=None, keepdims=False):
    return tf.reduce_mean(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def var(x, axis=None, keepdims=False):
    meanx = mean(x, axis=axis, keepdims=keepdims)
    return mean(tf.square(x - meanx), axis=axis, keepdims=keepdims)
def std(x, axis=None, keepdims=False):
    return tf.sqrt(var(x, axis=axis, keepdims=keepdims))
def max(x, axis=None, keepdims=False):
    return tf.reduce_max(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def min(x, axis=None, keepdims=False):
    return tf.reduce_min(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def concatenate(arrs, axis=0):
    return tf.concat(axis, arrs)
def argmax(x, axis=None):
    return tf.argmax(x, dimension=axis)

def switch(condition, then_expression, else_expression):
    '''Switches between two operations depending on a scalar value (int or bool).
    Note that both `then_expression` and `else_expression`
    should be symbolic tensors of the *same shape*.

    # Arguments
        condition: scalar tensor.
        then_expression: TensorFlow operation.
        else_expression: TensorFlow operation.
    '''
    x_shape = copy.copy(then_expression.get_shape())
    x = tf.cond(tf.cast(condition, 'bool'),
                lambda: then_expression,
                lambda: else_expression)
    x.set_shape(x_shape)
    return x

# Extras
# ----------------------------------------
def l2loss(params):
    if len(params) == 0:
        return tf.constant(0.0)
    else:
        return tf.add_n([sum(tf.square(p)) for p in params])
def lrelu(x, leak=0.2):
    f1 = 0.5 * (1 + leak)
    f2 = 0.5 * (1 - leak)
    return f1 * x + f2 * abs(x)
def categorical_sample_logits(X):
    # https://github.com/tensorflow/tensorflow/issues/456
    U = tf.random_uniform(tf.shape(X))
    return argmax(X - tf.log(-tf.log(U)), axis=1)

# ================================================================
# Global session
# ================================================================

def get_session():
    return tf.get_default_session()

def single_threaded_session():
    tf_config = tf.ConfigProto(
        inter_op_parallelism_threads=1,
        intra_op_parallelism_threads=1)
    return tf.Session(config=tf_config)

def make_session(num_cpu):
    tf_config = tf.ConfigProto(
        inter_op_parallelism_threads=num_cpu,
        intra_op_parallelism_threads=num_cpu)
    return tf.Session(config=tf_config)


ALREADY_INITIALIZED = set()
def initialize():
    new_variables = set(tf.global_variables()) - ALREADY_INITIALIZED
    get_session().run(tf.variables_initializer(new_variables))
    ALREADY_INITIALIZED.update(new_variables)


def eval(expr, feed_dict=None):
    if feed_dict is None: feed_dict = {}
    return get_session().run(expr, feed_dict=feed_dict)

def set_value(v, val):
    get_session().run(v.assign(val))

def load_state(fname):
    saver = tf.train.Saver()
    saver.restore(get_session(), fname)

def save_state(fname):
    os.makedirs(os.path.dirname(fname), exist_ok=True)
    saver = tf.train.Saver()
    saver.save(get_session(), fname)

# ================================================================
# Model components
# ================================================================


def normc_initializer(std=1.0):
    def _initializer(shape, dtype=None, partition_info=None): #pylint: disable=W0613
        out = np.random.randn(*shape).astype(np.float32)
        out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
        return tf.constant(out)
    return _initializer


def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad="SAME", dtype=tf.float32, collections=None,
           summary_tag=None):
    with tf.variable_scope(name):
        stride_shape = [1, stride[0], stride[1], 1]
        filter_shape = [filter_size[0], filter_size[1], int(x.get_shape()[3]), num_filters]

        # there are "num input feature maps * filter height * filter width"
        # inputs to each hidden unit
        fan_in = intprod(filter_shape[:3])
        # each unit in the lower layer receives a gradient from:
        # "num output feature maps * filter height * filter width" /
        #   pooling size
        fan_out = intprod(filter_shape[:2]) * num_filters
        # initialize weights with random weights
        w_bound = np.sqrt(6. / (fan_in + fan_out))

        w = tf.get_variable("W", filter_shape, dtype, tf.random_uniform_initializer(-w_bound, w_bound),
                            collections=collections)
        b = tf.get_variable("b", [1, 1, 1, num_filters], initializer=tf.zeros_initializer,
                            collections=collections)

        if summary_tag is not None:
            tf.image_summary(summary_tag,
                             tf.transpose(tf.reshape(w, [filter_size[0], filter_size[1], -1, 1]),
                                          [2, 0, 1, 3]),
                             max_images=10)

        return tf.nn.conv2d(x, w, stride_shape, pad) + b


def dense(x, size, name, weight_init=None, bias=True):
    w = tf.get_variable(name + "/w", [x.get_shape()[1], size], initializer=weight_init)
    ret = tf.matmul(x, w)
    if bias:
        b = tf.get_variable(name + "/b", [size], initializer=tf.zeros_initializer)
        return ret + b
    else:
        return ret

def wndense(x, size, name, init_scale=1.0):
    v = tf.get_variable(name + "/V", [int(x.get_shape()[1]), size],
                        initializer=tf.random_normal_initializer(0, 0.05))
    g = tf.get_variable(name + "/g", [size], initializer=tf.constant_initializer(init_scale))
    b = tf.get_variable(name + "/b", [size], initializer=tf.constant_initializer(0.0))

    # use weight normalization (Salimans & Kingma, 2016)
    x = tf.matmul(x, v)
    scaler = g / tf.sqrt(sum(tf.square(v), axis=0, keepdims=True))
    return tf.reshape(scaler, [1, size]) * x + tf.reshape(b, [1, size])

def densenobias(x, size, name, weight_init=None):
    return dense(x, size, name, weight_init=weight_init, bias=False)

def dropout(x, pkeep, phase=None, mask=None):
    mask = tf.floor(pkeep + tf.random_uniform(tf.shape(x))) if mask is None else mask
    if phase is None:
        return mask * x
    else:
        return switch(phase, mask*x, pkeep*x)

def batchnorm(x, name, phase, updates, gamma=0.96):
    k = x.get_shape()[1]
    runningmean = tf.get_variable(name+"/mean", shape=[1, k], initializer=tf.constant_initializer(0.0), trainable=False)
    runningvar = tf.get_variable(name+"/var", shape=[1, k], initializer=tf.constant_initializer(1e-4), trainable=False)
    testy = (x - runningmean) / tf.sqrt(runningvar)

    mean_ = mean(x, axis=0, keepdims=True)
    # Batch variance: mean squared deviation from the batch mean.
    var_ = mean(tf.square(x - mean_), axis=0, keepdims=True)
    std = tf.sqrt(var_)
    trainy = (x - mean_) / std

    updates.extend([
        tf.assign(runningmean, runningmean * gamma + mean_ * (1 - gamma)),
        tf.assign(runningvar, runningvar * gamma + var_ * (1 - gamma))
    ])

    y = switch(phase, trainy, testy)

    out = y * tf.get_variable(name+"/scaling", shape=[1, k], initializer=tf.constant_initializer(1.0), trainable=True)\
            + tf.get_variable(name+"/translation", shape=[1,k], initializer=tf.constant_initializer(0.0), trainable=True)
    return out



# ================================================================
# Basic Stuff
# ================================================================

def function(inputs, outputs, updates=None, givens=None):
    if isinstance(outputs, list):
        return _Function(inputs, outputs, updates, givens=givens)
    elif isinstance(outputs, (dict, collections.OrderedDict)):
        f = _Function(inputs, outputs.values(), updates, givens=givens)
        return lambda *inputs : type(outputs)(zip(outputs.keys(), f(*inputs)))
    else:
        f = _Function(inputs, [outputs], updates, givens=givens)
        return lambda *inputs : f(*inputs)[0]

class _Function(object):
    def __init__(self, inputs, outputs, updates, givens, check_nan=False):
        assert all(len(i.op.inputs)==0 for i in inputs), "inputs should all be placeholders"
        self.inputs = inputs
        updates = updates or []
        self.update_group = tf.group(*updates)
        self.outputs_update = list(outputs) + [self.update_group]
        self.givens = {} if givens is None else givens
        self.check_nan = check_nan
    def __call__(self, *inputvals):
        assert len(inputvals) == len(self.inputs)
        feed_dict = dict(zip(self.inputs, inputvals))
        feed_dict.update(self.givens)
        results = get_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
        if self.check_nan:
            if any(np.isnan(r).any() for r in results):
                raise RuntimeError("Nan detected")
        return results

def mem_friendly_function(nondata_inputs, data_inputs, outputs, batch_size):
    if isinstance(outputs, list):
        return _MemFriendlyFunction(nondata_inputs, data_inputs, outputs, batch_size)
    else:
        f = _MemFriendlyFunction(nondata_inputs, data_inputs, [outputs], batch_size)
        return lambda *inputs : f(*inputs)[0]

class _MemFriendlyFunction(object):
    def __init__(self, nondata_inputs, data_inputs, outputs, batch_size):
        self.nondata_inputs = nondata_inputs
        self.data_inputs = data_inputs
        self.outputs = list(outputs)
        self.batch_size = batch_size
    def __call__(self, *inputvals):
        assert len(inputvals) == len(self.nondata_inputs) + len(self.data_inputs)
        nondata_vals = inputvals[0:len(self.nondata_inputs)]
        data_vals = inputvals[len(self.nondata_inputs):]
        feed_dict = dict(zip(self.nondata_inputs, nondata_vals))
        n = data_vals[0].shape[0]
        for v in data_vals[1:]:
            assert v.shape[0] == n
        for i_start in range(0, n, self.batch_size):
            slice_vals = [v[i_start:min(i_start+self.batch_size, n)] for v in data_vals]
            for (var,val) in zip(self.data_inputs, slice_vals):
                feed_dict[var]=val
            results = tf.get_default_session().run(self.outputs, feed_dict=feed_dict)
            if i_start==0:
                sum_results = results
            else:
                for i in range(len(results)):
                    sum_results[i] = sum_results[i] + results[i]
        for i in range(len(results)):
            sum_results[i] = sum_results[i] / n
        return sum_results

# ================================================================
# Modules
# ================================================================

class Module(object):
    def __init__(self, name):
        self.name = name
        self.first_time = True
        self.scope = None
        self.cache = {}
    def __call__(self, *args):
        if args in self.cache:
            print("(%s) retrieving value from cache"%self.name)
            return self.cache[args]
        with tf.variable_scope(self.name, reuse=not self.first_time):
            scope = tf.get_variable_scope().name
            if self.first_time:
                self.scope = scope
                print("(%s) running function for the first time"%self.name)
            else:
                assert self.scope == scope, "Tried calling function with a different scope"
                print("(%s) running function on new inputs"%self.name)
            self.first_time = False
            out = self._call(*args)
        self.cache[args] = out
        return out
    def _call(self, *args):
        raise NotImplementedError

    @property
    def trainable_variables(self):
        assert self.scope is not None, "need to call module once before getting variables"
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)

    @property
    def variables(self):
        assert self.scope is not None, "need to call module once before getting variables"
        return tf.get_collection(tf.GraphKeys.VARIABLES, self.scope)


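def module(name):
    # Decorator factory: `@module(name)` replaces `f` with a Module instance
    # whose `_call` forwards to `f`. (A bare `@functools.wraps` on `wrapper`
    # would hand back `f` unchanged instead of a WrapperModule.)
    def wrapper(f):
        class WrapperModule(Module):
            def _call(self, *args):
                return f(*args)
        return WrapperModule(name)
    return wrapper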

# ================================================================
# Graph traversal
# ================================================================

VARIABLES = {}


def get_parents(node):
    return node.op.inputs

def topsorted(outputs):
    """
    Topological sort via non-recursive depth-first search
    """
    assert isinstance(outputs, (list,tuple))
    marks = {}
    out = []
    stack = [] #pylint: disable=W0621
    # i: node
    # jidx = number of children visited so far from that node
    # marks: state of each node, which is one of
    #   0: haven't visited
    #   1: have visited, but not done visiting children
    #   2: done visiting children
    for x in outputs:
        stack.append((x,0))
        while stack:
            (i,jidx) = stack.pop()
            if jidx == 0:
                m = marks.get(i,0)
                if m == 0:
                    marks[i] = 1
                elif m == 1:
                    raise ValueError("not a dag")
                else:
                    continue
            ps = get_parents(i)
            if jidx == len(ps):
                marks[i] = 2
                out.append(i)
            else:
                stack.append((i,jidx+1))
                j = ps[jidx]
                stack.append((j,0))
    return out


# ================================================================
# Flat vectors
# ================================================================

def var_shape(x):
    out = [k.value for k in x.get_shape()]
    assert all(isinstance(a, int) for a in out), \
        "shape function assumes that shape is fully known"
    return out

def numel(x):
    return intprod(var_shape(x))

def intprod(x):
    return int(np.prod(x))

def flatgrad(loss, var_list):
    grads = tf.gradients(loss, var_list)
    return tf.concat(0, [tf.reshape(grad, [numel(v)])
        for (v, grad) in zip(var_list, grads)])

class SetFromFlat(object):
    def __init__(self, var_list, dtype=tf.float32):
        assigns = []
        shapes = list(map(var_shape, var_list))
        total_size = np.sum([intprod(shape) for shape in shapes])

        self.theta = theta = tf.placeholder(dtype,[total_size])
        start=0
        assigns = []
        for (shape,v) in zip(shapes,var_list):
            size = intprod(shape)
            assigns.append(tf.assign(v, tf.reshape(theta[start:start+size],shape)))
            start+=size
        self.op = tf.group(*assigns)
    def __call__(self, theta):
        get_session().run(self.op, feed_dict={self.theta:theta})

class GetFlat(object):
    def __init__(self, var_list):
        self.op = tf.concat(0, [tf.reshape(v, [numel(v)]) for v in var_list])
    def __call__(self):
        return get_session().run(self.op)

# ================================================================
# Misc
# ================================================================


def fancy_slice_2d(X, inds0, inds1):
    """
    like numpy X[inds0, inds1]
    XXX this implementation is bad
    """
    inds0 = tf.cast(inds0, tf.int64)
    inds1 = tf.cast(inds1, tf.int64)
    shape = tf.cast(tf.shape(X), tf.int64)
    ncols = shape[1]
    Xflat = tf.reshape(X, [-1])
    return tf.gather(Xflat, inds0 * ncols + inds1)


def scope_vars(scope, trainable_only):
    """
    Get variables inside a scope
    The scope can be specified as a string
    """
    return tf.get_collection(
        tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.VARIABLES,
        scope=scope if isinstance(scope, str) else scope.name
    )

def lengths_to_mask(lengths_b, max_length):
    """
    Turns a vector of lengths into a boolean mask

    Args:
        lengths_b: an integer vector of lengths
        max_length: maximum length to fill the mask

    Returns:
        a boolean array of shape (batch_size, max_length)
        row[i] consists of True repeated lengths_b[i] times, followed by False
    """
    lengths_b = tf.convert_to_tensor(lengths_b)
    assert lengths_b.get_shape().ndims == 1
    mask_bt = tf.expand_dims(tf.range(max_length), 0) < tf.expand_dims(lengths_b, 1)
    return mask_bt


def in_session(f):
    @functools.wraps(f)
    def newfunc(*args, **kwargs):
        with tf.Session():
            return f(*args, **kwargs)
    return newfunc


_PLACEHOLDER_CACHE = {} # name -> (placeholder, dtype, shape)
def get_placeholder(name, dtype, shape):
    print("calling get_placeholder", name)
    if name in _PLACEHOLDER_CACHE:
        out, dtype1, shape1 = _PLACEHOLDER_CACHE[name]
        assert dtype1==dtype and shape1==shape
        return out
    else:
        out = tf.placeholder(dtype=dtype, shape=shape, name=name)
        _PLACEHOLDER_CACHE[name] = (out,dtype,shape)
        return out
def get_placeholder_cached(name):
    return _PLACEHOLDER_CACHE[name][0]

def flattenallbut0(x):
    return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])

def reset():
    global _PLACEHOLDER_CACHE
    global VARIABLES
    _PLACEHOLDER_CACHE = {}
    VARIABLES = {}
    tf.reset_default_graph()


================================================
FILE: ddpg/README.md
================================================
# Deep Deterministic Policy Gradients

- Python 3.5
- Tensorflow 1.2

I'm following the original DDPG paper as closely as possible, and using their
"low-dimensional" state representation, not the pixel-based one.

## Pendulum-v0

Action space: -2 to 2.
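
As a rough illustration of how that bound can be respected, the sketch below
squashes an unbounded network output with `tanh` and rescales it by the action
limit, which mirrors the `tanh` output layer multiplied by `ac_high` in
`ddpg.py`. The helper name `scale_action` is hypothetical and the code uses
plain NumPy rather than TensorFlow.

```python
import numpy as np


def scale_action(raw_output, ac_high):
    """Map an unbounded output into [-ac_high, ac_high] (hypothetical helper)."""
    return np.tanh(raw_output) * ac_high


# Pendulum-v0 bound stated above; in practice this would come from
# env.action_space.high.
ac_high = np.array([2.0])

for raw in (-50.0, 0.0, 50.0):
    print(raw, scale_action(np.array([raw]), ac_high))
```

Exploration noise added afterwards can push an action outside the bound, so
clipping after the noise step may still be needed.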

```
python main.py Pendulum-v0
```

Status: not yet working. I think it's done but alas there is some bug somewhere.
Ugh.


## References

(These might be useful to supplement the original paper.)

- http://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
- https://github.com/rmst/ddpg
- https://github.com/openai/rllab
- https://github.com/yukezhu/tensorflow-reinforce
- https://github.com/stevenpjg/ddpg-aigym


================================================
FILE: ddpg/ddpg.py
================================================
"""
Deep Deterministic Policy Gradients

Make Actor and Critic subclasses of a NNet class? Not sure ...  for now, I'll
put everything here but that might take a lot.
"""

import gym
import numpy as np
import sys
import tensorflow as tf
import tensorflow.contrib.layers as layers
import time
from replay_buffer import ReplayBuffer
from collections import defaultdict
sys.path.append("../")
from utils import logz


class DDPGAgent(object):

    def __init__(self, sess, env, test_env, args):
        self.sess = sess
        self.args = args
        self.env = env
        self.test_env = test_env
        self.ob_dim = env.observation_space.shape[0]
        self.ac_dim = env.action_space.shape[0]

        # Construct the networks and the experience replay buffer.
        self.actor   = Actor(sess, env, args)
        self.critic  = Critic(sess, env, args)
        self.rbuffer = ReplayBuffer(args.replay_size, self.ob_dim, self.ac_dim)

        # Initialize then run, also setting current=target to start.
        self._debug_print()
        self.sess.run(tf.global_variables_initializer())
        self.actor.update_target_net(smooth=False)
        self.critic.update_target_net(smooth=False)


    def train(self):
        """ 
        Algorithm 1 in the DDPG paper. 
        """
        num_episodes = 0
        t_start = time.time()
        obs = self.env.reset()

        for t in range(self.args.n_iter):
            if (t % self.args.log_every_t_iter == 0) and (t > self.args.wait_until_rbuffer):
                print("\n*** DDPG Iteration {} ***".format(t))

            # Sample actions with noise injection and manage buffer.
            act = self.actor.sample_action(obs, train=True)
            new_obs, rew, done, info = self.env.step(act)
            self.rbuffer.add_sample(s=obs, a=act, r=rew, done=done)
            if done:
                obs = self.env.reset()
                num_episodes += 1
            else:
                obs = new_obs

            if (t > self.args.wait_until_rbuffer) and (t % self.args.learning_freq == 0):
                # Sample from the replay buffer.
                states_t_BO, actions_t_BA, rewards_t_B, states_tp1_BO, done_mask_B = \
                        self.rbuffer.sample(num=self.args.batch_size)

                feed = {'obs_t_BO':    states_t_BO, 
                        'act_t_BA':    actions_t_BA, 
                        'rew_t_B':     rewards_t_B, 
                        'obs_tp1_BO':  states_tp1_BO, 
                        'done_mask_B': done_mask_B}

                # Update the critic, get sampled policy gradients, update actor.
                a_grads_BA, l2_error = self.critic.update_weights(feed)
                actor_gradients = self.actor.update_weights(feed, a_grads_BA)

                # Update both target networks.
                self.critic.update_target_net()
                self.actor.update_target_net()

            if (t % self.args.log_every_t_iter == 0) and (t > self.args.wait_until_rbuffer):
                # Do some rollouts here and then record statistics.  Note that
                # some of these stats rely on stuff computed from sampling the
                # replay buffer, so be careful interpreting these. The code
                # probably needs to guard against this case as well.
                stats = self._do_rollouts()
                hours = (time.time()-t_start) / (60*60.)
                logz.log_tabular("MeanReward",     np.mean(stats['reward']))
                logz.log_tabular("MaxReward",      np.max(stats['reward']))
                logz.log_tabular("MinReward",      np.min(stats['reward']))
                logz.log_tabular("StdReward",      np.std(stats['reward']))
                logz.log_tabular("MeanLength",     np.mean(stats['length']))
                logz.log_tabular("NumTrainingEps", num_episodes)
                logz.log_tabular("L2ErrorCritic",  l2_error)
                logz.log_tabular("QaGradL2Norm",   np.linalg.norm(a_grads_BA))
                logz.log_tabular("TimeHours",      hours)
                logz.log_tabular("Iterations",     t)
                logz.dump_tabular()


    def _do_rollouts(self):
        """ 
        Some rollouts to evaluate the agent's progress.  Returns a dictionary
        containing relevant statistics. Later, I should parallelize this using
        an array of environments.
        """
        num_episodes = 50
        stats = defaultdict(list)

        for i in range(num_episodes):
            obs = self.test_env.reset()
            ep_time = 0
            ep_reward = 0

            # Run one episode ...
            while True:
                act = self.actor.sample_action(obs, train=False)
                new_obs, rew, done, info = self.test_env.step(act)
                obs = new_obs  # Advance to the next state before acting again.
                ep_time += 1
                ep_reward += rew
                if done:
                    break

            # ... and collect its information here.
            stats['length'].append(ep_time)
            stats['reward'].append(ep_reward)

        return stats


    def _debug_print(self):
        print("\n\t(A bunch of debug prints)\n")

        print("\nActor weights")
        for v in self.actor.weights:
            shp = v.get_shape().as_list()
            print("- {} shape:{} size:{}".format(v.name, shp, np.prod(shp)))
        print("Total # of weights: {}.".format(self.actor.num_weights))

        print("\nCritic weights")
        for v in self.critic.weights:
            shp = v.get_shape().as_list()
            print("- {} shape:{} size:{}".format(v.name, shp, np.prod(shp)))
        print("Total # of weights: {}.".format(self.critic.num_weights))



class Network(object):
    """ 
    Just so the Actor and Critic nets don't have more duplicate code. This way
    they can refer to the similar sets of placeholders (but not the exact same
    ones in memory, just a copy) and I can change it easily here.
    """

    def __init__(self, sess, env, args):
        self.sess = sess
        self.args = args

        # Some random stuff.
        assert len(env.observation_space.shape) == 1
        assert len(env.action_space.shape) == 1
        self.ob_dim = env.observation_space.shape[0]
        self.ac_dim = env.action_space.shape[0]
        self.ac_high = env.action_space.high
        self.ac_low = env.action_space.low

        # Placeholders for minibatches of data. End of episode = 1 for mask.
        self.obs_t_BO    = tf.placeholder(tf.float32, [None,self.ob_dim])
        self.act_t_BA    = tf.placeholder(tf.float32, [None,self.ac_dim])
        self.rew_t_B     = tf.placeholder(tf.float32, [None])
        self.obs_tp1_BO  = tf.placeholder(tf.float32, [None,self.ob_dim])
        self.done_mask_B = tf.placeholder(tf.float32, [None])



class Actor(Network):
    """ Given input as a batch of states, the actor deterministically provides
    us with actions, indicated as "mu" in the paper. 
    
    Since DDPG is off-policy, we can treat the problem of exploration
    independently from the learning algorithm. I add Gaussian noise for this
    purpose in `sample_action`.
    """

    def __init__(self, sess, env, args):
        super().__init__(sess, env, args)

        # The action network and its corresponding target.
        self.actions_BA      = self._build_net(self.obs_t_BO, scope='ActorNet')
        self.actions_targ_BA = self._build_net(self.obs_t_BO, scope='TargActorNet')

        # Collect weights since it's generally convenient to do so.
        self.weights      = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='ActorNet')
        self.weights_targ = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='TargActorNet')
        self.weights_v      = tf.concat([tf.reshape(w, [-1]) for w in self.weights], axis=0)
        self.weights_v_targ = tf.concat([tf.reshape(w, [-1]) for w in self.weights_targ], axis=0)

        # These should be the same among both nets.
        self.w_shapes = [w.get_shape().as_list() for w in self.weights]
        self.num_weights = np.sum([np.prod(sh) for sh in self.w_shapes])

        # Update the target action network. Provide hard and smooth updates.
        target_smooth = []
        target_hard = []
        for var, var_target in zip(sorted(self.weights,      key=lambda v: v.name),
                                   sorted(self.weights_targ, key=lambda v: v.name)):
            update_sm = self.args.tau * var + (1 - self.args.tau) * var_target
            target_smooth.append(var_target.assign(update_sm))
            target_hard.append(var_target.assign(var))
        self.update_target_smooth = tf.group(*target_smooth)
        self.update_target_hard   = tf.group(*target_hard)

        # The Actor _update_. The critic provides dQ/da, which is passed as the
        # initial gradient (`grad_ys`) so that tf.gradients applies the chain
        # rule through the actor. Negate it since the optimizer minimizes a
        # loss, whereas we want to increase Q.
        self.a_grads_BA = tf.placeholder(tf.float32, [None,self.ac_dim])
        self.actor_gradients = tf.gradients(self.actions_BA, self.weights, -self.a_grads_BA)
        self.optimize_a = tf.train.AdamOptimizer(self.args.step_size_actor).\
                    apply_gradients(zip(self.actor_gradients, self.weights))


    def _build_net(self, input_BO, scope):
        """ The Actor network.
        
        Uses ReLUs for all hidden layers, but a tanh to the output to bound the
        action. This follows their 'low-dimensional networks' using 400 and 300
        units for the hidden layers. Set `reuse=False`. I don't use batch
        normalization or their precise weight initialization.
        """
        with tf.variable_scope(scope, reuse=False):
            hidden1 = layers.fully_connected(input_BO,
                    num_outputs=400,
                    weights_initializer=layers.xavier_initializer(),
                    activation_fn=tf.nn.relu)
            hidden2 = layers.fully_connected(hidden1, 
                    num_outputs=300,
                    weights_initializer=layers.xavier_initializer(),
                    activation_fn=tf.nn.relu)
            actions_BA = layers.fully_connected(hidden2,
                    num_outputs=self.ac_dim,
                    weights_initializer=layers.xavier_initializer(),
                    activation_fn=tf.nn.tanh) # Note the tanh!
            # This should broadcast, but haven't tested with ac_dim > 1.
            actions_BA = tf.multiply(actions_BA, self.ac_high)
            return actions_BA


    def sample_action(self, obs, train=True):
        """ Samples an action.
        
        TODO we don't have their exact Gaussian noise injection process because
        I can't figure out how to implement it. :-(

        Parameters
        ----------
        obs: [np.array]
            Represents current states. We assume we need to expand it.
        train: [boolean]
            True means we need to inject noise. False is for test evaluation.
        """
        act = self.sess.run(self.actions_BA, {self.obs_t_BO: obs[None]})
        act = act[0]
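        # Element-wise check so this also works when ac_dim > 1; a saturated
        # tanh can land exactly on the bound, hence the non-strict comparison.
        assert np.all(self.ac_low <= act) and np.all(act <= self.ac_high)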
        if train:
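            # Zero-mean Gaussian noise. `ou_noise_theta` is the mean-reversion
            # rate of an OU process, not a mean, so it is not used as `loc`.
            return act + np.random.normal(loc=0.0,
                    scale=self.args.ou_noise_sigma, size=act.shape)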
        else:
            return act

    
    def update_target_net(self, smooth=True):
        """ 
        Update the target network based on the current weights. Normally we do
        this with smooth=True except for the first step, or unless we want to
        see how poorly hard updates perform generally.
        """
        if smooth:
            self.sess.run(self.update_target_smooth)
        else:
            self.sess.run(self.update_target_hard)


    def update_weights(self, f, a_grads_BA):
        """ Gradient-based update of current actor parameters. """
        feed = {self.obs_t_BO: f['obs_t_BO'], self.a_grads_BA: a_grads_BA}
        _, actor_gradients = self.sess.run([self.optimize_a, \
                self.actor_gradients], feed)
        return actor_gradients



class Critic(Network):
    """ Computes Q(s,a) values to encourage the Actor to learn better policies.

    This is colloquially referred to as 'Q' in the paper.
    """

    def __init__(self, sess, env, args):
        super().__init__(sess, env, args)

        # The critic network (i.e. Q-values) and its corresponding target.
        self.qvals_B      = self._build_net(self.obs_t_BO, self.act_t_BA, scope='CriticNet')
        self.qvals_targ_B = self._build_net(self.obs_t_BO, self.act_t_BA, scope='TargCriticNet')

        # Collect weights since it's generally convenient to do so.
        self.weights      = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='CriticNet')
        self.weights_targ = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='TargCriticNet')
        self.weights_v      = tf.concat([tf.reshape(w, [-1]) for w in self.weights], axis=0)
        self.weights_v_targ = tf.concat([tf.reshape(w, [-1]) for w in self.weights_targ], axis=0)

        # These should be the same among both nets.
        self.w_shapes = [w.get_shape().as_list() for w in self.weights]
        self.num_weights = np.sum([np.prod(sh) for sh in self.w_shapes])

        # Update the target critic network. Provide hard and smooth updates.
        target_smooth = []
        target_hard = []
        for var, var_target in zip(sorted(self.weights,      key=lambda v: v.name),
                                   sorted(self.weights_targ, key=lambda v: v.name)):
            update_sm = self.args.tau * var + (1 - self.args.tau) * var_target
            target_smooth.append(var_target.assign(update_sm))
            target_hard.append(var_target.assign(var))
        self.update_target_smooth = tf.group(*target_smooth)
        self.update_target_hard   = tf.group(*target_hard)

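        # The _critic_ uses y_i, the target for its loss. Depends on `done` mask!
        # NOTE: the paper's target is r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), but
        # `qvals_targ_B` above is built from `obs_t_BO` and `act_t_BA`, so this
        # currently evaluates Q'(s_i, a_i) instead.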
        self.target_val_B = self.rew_t_B + (self.args.Q_gamma * self.qvals_targ_B) * (1 - self.done_mask_B)
        self.l2_error = tf.reduce_mean(tf.square(self.target_val_B - self.qvals_B))
        # TODO l2 weight decay?

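        # Use the built-in Adam optimizer, but might want to try gradient clipping?
        # Restrict the update to the current critic's weights so that the target
        # network only changes through `update_target_net`.
        self.optimize_c = tf.train.AdamOptimizer(self.args.step_size_critic).minimize(
                self.l2_error, var_list=self.weights)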

        # Then return this in the gradient step to provide to the Actor.
        # TODO should check this, it _should_ deal with gradients row-wise, and
        # then the gradient can be summed over B. Where is the summing over B?
        # Is this also equivalent if I did targ = tf.reduce_sum(self.qvals_B)? I
        # think so because it doesn't matter if we sum, action in b-th minibatch
        # only (directly) affects the b-th Q-value and has a gradient, right?
        self.act_grads_BA = tf.gradients(self.qvals_B, self.act_t_BA)


    def _build_net(self, input_BO, acts_BO, scope):
        """ The critic network.
        
        Use ReLUs for all hidden layers. The output consists of one Q-value for
        each element of the batch. Set `reuse=False`. I don't use batch
        normalization or their precise weight initialization.

        Unlike the actor, it uses actions here but they are NOT included in the
        first hidden layer. In addition, we do a tf.reshape to get an output of
        shape (B,), not (B,1). Seems like tf.squeeze doesn't work with `?`.
        """
        with tf.variable_scope(scope, reuse=False):
            hidden1 = layers.fully_connected(input_BO,
                    num_outputs=400,
                    weights_initializer=layers.xavier_initializer(),
                    activation_fn=tf.nn.relu)
            # Insert the concatenation here. This should be fine, I think.
            state_action = tf.concat(axis=1, values=[hidden1, acts_BO])
            hidden2 = layers.fully_connected(state_action,
                    num_outputs=300,
                    weights_initializer=layers.xavier_initializer(),
                    activation_fn=tf.nn.relu)
            qvals_B = layers.fully_connected(hidden2,
                    num_outputs=1,
                    weights_initializer=layers.xavier_initializer(),
                    activation_fn=None)
            return tf.reshape(qvals_B, shape=[-1])


    def update_target_net(self, smooth=True):
        """ 
        Update the target network based on the current weights. Normally we do
        this with smooth=True except for the first step, or unless we want to
        see how poorly hard updates perform generally.
        """
        if smooth:
            self.sess.run(self.update_target_smooth)
        else:
            self.sess.run(self.update_target_hard)


    def update_weights(self, f):
        """ 
        Gradient-based update of current Critic parameters.  Also return the
        action gradients for the Actor update later. This is the dQ/da in the
        paper, and Q is the current Q network, not the target Q network.
        """
        feed = {
            self.obs_t_BO:    f['obs_t_BO'],
            self.act_t_BA:    f['act_t_BA'],
            self.rew_t_B:     f['rew_t_B'],
            self.obs_tp1_BO:  f['obs_tp1_BO'],
            self.done_mask_B: f['done_mask_B']
        }
        action_grads_BA, _, l2_error = self.sess.run([self.act_grads_BA, \
                self.optimize_c, self.l2_error], feed)

        # We assume that the only item in the list has what we want.
        assert len(action_grads_BA) == 1
        return action_grads_BA[0], l2_error


================================================
FILE: ddpg/main.py
================================================
"""
Main script for DDPG code, for CONTINUOUS control environments.

(c) 2017 by Daniel Seita, though mostly building upon other code as usual, with
credit attributed in the DDPG README.
"""

from ddpg import DDPGAgent
import argparse
import gym
import numpy as np
np.set_printoptions(suppress=True, precision=5, edgeitems=10)
import pickle
import sys
import tensorflow as tf
if "../" not in sys.path:
    sys.path.append("../")
from utils import utils_pg as utils
from utils import value_functions as vfuncs
from utils import logz
from utils import policies


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument('envname', type=str)

    # DDPG stuff, all directly from the paper.
    p.add_argument('--batch_size', type=int, default=64) # Used 16 on pixels.
    p.add_argument('--ou_noise_theta', type=float, default=0.15)
    p.add_argument('--ou_noise_sigma', type=float, default=0.2)
    p.add_argument('--Q_gamma', type=float, default=0.99)
    p.add_argument('--Q_l2_weight_decay', type=float, default=1e-2)
    p.add_argument('--replay_size', type=int, default=1000000)
    p.add_argument('--step_size_actor', type=float, default=1e-4)
    p.add_argument('--step_size_critic', type=float, default=1e-3)
    p.add_argument('--tau', type=float, default=0.001)

    # Other stuff that I use for my own or based on other code.
    p.add_argument('--do_not_save', action='store_true')
    p.add_argument('--learning_freq', type=int, default=50)
    p.add_argument('--log_every_t_iter', type=int, default=50)
    p.add_argument('--max_gradient', type=float, default=10.0)
    p.add_argument('--n_iter', type=int, default=10000)
    p.add_argument('--seed', type=int, default=0)
    p.add_argument('--wait_until_rbuffer', type=int, default=1000)
    args = p.parse_args()

    # Handle the log directory and save the arguments.
    logdir = 'out/' +args.envname+ '/seed' +str(args.seed).zfill(2)
    if args.do_not_save:
        logdir = None
    logz.configure_output_dir(logdir)
    if logdir is not None:
        with open(logdir+'/args.pkl', 'wb') as f:
            pickle.dump(args, f)
    print("Saving in logdir: {}".format(logdir))

    # Other stuff for seeding and getting things set up.
    tf.set_random_seed(args.seed)
    np.random.seed(args.seed)
    env = gym.make(args.envname)
    test_env = gym.make(args.envname)
    tf_config = tf.ConfigProto(inter_op_parallelism_threads=1, 
                               intra_op_parallelism_threads=1) 
    sess = tf.Session(config=tf_config)

    ddpg = DDPGAgent(sess, env, test_env, args)
    ddpg.train()


================================================
FILE: ddpg/replay_buffer.py
================================================
import numpy as np
import sys


class ReplayBuffer(object):

    def __init__(self, size, ob_dim, ac_dim):
        """ A replay buffer to store transitions (s,a,r,s') for DDPG.

        - We save with numpy arrays, because there doesn't seem to be a better
          alternative. (I'm not sure if deques are memory efficient.) Create a
          *fixed* numpy array to start. 

        - Use `self.next_idx` to identify the index where the next transition
          will be written. It starts at 0, increases towards the buffer limit,
          then wraps around to 0 as needed. Instead of "throwing transitions
          away" we simply overwrite them.

        - It can be tricky to think about successor states. Since a full
          observation sequence is {s0,a0,r0,s1,a1,r1,s2,...} we should always
          have an equal number of states, actions, and rewards stored, but the
          successor state will be "known" before the action and the reward at
          its corresponding index. I think the easiest way to resolve this is to
          limit the API of `add_sample` so that it doesn't have to worry about
          successor states. Then here, we have a "done" mask which can tell us
          when to ignore the successor. In general, there will still _be_ a
          state stored at the next time index, but the "done" mask informs us
          about if it's actually a successor, or simply the beginning state of
          the next episode.
        
        - Values are 1 in the "done mask" if the next state corresponds to the
          end of an episode when doing env.step(), which is equivalent to saying
          that the next state stored in this buffer is a start state.

        Parameters
        ----------
        size: [int]
            Maximum number of transitions to store in the buffer. When the
            buffer overflows the old memories are over-written.
        ob_dim: [int]
            State dimension, assumes an integer and not a list or tuple.
        ac_dim: [int]
            Action dimension, assumes an integer and not a list or tuple.
        """
        self.next_idx = 0
        self.num_in_buffer = 0
        self.size = size
        self.states_NO  = np.zeros((size, ob_dim), dtype=np.float32)
        self.actions_NA = np.zeros((size, ac_dim), dtype=np.float32)
        self.rewards_N  = np.zeros((size,), dtype=np.float32)
        self.done_N     = np.zeros((size,), dtype=np.uint8)


    def add_sample(self, s, a, r, done):
        """ Stores transition (s,a,r) along with the `done` boolean.
        
        States (`ob`) that exist as a result of `ob = env.reset()` should be
        added like usual states. The action and reward as a result of

            `obsucc, rew, done, _ = env.step(act)` 
            
        should be added to the same index as `ob`. Successor states (`obsucc`
        here) are stored in the _next_ index, which in rare cases wraps around
        the buffer size to be zero. However, we add the successor state in the
        next set of calls outside the code.

        Use `self.next_idx` to store the index, NOT `self.num_in_buffer`. The
        former wraps around, so old samples are automatically overwritten.
        """
        self.states_NO[self.next_idx] = s 
        self.actions_NA[self.next_idx] = a
        self.rewards_N[self.next_idx] = r
        self.done_N[self.next_idx] = int(done)
        self.num_in_buffer += 1
        self.next_idx = (self.next_idx + 1) % self.size


    def sample(self, num):
        """ Sample `num` transitions (s,a,r,s') for a minibatch. 
        
        We can use the minimum of the number we've added so far and the max
        buffer size to determine the range of indices to consider when sampling
        (_without_ replacement). When taking the successor states, we increment
        the indices by one and wrap to zero as needed.

        Don't forget the `done` mask! When it is 1, the state at time t plus one
        ("tp1", i.e. the successor state) is masked out of the loss function,
        because the state stored at that index is actually the start of the
        _next_ episode.

        The `-1` in the `max_index` computation handles annoying corner case of
        having buffer partially filled and avoiding an un-touched index.
        """
        assert num < self.num_in_buffer
        max_index = min(self.num_in_buffer-1, self.size)
        indices = np.random.choice(max_index, num, replace=False)

        # Successor indices are (+1), wrapping index `self.size` back to zero.
        indices_next = (indices + 1) % self.size

        # Get the minibatches for training purposes.
        states_t_BO   = self.states_NO[indices]
        actions_t_BA  = self.actions_NA[indices]
        rewards_t_B   = self.rewards_N[indices]
        states_tp1_BO = self.states_NO[indices_next]
        done_mask_B   = self.done_N[indices]
        return (states_t_BO, actions_t_BA, rewards_t_B, states_tp1_BO, done_mask_B)


================================================
FILE: dqn/README.md
================================================
# Deep Q-Networks

The starter code is from UC Berkeley's Deep Reinforcement Learning class.  These
are their comments:

> See http://rll.berkeley.edu/deeprlcourse/docs/hw3.pdf for instructions
> 
> The starter code was based on an implementation of Q-learning for Atari
> generously provided by Szymon Sidor from OpenAI

The rest of this README contains my comments and results. 

# Usage, Games, etc.

First, here is example usage:

```
python run_dqn_atari.py --game Pong --seed 1 --num_timesteps 30000000 | tee logs_text/Pong_s001.text
```

With these settings, the statistics for plotting data will be stored in the
`logs_pkls/Pong_s001.pkl` file.

Here are some of the `task` entries in the code, ordered by index (i.e. 0, 1,
etc.).

```
Task<env_id=BeamRiderNoFrameskip-v3 trials=2 max_timesteps=40000000 max_seconds=None reward_floor=363.9 reward_ceiling=60000.0>
Task<env_id=BreakoutNoFrameskip-v3 trials=2 max_timesteps=40000000 max_seconds=None reward_floor=1.7 reward_ceiling=800.0>
Task<env_id=EnduroNoFrameskip-v3 trials=2 max_timesteps=40000000 max_seconds=None reward_floor=0.0 reward_ceiling=5000.0>
Task<env_id=PongNoFrameskip-v3 trials=2 max_timesteps=40000000 max_seconds=None reward_floor=-20.7 reward_ceiling=21.0>
Task<env_id=QbertNoFrameskip-v3 trials=2 max_timesteps=40000000 max_seconds=None reward_floor=163.9 reward_ceiling=40000.0>
```

The default for these is 40 million timesteps (`max_timesteps`), but that's not
always needed for the easier games.

The `num_timesteps` parameter corresponds to the number of steps in the
"underlying" environment, *not* the "wrapped" environment. See the stopping
criterion:

```
def stopping_criterion(env, t):
    # notice that here t is the number of steps of the wrapped env,
    # which is different from the number of steps in the underlying env
    return get_wrapper_by_name(env, "Monitor").get_total_steps() >= num_timesteps
```

The `t` here is what I think of as "the number of steps." It's confusing, I
know. There might be a better way to handle this. From now on, when I say
"steps", it refers to the `t`-like number, and *not* the `num_timesteps`
parameter.
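
As a rough sanity check on the two step counts, the sketch below converts a
`num_timesteps` budget into an approximate number of wrapped-environment steps
`t`. It assumes the gap comes entirely from the `skip=4` frame skipping in
`MaxAndSkipEnv` (see `atari_wrappers.py`). The helper name
`approx_wrapped_steps` is made up for illustration and is not part of this
repository.

```python
def approx_wrapped_steps(num_timesteps, skip=4):
    """Approximate wrapped-env steps `t` for an underlying-step budget.

    Assumption: every wrapped step consumes exactly `skip` underlying steps.
    Skips cut short by an episode end and the no-op steps taken on reset are
    ignored, so the real `t` will be somewhat lower.
    """
    return num_timesteps // skip


if __name__ == "__main__":
    for budget in (30000000, 40000000):
        print(budget, "->", approx_wrapped_steps(budget))
```

Under that assumption, 30 million `num_timesteps` corresponds to roughly 7.5
million steps and 40 million to roughly 10 million, which is in line with the
"Training steps" numbers reported per game in the results.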

# Results

Notes:

- Timing results are based on running with an NVIDIA Titan X (Pascal) GPU.
- Scores per Episode indicate scores for every episode.
- Scores per Timestep indicate the score of the current episode at a given
  timestep; each episode requires some number of timesteps for the agent to
  complete it. There are a lot of them, so I take every 10,000th one.
- Blocks mean taking intervals of some size (100) and averaging over each one.
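
The block averaging and subsampling described in these notes can be sketched as
follows. This is only an illustration, not the code in `plot_dqn.py` (which has
its own `smoothed_block(x, n)` function). The names `block_means` and
`every_kth` are hypothetical, and the sketch assumes non-overlapping blocks
with any trailing partial block dropped.

```python
import numpy as np


def block_means(x, n=100):
    """Mean of each consecutive, non-overlapping block of `n` values.

    Assumption: a trailing partial block (fewer than `n` values) is dropped.
    """
    x = np.asarray(x, dtype=np.float64)
    num_blocks = len(x) // n
    return x[:num_blocks * n].reshape(num_blocks, n).mean(axis=1)


def every_kth(x, k=10000):
    """Keep every `k`-th entry, e.g. to thin out per-timestep scores."""
    return np.asarray(x)[::k]


if __name__ == "__main__":
    scores = np.arange(250)              # stand-in data, not real results
    print(block_means(scores, n=100))    # two full blocks of 100 values
    print(every_kth(scores, k=100))      # entries at indices 0, 100, 200
```

Dropping the partial block keeps every plotted point an average over the same
number of values; averaging the remainder separately would be an equally
reasonable choice.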

## Pong

Commands:

```
python run_dqn_atari.py --game Pong --seed 1 --num_timesteps 30000000 | tee logs_text/Pong_s001.text
python run_dqn_atari.py --game Pong --seed 2 --num_timesteps 30000000 | tee logs_text/Pong_s002.text
```

- `num_timesteps`: 30 million
- Training steps: about 7.5 million
- Episodes: about 4200
- Time: about 9.0 hours

![pong](figures/Pong.png?raw=true)

## Breakout

Commands:

```
python run_dqn_atari.py --game Breakout --seed 1 --num_timesteps 40000000 | tee logs_text/Breakout_s001.text
python run_dqn_atari.py --game Breakout --seed 2 --num_timesteps 40000000 | tee logs_text/Breakout_s002.text
```

- `num_timesteps`: 40 million
- Training steps: about 9.7 million
- Episodes: about 10,200
- Time: about 11.8 hours

I have no idea why the performance plummeted after 4500 episodes. I may need to
investigate. Note that a few times we get the absolute perfect highest score
(two full boards cleared); I think the "800" value as the maximum score is wrong
...

![breakout](figures/Breakout.png?raw=true)


## BeamRider

Commands:

```
python run_dqn_atari.py --game BeamRider --seed 1 --num_timesteps 40000000 | tee logs_text/BeamRider_s001.text
python run_dqn_atari.py --game BeamRider --seed 2 --num_timesteps 40000000 | tee logs_text/BeamRider_s002.text
```

- `num_timesteps`: 40 million
- Training steps: about 10.0 million
- Episodes: about 2,500 to 4,000
- Time: about 12.5 hours

The results look different among the seeds since seed 2 apparently had better
performance earlier, meaning its episodes became longer sooner than those of
seed 1. At least they look reasonably good. A3C got roughly 13k on this game.

![beamrider](figures/BeamRider.png?raw=true)


## Enduro

Commands:

```
python run_dqn_atari.py --game Enduro --seed 1 --num_timesteps 40000000 | tee logs_text/Enduro_s001.text
python run_dqn_atari.py --game Enduro --seed 2 --num_timesteps 40000000 | tee logs_text/Enduro_s002.text
```

- `num_timesteps`: 40 million
- Training steps: about 10.0 million
- Episodes: about 2,500 to 3,000
- Time: about 12.5 hours

Ack, what happened to seed 2?!? The first seed matches the DQN result (475.6)
from the Nature script, yet I don't know why the second one failed to learn
much. It's worth noting, though, that the A3C paper actually reported -82.2 on
this game (really?!?). Oh well ...

![enduro](figures/Enduro.png?raw=true)


================================================
FILE: dqn/atari_wrappers.py
================================================
import cv2
import numpy as np
from collections import deque
import gym
from gym import spaces


class NoopResetEnv(gym.Wrapper):
    def __init__(self, env=None, noop_max=30):
        """Sample initial states by taking random number of no-ops on reset.
        No-op is assumed to be action 0.
        """
        super(NoopResetEnv, self).__init__(env)
        self.noop_max = noop_max
        assert env.unwrapped.get_action_meanings()[0] == 'NOOP'

    def _reset(self):
        """ Do no-op action for a number of steps in [1, noop_max]."""
        self.env.reset()
        noops = np.random.randint(1, self.noop_max + 1)
        for _ in range(noops):
            obs, _, _, _ = self.env.step(0)
        return obs

class FireResetEnv(gym.Wrapper):
    def __init__(self, env=None):
        """Take action on reset for environments that are fixed until firing."""
        super(FireResetEnv, self).__init__(env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def _reset(self):
        self.env.reset()
        obs, _, _, _ = self.env.step(1)
        obs, _, _, _ = self.env.step(2)
        return obs

class EpisodicLifeEnv(gym.Wrapper):
    def __init__(self, env=None):
        """Make end-of-life == end-of-episode, but only reset on true game over.
        Done by DeepMind for the DQN and co. since it helps value estimation.
        """
        super(EpisodicLifeEnv, self).__init__(env)
        self.lives = 0
        self.was_real_done  = True
        self.was_real_reset = False

    def _step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        # check current lives, make loss of life terminal,
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives and lives > 0:
            # For Qbert, we sometimes stay in the lives == 0 condition for a
            # few frames, so it's important to require lives > 0; that way we
            # only reset once the environment advertises done.
            done = True
        self.lives = lives
        return obs, reward, done, info

    def _reset(self):
        """Reset only when lives are exhausted.
        This way all states are still reachable even though lives are episodic,
        and the learner need not know about any of this behind-the-scenes.
        """
        if self.was_real_done:
            obs = self.env.reset()
            self.was_real_reset = True
        else:
            # no-op step to advance from terminal/lost life state
            obs, _, _, _ = self.env.step(0)
            self.was_real_reset = False
        self.lives = self.env.unwrapped.ale.lives()
        return obs

class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env=None, skip=4):
        """Return only every `skip`-th frame"""
        super(MaxAndSkipEnv, self).__init__(env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = deque(maxlen=2)
        self._skip       = skip

    def _step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            self._obs_buffer.append(obs)
            total_reward += reward
            if done:
                break

        max_frame = np.max(np.stack(self._obs_buffer), axis=0)

        return max_frame, total_reward, done, info

    def _reset(self):
        """Clear past frame buffer and init. to first obs. from inner env."""
        self._obs_buffer.clear()
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return obs

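def _process_frame84(frame):
    """Convert a raw 210x160 RGB Atari frame to an 84x84x1 grayscale image."""
    img = np.reshape(frame, [210, 160, 3]).astype(np.float32)
    # Weighted sum of the R, G, B channels (luminance) gives grayscale.
    img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
    # cv2.resize takes (width, height), so the result has shape (110, 84).
    resized_screen = cv2.resize(img, (84, 110),  interpolation=cv2.INTER_LINEAR)
    # Crop rows 18..101 to keep an 84x84 region.
    x_t = resized_screen[18:102, :]
    x_t = np.reshape(x_t, [84, 84, 1])
    return x_t.astype(np.uint8)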

class ProcessFrame84(gym.Wrapper):
    def __init__(self, env=None):
        super(ProcessFrame84, self).__init__(env)
        self.observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 1))

    def _step(self, action):
        obs, reward, done, info = self.env.step(action)
        return _process_frame84(obs), reward, done, info

    def _reset(self):
        return _process_frame84(self.env.reset())

class ClippedRewardsWrapper(gym.Wrapper):
    def _step(self, action):
        """Clip each reward to -1, 0, or +1 by taking its sign."""
        obs, reward, done, info = self.env.step(action)
        return obs, np.sign(reward), done, info

def wrap_deepmind_ram(env):
    env = EpisodicLifeEnv(env)
    env = NoopResetEnv(env, noop_max=30)
    env = MaxAndSkipEnv(env, skip=4)
    if 'FIRE' in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)
    env = ClippedRewardsWrapper(env)
    return env

def wrap_deepmind(env):
    assert 'NoFrameskip' in env.spec.id
    env = EpisodicLifeEnv(env)
    env = NoopResetEnv(env, noop_max=30)
    env = MaxAndSkipEnv(env, skip=4)
    if 'FIRE' in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)
    env = ProcessFrame84(env)
    env = ClippedRewardsWrapper(env)
    return env


================================================
FILE: dqn/dqn.py
================================================
import sys
import time
import pickle
import gym.spaces
import itertools
import numpy as np
import random
import tensorflow                as tf
import tensorflow.contrib.layers as layers
from collections import namedtuple
from dqn_utils import *

OptimizerSpec = namedtuple("OptimizerSpec", ["constructor", "kwargs", "lr_schedule"])

def learn(env,
          q_func,
          optimizer_spec,
          session,
          exploration=LinearSchedule(1000000, 0.1),
          stopping_criterion=None,
          replay_buffer_size=1000000,
          batch_size=32,
          gamma=0.99,
          learning_starts=50000,
          learning_freq=4,
          frame_history_len=4,
          target_update_freq=10000,
          grad_norm_clipping=10,
          log_file='./logs_pkls/rewards.pkl'):
    """Run Deep Q-learning algorithm.

    You can specify your own convnet using q_func.

    All schedules are w.r.t. total number of steps taken in the environment.

    Parameters
    ----------
    env: gym.Env
        gym environment to train on.
    q_func: function
        Model to use for computing the q function. It should accept the
        following named arguments:
            img_in: tf.Tensor
                tensorflow tensor representing the input image
            num_actions: int
                number of actions
            scope: str
                scope in which all the model related variables
                should be created
            reuse: bool
                whether previously created variables should be reused.
    optimizer_spec: OptimizerSpec
        Specifying the constructor and kwargs, as well as learning rate schedule
        for the optimizer
    session: tf.Session
        tensorflow ses
SYMBOL INDEX (305 symbols across 36 files)

FILE: bc/bc.py
  function get_tf_session (line 33) | def get_tf_session():
  function load_dataset (line 50) | def load_dataset(args):
  function policy_model (line 119) | def policy_model(data_in, action_dim):
  function get_batch (line 152) | def get_batch(expert_obs, expert_act, batch_size):
  function run_bc (line 165) | def run_bc(session, args, log_dir):
  function run_bc_test (line 237) | def run_bc_test(args, session, policy_fn, x, env):

FILE: bc/load_policy.py
  function load_policy (line 3) | def load_policy(filename):

FILE: bc/plot_bc.py
  function plot_bc_modern (line 36) | def plot_bc_modern(edir):
  function plot_bc_humanoid (line 93) | def plot_bc_humanoid(edir):
  function boring_stuff (line 151) | def boring_stuff(axarr, edir):
  function plot_bc (line 172) | def plot_bc(e):

FILE: bc/run_expert.py
  function main (line 30) | def main():

FILE: bc/tf_util.py
  function sum (line 18) | def sum(x, axis=None, keepdims=False):
  function mean (line 20) | def mean(x, axis=None, keepdims=False):
  function var (line 22) | def var(x, axis=None, keepdims=False):
  function std (line 25) | def std(x, axis=None, keepdims=False):
  function max (line 27) | def max(x, axis=None, keepdims=False):
  function min (line 29) | def min(x, axis=None, keepdims=False):
  function concatenate (line 31) | def concatenate(arrs, axis=0):
  function argmax (line 33) | def argmax(x, axis=None):
  function switch (line 36) | def switch(condition, then_expression, else_expression):
  function l2loss (line 55) | def l2loss(params):
  function lrelu (line 60) | def lrelu(x, leak=0.2):
  function categorical_sample_logits (line 64) | def categorical_sample_logits(X):
  function get_session (line 73) | def get_session():
  function single_threaded_session (line 76) | def single_threaded_session():
  function make_session (line 82) | def make_session(num_cpu):
  function initialize (line 90) | def initialize():
  function eval (line 96) | def eval(expr, feed_dict=None):
  function set_value (line 100) | def set_value(v, val):
  function load_state (line 103) | def load_state(fname):
  function save_state (line 107) | def save_state(fname):
  function normc_initializer (line 117) | def normc_initializer(std=1.0):
  function conv2d (line 125) | def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad=...
  function dense (line 155) | def dense(x, size, name, weight_init=None, bias=True):
  function wndense (line 164) | def wndense(x, size, name, init_scale=1.0):
  function densenobias (line 175) | def densenobias(x, size, name, weight_init=None):
  function dropout (line 178) | def dropout(x, pkeep, phase=None, mask=None):
  function batchnorm (line 185) | def batchnorm(x, name, phase, updates, gamma=0.96):
  function function (line 213) | def function(inputs, outputs, updates=None, givens=None):
  class _Function (line 223) | class _Function(object):
    method __init__ (line 224) | def __init__(self, inputs, outputs, updates, givens, check_nan=False):
    method __call__ (line 232) | def __call__(self, *inputvals):
  function mem_friendly_function (line 242) | def mem_friendly_function(nondata_inputs, data_inputs, outputs, batch_si...
  class _MemFriendlyFunction (line 249) | class _MemFriendlyFunction(object):
    method __init__ (line 250) | def __init__(self, nondata_inputs, data_inputs, outputs, batch_size):
    method __call__ (line 255) | def __call__(self, *inputvals):
  class Module (line 281) | class Module(object):
    method __init__ (line 282) | def __init__(self, name):
    method __call__ (line 287) | def __call__(self, *args):
    method _call (line 303) | def _call(self, *args):
    method trainable_variables (line 307) | def trainable_variables(self):
    method variables (line 312) | def variables(self):
  function module (line 317) | def module(name):
  function get_parents (line 333) | def get_parents(node):
  function topsorted (line 336) | def topsorted(outputs):
  function var_shape (line 377) | def var_shape(x):
  function numel (line 383) | def numel(x):
  function intprod (line 386) | def intprod(x):
  function flatgrad (line 389) | def flatgrad(loss, var_list):
  class SetFromFlat (line 394) | class SetFromFlat(object):
    method __init__ (line 395) | def __init__(self, var_list, dtype=tf.float32):
    method __call__ (line 408) | def __call__(self, theta):
  class GetFlat (line 411) | class GetFlat(object):
    method __init__ (line 412) | def __init__(self, var_list):
    method __call__ (line 414) | def __call__(self):
  function fancy_slice_2d (line 422) | def fancy_slice_2d(X, inds0, inds1):
  function scope_vars (line 435) | def scope_vars(scope, trainable_only):
  function lengths_to_mask (line 445) | def lengths_to_mask(lengths_b, max_length):
  function in_session (line 463) | def in_session(f):
  function get_placeholder (line 472) | def get_placeholder(name, dtype, shape):
  function get_placeholder_cached (line 482) | def get_placeholder_cached(name):
  function flattenallbut0 (line 485) | def flattenallbut0(x):
  function reset (line 488) | def reset():

FILE: ddpg/ddpg.py
  class DDPGAgent (line 20) | class DDPGAgent(object):
    method __init__ (line 22) | def __init__(self, sess, env, test_env, args):
    method train (line 42) | def train(self):
    method _do_rollouts (line 103) | def _do_rollouts(self):
    method _debug_print (line 133) | def _debug_print(self):
  class Network (line 150) | class Network(object):
    method __init__ (line 157) | def __init__(self, sess, env, args):
  class Actor (line 178) | class Actor(Network):
    method __init__ (line 187) | def __init__(self, sess, env, args):
    method _build_net (line 224) | def _build_net(self, input_BO, scope):
    method sample_action (line 250) | def sample_action(self, obs, train=True):
    method update_target_net (line 273) | def update_target_net(self, smooth=True):
    method update_weights (line 285) | def update_weights(self, f, a_grads_BA):
  class Critic (line 294) | class Critic(Network):
    method __init__ (line 300) | def __init__(self, sess, env, args):
    method _build_net (line 345) | def _build_net(self, input_BO, acts_BO, scope):
    method update_target_net (line 374) | def update_target_net(self, smooth=True):
    method update_weights (line 386) | def update_weights(self, f):

FILE: ddpg/replay_buffer.py
  class ReplayBuffer (line 5) | class ReplayBuffer(object):
    method __init__ (line 7) | def __init__(self, size, ob_dim, ac_dim):
    method add_sample (line 53) | def add_sample(self, s, a, r, done):
    method sample (line 77) | def sample(self, num):

FILE: dqn/atari_wrappers.py
  class NoopResetEnv (line 8) | class NoopResetEnv(gym.Wrapper):
    method __init__ (line 9) | def __init__(self, env=None, noop_max=30):
    method _reset (line 17) | def _reset(self):
  class FireResetEnv (line 25) | class FireResetEnv(gym.Wrapper):
    method __init__ (line 26) | def __init__(self, env=None):
    method _reset (line 32) | def _reset(self):
  class EpisodicLifeEnv (line 38) | class EpisodicLifeEnv(gym.Wrapper):
    method __init__ (line 39) | def __init__(self, env=None):
    method _step (line 48) | def _step(self, action):
    method _reset (line 62) | def _reset(self):
  class MaxAndSkipEnv (line 77) | class MaxAndSkipEnv(gym.Wrapper):
    method __init__ (line 78) | def __init__(self, env=None, skip=4):
    method _step (line 85) | def _step(self, action):
    method _reset (line 99) | def _reset(self):
  function _process_frame84 (line 106) | def _process_frame84(frame):
  class ProcessFrame84 (line 114) | class ProcessFrame84(gym.Wrapper):
    method __init__ (line 115) | def __init__(self, env=None):
    method _step (line 119) | def _step(self, action):
    method _reset (line 123) | def _reset(self):
  class ClippedRewardsWrapper (line 126) | class ClippedRewardsWrapper(gym.Wrapper):
    method _step (line 127) | def _step(self, action):
  function wrap_deepmind_ram (line 131) | def wrap_deepmind_ram(env):
  function wrap_deepmind (line 140) | def wrap_deepmind(env):

FILE: dqn/dqn.py
  function learn (line 15) | def learn(env,

FILE: dqn/dqn_utils.py
  function huber_loss (line 8) | def huber_loss(x, delta=1.0):
  function sample_n_unique (line 16) | def sample_n_unique(sampling_f, n):
  class Schedule (line 27) | class Schedule(object):
    method value (line 28) | def value(self, t):
  class ConstantSchedule (line 32) | class ConstantSchedule(object):
    method __init__ (line 33) | def __init__(self, value):
    method value (line 42) | def value(self, t):
  function linear_interpolation (line 46) | def linear_interpolation(l, r, alpha):
  class PiecewiseSchedule (line 49) | class PiecewiseSchedule(object):
    method __init__ (line 50) | def __init__(self, endpoints, interpolation=linear_interpolation, outs...
    method value (line 74) | def value(self, t):
  class LinearSchedule (line 85) | class LinearSchedule(object):
    method __init__ (line 86) | def __init__(self, schedule_timesteps, final_p, initial_p=1.0):
    method value (line 104) | def value(self, t):
  function compute_exponential_averages (line 109) | def compute_exponential_averages(variables, decay):
  function minimize_and_clip (line 130) | def minimize_and_clip(optimizer, objective, var_list, clip_val=10):
  function initialize_interdependent_variables (line 141) | def initialize_interdependent_variables(session, vars_list, feed_dict):
  function get_wrapper_by_name (line 164) | def get_wrapper_by_name(env, classname):
  class ReplayBuffer (line 174) | class ReplayBuffer(object):
    method __init__ (line 175) | def __init__(self, size, frame_history_len):
    method can_sample (line 212) | def can_sample(self, batch_size):
    method _encode_sample (line 216) | def _encode_sample(self, idxes):
    method sample (line 226) | def sample(self, batch_size):
    method encode_recent_observation (line 263) | def encode_recent_observation(self):
    method _encode_observation (line 276) | def _encode_observation(self, idx):
    method store_frame (line 302) | def store_frame(self, frame):
    method store_effect (line 330) | def store_effect(self, idx, action, reward, done):

FILE: dqn/plot_dqn.py
  function smoothed_block (line 19) | def smoothed_block(x, n):

FILE: dqn/run_dqn_atari.py
  function atari_model (line 17) | def atari_model(img_in, num_actions, scope, reuse=False):
  function atari_learn (line 33) | def atari_learn(env,
  function get_available_gpus (line 86) | def get_available_gpus():
  function set_global_seeds (line 92) | def set_global_seeds(i):
  function get_session (line 103) | def get_session():
  function get_env (line 123) | def get_env(task, seed):
  function main (line 134) | def main():

FILE: dqn/run_dqn_ram.py
  function atari_model (line 15) | def atari_model(ram_in, num_actions, scope, reuse=False):
  function atari_learn (line 27) | def atari_learn(env,
  function get_available_gpus (line 77) | def get_available_gpus():
  function set_global_seeds (line 82) | def set_global_seeds(i):
  function get_session (line 92) | def get_session():
  function get_env (line 101) | def get_env(seed):
  function main (line 113) | def main():

FILE: es/es.py
  class ESAgent (line 23) | class ESAgent:
    method __init__ (line 25) | def __init__(self, session, args, log_dir=None, continuous=True):
    method _make_network (line 77) | def _make_network(self, data_in, out_dim):
    method _compute_return (line 101) | def _compute_return(self, test=False, store_info=False):
    method _print_summary (line 147) | def _print_summary(self):
    method run_es (line 160) | def run_es(self):
    method test (line 247) | def test(self, just_one=True):
    method generate_rollout_data (line 286) | def generate_rollout_data(self, weights, num_rollouts):

FILE: es/logz.py
  function colorize (line 30) | def colorize(string, color, bold=False, highlight=False):
  class G (line 38) | class G:
  function configure_output_dir (line 45) | def configure_output_dir(d=None):
  function log_tabular (line 61) | def log_tabular(key, val):
  function dump_tabular (line 73) | def dump_tabular():

FILE: es/optimizers.py
  class Optimizer (line 10) | class Optimizer(object):
    method __init__ (line 11) | def __init__(self, pi):
    method update (line 16) | def update(self, globalg):
    method _compute_step (line 24) | def _compute_step(self, globalg):
  class SGD (line 28) | class SGD(Optimizer):
    method __init__ (line 29) | def __init__(self, pi, stepsize, momentum=0.9):
    method _compute_step (line 34) | def _compute_step(self, globalg):
  class Adam (line 40) | class Adam(Optimizer):
    method __init__ (line 41) | def __init__(self, pi, stepsize, beta1=0.9, beta2=0.999, epsilon=1e-08):
    method _compute_step (line 50) | def _compute_step(self, globalg):

FILE: es/plot.py
  function plot_one_dir (line 55) | def plot_one_dir(args, directory):

FILE: es/toy_es.py
  function f (line 26) | def f(w, sol):
  function run_es (line 31) | def run_es(args):

FILE: es/utils.py
  function compute_ranks (line 12) | def compute_ranks(x):
  function compute_centered_ranks (line 26) | def compute_centered_ranks(x):
  function get_tf_session (line 60) | def get_tf_session():
  function normc_initializer (line 77) | def normc_initializer(std=1.0):

FILE: g_learning/G-Learning.py
  class GLearningAgent (line 30) | class GLearningAgent():
    method __init__ (line 32) | def __init__(self, env, k):
    method policy_exploration (line 54) | def policy_exploration(self, state, epsilon=0.0):
    method alpha_schedule (line 72) | def alpha_schedule(self, t, state, action):
    method beta_schedule (line 88) | def beta_schedule(self, t):
    method g_learning (line 104) | def g_learning(self, num_episodes, max_ep_steps=10000, discount=1.0, e...

FILE: lib/envs/blackjack.py
  function cmp (line 5) | def cmp(a, b):
  function draw_card (line 12) | def draw_card(np_random):
  function draw_hand (line 16) | def draw_hand(np_random):
  function usable_ace (line 20) | def usable_ace(hand):  # Does this hand have a usable ace?
  function sum_hand (line 24) | def sum_hand(hand):  # Return current hand total
  function is_bust (line 30) | def is_bust(hand):  # Is this hand a bust?
  function score (line 34) | def score(hand):  # What is the score of this hand (0 if bust)
  function is_natural (line 38) | def is_natural(hand):  # Is this hand a natural blackjack?
  class BlackjackEnv (line 42) | class BlackjackEnv(gym.Env):
    method __init__ (line 67) | def __init__(self, natural=False):
    method _seed (line 82) | def _seed(self, seed=None):
    method _step (line 86) | def _step(self, action):
    method _get_obs (line 105) | def _get_obs(self):
    method _reset (line 108) | def _reset(self):

FILE: lib/envs/cliff_walking.py
  class CliffWalkingEnv (line 15) | class CliffWalkingEnv(discrete.DiscreteEnv):
    method _limit_coordinates (line 19) | def _limit_coordinates(self, coord):
    method _calculate_transition_prob (line 26) | def _calculate_transition_prob(self, current, delta):
    method __init__ (line 42) | def __init__(self):
    method _render (line 68) | def _render(self, mode='human', close=False):

FILE: lib/envs/gridworld.py
  class GridworldEnv (line 10) | class GridworldEnv(discrete.DiscreteEnv):
    method __init__ (line 32) | def __init__(self, shape=[4,4]):
    method _render (line 85) | def _render(self, mode='human', close=False):

FILE: lib/envs/two_room_domain.py
  class TwoRooms (line 53) | class TwoRooms:
    method __init__ (line 55) | def __init__(self, length=9):
    method _init_grid (line 64) | def _init_grid(self):
    method _check_coords_and_move (line 90) | def _check_coords_and_move(self, coord):
    method step (line 103) | def step(self, action):
    method reset (line 136) | def reset(self):
    method render (line 141) | def render(self):
    method action_space_sample (line 147) | def action_space_sample(self):
    method _pretty_print (line 153) | def _pretty_print(self):
  function test_nine_rooms (line 157) | def test_nine_rooms():

FILE: lib/envs/windy_gridworld.py
  class WindyGridworldEnv (line 11) | class WindyGridworldEnv(discrete.DiscreteEnv):
    method _limit_coordinates (line 15) | def _limit_coordinates(self, coord):
    method _calculate_transition_prob (line 22) | def _calculate_transition_prob(self, current, delta, winds):
    method __init__ (line 29) | def __init__(self):
    method _render (line 56) | def _render(self, mode='human', close=False):

FILE: lib/plotting.py
  function plot_cost_to_go_mountain_car (line 10) | def plot_cost_to_go_mountain_car(env, estimator, num_tiles=20):
  function plot_value_function (line 28) | def plot_value_function(V, title="Value Function"):
  function plot_episode_stats (line 63) | def plot_episode_stats(stats, smoothing_window=10, noshow=False, dosave=...

FILE: q_learning/Q-Learning.py
  class QLearningAgent (line 51) | class QLearningAgent():
    method __init__ (line 53) | def __init__(self, env):
    method policy_exploration (line 70) | def policy_exploration(self, state, epsilon=0.0):
    method alpha_schedule (line 88) | def alpha_schedule(self, t, state, action):
    method q_learning (line 105) | def q_learning(self, num_episodes, max_ep_steps=10000, discount=1.0, e...

FILE: trpo/fxn_approx.py
  class LinearValueFunction (line 19) | class LinearValueFunction(object):
    method fit (line 23) | def fit(self, X, y):
    method predict (line 37) | def predict(self, X):
    method preproc (line 44) | def preproc(self, X):
  class NnValueFunction (line 49) | class NnValueFunction(object):
    method __init__ (line 52) | def __init__(self, session, ob_dim=None, n_epochs=10, stepsize=1e-3):
    method fit (line 71) | def fit(self, X, y):
    method predict (line 98) | def predict(self, X):
    method preproc (line 107) | def preproc(self, X):

FILE: trpo/main.py
  function run_trpo_algorithm (line 30) | def run_trpo_algorithm(args, vf_params, logdir):

FILE: trpo/trpo.py
  class TRPO (line 26) | class TRPO:
    method __init__ (line 29) | def __init__(self, args, sess, env, vf_params):
    method update_policy (line 155) | def update_policy(self, paths, infodict):
    method _flatgrad (line 239) | def _flatgrad(self, loss, var_list):
    method _act (line 260) | def _act(self, ob):
    method get_paths (line 277) | def get_paths(self, seed_iter, env):
    method compute_advantages (line 325) | def compute_advantages(self, paths):
    method fit_value_function (line 350) | def fit_value_function(self, paths, vfdict):
    method log_diagnostics (line 363) | def log_diagnostics(self, paths, infodict, vfdict):

FILE: trpo/utils_trpo.py
  function cg (line 14) | def cg(f_Ax, b, cg_iters=10, verbose=False, residual_tol=1e-10):
  function backtracking_line_search (line 68) | def backtracking_line_search(f, x, fullstep, expected_improve_rate,

FILE: utils/logz.py
  function colorize (line 29) | def colorize(string, color, bold=False, highlight=False):
  class G (line 38) | class G:
  function configure_output_dir (line 46) | def configure_output_dir(d=None):
  function log_tabular (line 63) | def log_tabular(key, val):
  function dump_tabular (line 76) | def dump_tabular():

FILE: utils/policies.py
  class StochasticPolicy (line 23) | class StochasticPolicy(object):
    method __init__ (line 25) | def __init__(self, sess, ob_dim, ac_dim):
    method sample_action (line 32) | def sample_action(self, x):
  class GibbsPolicy (line 37) | class GibbsPolicy(StochasticPolicy):
    method __init__ (line 41) | def __init__(self, sess, ob_dim, ac_dim):
    method sample_action (line 87) | def sample_action(self, ob):
    method update_policy (line 91) | def update_policy(self, ob_no, ac_n, std_adv_n, stepsize):
    method kldiv_and_entropy (line 106) | def kldiv_and_entropy(self, ob_no, oldlogits_na):
  class GaussianPolicy (line 117) | class GaussianPolicy(StochasticPolicy):
    method __init__ (line 121) | def __init__(self, sess, ob_dim, ac_dim):
    method sample_action (line 167) | def sample_action(self, ob):
    method update_policy (line 171) | def update_policy(self, ob_no, ac_n, std_adv_n, stepsize):
    method kldiv_and_entropy (line 190) | def kldiv_and_entropy(self, ob_no, oldmean_na, oldlogstd_a):

FILE: utils/utils_pg.py
  function gauss_log_prob_1 (line 12) | def gauss_log_prob_1(mu, logstd, x):
  function gauss_log_prob (line 23) | def gauss_log_prob(mu, logstd, x):
  function gauss_KL_1 (line 43) | def gauss_KL_1(mu1, logstd1, mu2, logstd2):
  function gauss_KL (line 56) | def gauss_KL(mu1, logstd1, mu2, logstd2):
  function normc_initializer (line 80) | def normc_initializer(std=1.0):
  function dense (line 89) | def dense(x, size, name, weight_init=None):
  function fancy_slice_2d (line 96) | def fancy_slice_2d(X, inds0, inds1):
  function discount (line 106) | def discount(x, gamma):
  function lrelu (line 114) | def lrelu(x, leak=0.2):
  function explained_variance_1d (line 121) | def explained_variance_1d(ypred,y):
  function categorical_sample_logits (line 131) | def categorical_sample_logits(logits):
  function pathlength (line 146) | def pathlength(path):

FILE: utils/value_functions.py
  class LinearValueFunction (line 13) | class LinearValueFunction(object):
    method __init__ (line 16) | def __init__(self):
    method fit (line 19) | def fit(self, X, y):
    method predict (line 33) | def predict(self, X):
    method preproc (line 40) | def preproc(self, X):
  class NnValueFunction (line 45) | class NnValueFunction(object):
    method __init__ (line 48) | def __init__(self, session, ob_dim=None, n_epochs=20, stepsize=1e-3):
    method fit (line 86) | def fit(self, X, y, session=None):
    method predict (line 103) | def predict(self, X):
    method preproc (line 113) | def preproc(self, X):

FILE: vpg/main.py
  function run_vpg (line 29) | def run_vpg(args, vf_params, logdir, env, sess, continuous_control):
Condensed preview — 89 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (3,249K chars).
[
  {
    "path": ".gitignore",
    "chars": 92,
    "preview": "__pycache__\n*.pyc\n*.swp\n*.swo\n.DS_Store\n\nvpg/outputs/*/*/a.diff\n\nbc/data/*\nbc/expert_data/*\n"
  },
  {
    "path": "LICENSE",
    "chars": 1069,
    "preview": "MIT License\n\nCopyright (c) 2017 Daniel Seita\n\nPermission is hereby granted, free of charge, to any person obtaining a co"
  },
  {
    "path": "README.md",
    "chars": 4121,
    "preview": "Note: this repository has not been updated in years and I don't have plans to do so. I recommend using rlpyt for future "
  },
  {
    "path": "bc/README.md",
    "chars": 4093,
    "preview": "# Behavioral Cloning\n\n## Main Idea\n\nThis runs Behavioral Cloning (BC) on MuJoCo environments, with settings inspired\nby "
  },
  {
    "path": "bc/bash_scripts/demo.bash",
    "chars": 174,
    "preview": "#!/bin/bash\nset -eux\nfor e in Hopper-v1 Ant-v1 HalfCheetah-v1 Humanoid-v1 Reacher-v1 Walker2d-v1\ndo\n    python run_exper"
  },
  {
    "path": "bc/bash_scripts/gen_exp_data.sh",
    "chars": 1768,
    "preview": "python run_expert.py experts/Reacher-v1.pkl Reacher-v1 --save --num_rollouts 4\npython run_expert.py experts/Reacher-v1.p"
  },
  {
    "path": "bc/bash_scripts/runbc_allmujoco.sh",
    "chars": 933,
    "preview": "#!/bin/bash\nset -eux\nfor e in 4 11 18 25; do\n    for s in 0 1 2; do\n        python bc.py Ant-v1         $e --test_rollou"
  },
  {
    "path": "bc/bc.py",
    "chars": 12382,
    "preview": "\"\"\"\n(c) June 2017 by Daniel Seita\n\nBehavioral cloning (continuous actions only).  For results, see the README(s)\nnearby."
  },
  {
    "path": "bc/load_policy.py",
    "chars": 2511,
    "preview": "import pickle, tensorflow as tf, tf_util, numpy as np\n\ndef load_policy(filename):\n    with open(filename, 'rb') as f:\n  "
  },
  {
    "path": "bc/plot_bc.py",
    "chars": 7597,
    "preview": "\"\"\"\n(c) April 2017 by Daniel Seita\n\nCode for plotting behavioral cloning. No need to use command line arguments,\njust ru"
  },
  {
    "path": "bc/random_logs/gen_exp_data.text",
    "chars": 103137,
    "preview": "loading and building expert policy\n('obs', (1, 11), (1, 11))\nloaded and built\n('roll/traj', 0)\n('roll/traj', 1)\n('roll/t"
  },
  {
    "path": "bc/run_expert.py",
    "chars": 4058,
    "preview": "\"\"\"\nCode to load an expert policy and generate roll-out data for behavioral cloning.\nExample usage:\n\n    python run_expe"
  },
  {
    "path": "bc/tf_util.py",
    "chars": 17792,
    "preview": "import numpy as np\nimport tensorflow as tf # pylint: ignore-module\n#import builtins\nimport functools\nimport copy\nimport "
  },
  {
    "path": "ddpg/README.md",
    "chars": 660,
    "preview": "# Deep Deterministic Policy Gradients\n\n- Python 3.5\n- Tensorflow 1.2\n\nI'm following the original DDPG paper as much as p"
  },
  {
    "path": "ddpg/ddpg.py",
    "chars": 17541,
    "preview": "\"\"\"\nDeep Deterministic Policy Gradients\n\nMake Actor and Critic subclasses of a NNet class? Not sure ...  for now, I'll\np"
  },
  {
    "path": "ddpg/main.py",
    "chars": 2600,
    "preview": "\"\"\"\nMain script for DDPG code, for CONTINUOUS control environments.\n\n(c) 2017 by Daniel Seita, though mostly building up"
  },
  {
    "path": "ddpg/replay_buffer.py",
    "chars": 5006,
    "preview": "import numpy as np\nimport sys\n\n\nclass ReplayBuffer(object):\n\n    def __init__(self, size, ob_dim, ac_dim):\n        \"\"\" A"
  },
  {
    "path": "dqn/README.md",
    "chars": 4978,
    "preview": "# Deep Q-Networks\n\nThe starter code is from UC Berkeley's Deep Reinforcement Learning class.  These\nare their comments:\n"
  },
  {
    "path": "dqn/atari_wrappers.py",
    "chars": 5290,
    "preview": "import cv2\nimport numpy as np\nfrom collections import deque\nimport gym\nfrom gym import spaces\n\n\nclass NoopResetEnv(gym.W"
  },
  {
    "path": "dqn/dqn.py",
    "chars": 15983,
    "preview": "import sys\nimport time\nimport pickle\nimport gym.spaces\nimport itertools\nimport numpy as np\nimport random\nimport tensorfl"
  },
  {
    "path": "dqn/dqn_utils.py",
    "chars": 13985,
    "preview": "\"\"\"This file includes a collection of utility functions that are useful for\nimplementing DQN.\"\"\"\nimport gym\nimport tenso"
  },
  {
    "path": "dqn/logs_pkls/BeamRider_s001.pkl",
    "chars": 128273,
    "preview": "(lp0\n(I60000\ncnumpy.core.multiarray\nscalar\np1\n(cnumpy\ndtype\np2\n(S'f8'\np3\nI0\nI1\ntp4\nRp5\n(I3\nS'<'\np6\nNNNI-1\nI-1\nI0\ntp7\nbS'"
  },
  {
    "path": "dqn/logs_pkls/BeamRider_s002.pkl",
    "chars": 122307,
    "preview": "(lp0\n(I60000\ncnumpy.core.multiarray\nscalar\np1\n(cnumpy\ndtype\np2\n(S'f8'\np3\nI0\nI1\ntp4\nRp5\n(I3\nS'<'\np6\nNNNI-1\nI-1\nI0\ntp7\nbS'"
  },
  {
    "path": "dqn/logs_pkls/Breakout_s001.pkl",
    "chars": 158743,
    "preview": "(lp0\n(I60000\ncnumpy.core.multiarray\nscalar\np1\n(cnumpy\ndtype\np2\n(S'f8'\np3\nI0\nI1\ntp4\nRp5\n(I3\nS'<'\np6\nNNNI-1\nI-1\nI0\ntp7\nbS'"
  },
  {
    "path": "dqn/logs_pkls/Breakout_s002.pkl",
    "chars": 164673,
    "preview": "(lp0\n(I60000\ncnumpy.core.multiarray\nscalar\np1\n(cnumpy\ndtype\np2\n(S'f8'\np3\nI0\nI1\ntp4\nRp5\n(I3\nS'<'\np6\nNNNI-1\nI-1\nI0\ntp7\nbS'"
  },
  {
    "path": "dqn/logs_pkls/Enduro_s001.pkl",
    "chars": 112402,
    "preview": "(lp0\n(I60000\ncnumpy.core.multiarray\nscalar\np1\n(cnumpy\ndtype\np2\n(S'f8'\np3\nI0\nI1\ntp4\nRp5\n(I3\nS'<'\np6\nNNNI-1\nI-1\nI0\ntp7\nbS'"
  },
  {
    "path": "dqn/logs_pkls/Enduro_s002.pkl",
    "chars": 102946,
    "preview": "(lp0\n(I60000\ncnumpy.core.multiarray\nscalar\np1\n(cnumpy\ndtype\np2\n(S'f8'\np3\nI0\nI1\ntp4\nRp5\n(I3\nS'<'\np6\nNNNI-1\nI-1\nI0\ntp7\nbS'"
  },
  {
    "path": "dqn/logs_pkls/Pong_s001.pkl",
    "chars": 100273,
    "preview": "(lp0\n(I60000\ncnumpy.core.multiarray\nscalar\np1\n(cnumpy\ndtype\np2\n(S'f8'\np3\nI0\nI1\ntp4\nRp5\n(I3\nS'<'\np6\nNNNI-1\nI-1\nI0\ntp7\nbS'"
  },
  {
    "path": "dqn/logs_pkls/Pong_s002.pkl",
    "chars": 101616,
    "preview": "(lp0\n(I60000\ncnumpy.core.multiarray\nscalar\np1\n(cnumpy\ndtype\np2\n(S'f8'\np3\nI0\nI1\ntp4\nRp5\n(I3\nS'<'\np6\nNNNI-1\nI-1\nI0\ntp7\nbS'"
  },
  {
    "path": "dqn/logs_text/BeamRider_s001.text",
    "chars": 217216,
    "preview": "('AVAILABLE GPUS: ', [u'device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0'])\ntask = Task<env_id=BeamRiderNoFra"
  },
  {
    "path": "dqn/logs_text/BeamRider_s002.text",
    "chars": 218673,
    "preview": "('AVAILABLE GPUS: ', [u'device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0'])\ntask = Task<env_id=BeamRiderNoFra"
  },
  {
    "path": "dqn/logs_text/Breakout_s001.text",
    "chars": 209462,
    "preview": "('AVAILABLE GPUS: ', [u'device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0'])\ntask = Task<env_id=BreakoutNoFram"
  },
  {
    "path": "dqn/logs_text/Breakout_s002.text",
    "chars": 208892,
    "preview": "('AVAILABLE GPUS: ', [u'device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0'])\ntask = Task<env_id=BreakoutNoFram"
  },
  {
    "path": "dqn/logs_text/Enduro_s001.text",
    "chars": 213042,
    "preview": "('AVAILABLE GPUS: ', [u'device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0'])\ntask = Task<env_id=EnduroNoFrames"
  },
  {
    "path": "dqn/logs_text/Enduro_s002.text",
    "chars": 211866,
    "preview": "('AVAILABLE GPUS: ', [u'device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0'])\ntask = Task<env_id=EnduroNoFrames"
  },
  {
    "path": "dqn/logs_text/Pong_s001.text",
    "chars": 159682,
    "preview": "('AVAILABLE GPUS: ', [u'device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0'])\ntask = Task<env_id=PongNoFrameski"
  },
  {
    "path": "dqn/logs_text/Pong_s002.text",
    "chars": 159626,
    "preview": "('AVAILABLE GPUS: ', [u'device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0'])\ntask = Task<env_id=PongNoFrameski"
  },
  {
    "path": "dqn/plot_dqn.py",
    "chars": 9789,
    "preview": "import matplotlib.pyplot as plt\nimport numpy as np\nimport pickle\nimport seaborn as sns\nplt.style.use('seaborn-darkgrid')"
  },
  {
    "path": "dqn/run_dqn_atari.py",
    "chars": 5459,
    "preview": "import argparse\nimport gym\nfrom gym import wrappers\nimport os.path as osp\nimport random\nimport numpy as np\nimport tensor"
  },
  {
    "path": "dqn/run_dqn_ram.py",
    "chars": 3780,
    "preview": "import argparse\nimport gym\nfrom gym import wrappers\nimport os.path as osp\nimport random\nimport numpy as np\nimport tensor"
  },
  {
    "path": "es/README.md",
    "chars": 4274,
    "preview": "# Evolution Strategies\n\nInspired by [recent work from OpenAI][1].\n\n# Code Usage\n\nSee the bash scripts.\n\nThe ES code I am"
  },
  {
    "path": "es/bash_scripts/InvertedPendulum-v1.sh",
    "chars": 245,
    "preview": "#!/bin/bash\nclear\npython main.py InvertedPendulum-v1 \\\n    --es_iters 700 \\\n    --lrate_es 0.005 \\\n    --log_every_t_ite"
  },
  {
    "path": "es/es.py",
    "chars": 15529,
    "preview": "\"\"\"\nThis is Natural Evolution Strategies, designed to run on one computer and not a\ncluster.\n\n(c) May 2017 by Daniel Sei"
  },
  {
    "path": "es/logz.py",
    "chars": 3037,
    "preview": "\"\"\"\n\nSome simple logging functionality, inspired by rllab's logging.\nAssumes that each diagnostic gets logged each itera"
  },
  {
    "path": "es/main.py",
    "chars": 2542,
    "preview": "\"\"\"\nUse this script for setting the arguments.\n\n(c) May 2017 by Daniel Seita\n\"\"\"\n\nimport argparse\nimport logz\nimport os\n"
  },
  {
    "path": "es/optimizers.py",
    "chars": 1757,
    "preview": "\"\"\"\nThis code was written by Jonathan Ho. See:\n\nhttps://github.com/openai/evolution-strategies-starter/blob/master/es_di"
  },
  {
    "path": "es/plot.py",
    "chars": 4221,
    "preview": "\"\"\"\nTo plot this, you need to provide the experiment directory plus an output stem.\nI use this for InvertedPendulum:\n\n  "
  },
  {
    "path": "es/test.py",
    "chars": 2059,
    "preview": "\"\"\"\nThis will load in snapshots of weights generated from Evolution Strategies and\nevaluate the agent by generating roll"
  },
  {
    "path": "es/toy_es.py",
    "chars": 2850,
    "preview": "\"\"\"\nBasic evolution strategies, based on Andrej Karpathy's starter code. We have the\nactual solutions here only for dida"
  },
  {
    "path": "es/utils.py",
    "chars": 3228,
    "preview": "\"\"\"\nRandom supporting methods.\n\n(c) May 2017 by Daniel Seita\n\"\"\"\n\nimport numpy as np\nimport sys\nimport tensorflow as tf\n"
  },
  {
    "path": "g_learning/G-Learning.py",
    "chars": 6427,
    "preview": "\"\"\"\n(c) 2017 by Daniel Seita\n\nG-Learning, as described in:\n\n    Taming the Noise in Reinforcement Learning via Soft Upda"
  },
  {
    "path": "g_learning/README.md",
    "chars": 3104,
    "preview": "# Standard Tabular G-learning (not Q-Learning!)\n\nThis is mainly to benchmark against tabular Q-learning, [which I've imp"
  },
  {
    "path": "g_learning/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lib/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lib/envs/README.md",
    "chars": 199,
    "preview": "Different (custom) environments that we can load in and test. \n\nFrom Denny Britz:\n\n- `blackjack.py`\n- `cliff_walking.py`"
  },
  {
    "path": "lib/envs/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lib/envs/blackjack.py",
    "chars": 4136,
    "preview": "import gym\nfrom gym import spaces\nfrom gym.utils import seeding\n\ndef cmp(a, b):\n    return int((a > b)) - int((a < b))\n\n"
  },
  {
    "path": "lib/envs/cliff_walking.py",
    "chars": 2953,
    "preview": "\"\"\"\nThis is adapted from Denny Britz's repository. I will modify as needed to\nreplicate existing literature results.\n\"\"\""
  },
  {
    "path": "lib/envs/gridworld.py",
    "chars": 3488,
    "preview": "import numpy as np\nimport sys\nfrom gym.envs.toy_text import discrete\n\nUP = 0\nRIGHT = 1\nDOWN = 2\nLEFT = 3\n\nclass Gridworl"
  },
  {
    "path": "lib/envs/two_room_domain.py",
    "chars": 5364,
    "preview": "\"\"\"\n(c) December 2016 by Daniel Seita\n\nThis implements the two room domain, as described in the experiment of:\n\n    Prin"
  },
  {
    "path": "lib/envs/windy_gridworld.py",
    "chars": 2509,
    "preview": "import gym\nimport numpy as np\nimport sys\nfrom gym.envs.toy_text import discrete\n\nUP = 0\nRIGHT = 1\nDOWN = 2\nLEFT = 3\n\ncla"
  },
  {
    "path": "lib/plotting.py",
    "chars": 3832,
    "preview": "import matplotlib\nimport numpy as np\nimport pandas as pd\nfrom collections import namedtuple\nfrom matplotlib import pyplo"
  },
  {
    "path": "q_learning/Q-Learning.py",
    "chars": 6414,
    "preview": "\"\"\"\nCode for basic tabular Q-learning. This is adapted from Denny Britz's\nrepository. I updated it to use a class to mor"
  },
  {
    "path": "q_learning/README.md",
    "chars": 2054,
    "preview": "# Standard Tabular Q-learning\n\n## Cliff World\n\n### Environment\n\nI tested with `CliffWorldEnv` using the following settin"
  },
  {
    "path": "q_learning/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "trpo/README.md",
    "chars": 332,
    "preview": "# Trust Region Policy Optimization\n\nCode outline:\n\n- `main.py` sets up the options and the top-level call to TRPO.\n- `tr"
  },
  {
    "path": "trpo/fxn_approx.py",
    "chars": 4511,
    "preview": "\"\"\"\nThis will make some function approximators that we can use, particularly: linear\nand neural network value functions."
  },
  {
    "path": "trpo/main.py",
    "chars": 3752,
    "preview": "\"\"\"\nThis is the main point for the Trust Region Policy Optimization (TRPO)\nalgorithm. Call this code using one of the ba"
  },
  {
    "path": "trpo/trpo.py",
    "chars": 19601,
    "preview": "\"\"\"\nThis contains the TRPO class. Following John's code, this will contain the bulk\nof the Tensorflow construction and r"
  },
  {
    "path": "trpo/utils_trpo.py",
    "chars": 4843,
    "preview": "\"\"\"\nSome TRPO-specific stuff, which conatins the conjugate gradient and backtracking\nline search methods. Both of these "
  },
  {
    "path": "utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "utils/logz.py",
    "chars": 3040,
    "preview": "\"\"\"\nSome simple logging functionality, inspired by rllab's logging.\nAssumes that each diagnostic gets logged each iterat"
  },
  {
    "path": "utils/policies.py",
    "chars": 9618,
    "preview": "\"\"\"\nFor managing policies. This seems like a better way to organize things. For now,\nwe have **stochastic** policies. Th"
  },
  {
    "path": "utils/utils_pg.py",
    "chars": 5503,
    "preview": "\"\"\"\nSeveral utilities to reduce clutter in my policy gradient codes.\n\n(c) April-June 2017 (mostly) by Daniel Seita\n\"\"\"\n\n"
  },
  {
    "path": "utils/value_functions.py",
    "chars": 4598,
    "preview": "\"\"\"\nValue functions, for now we apply them to policy gradients. This uses Python3\nimporting syntax.\n\"\"\"\n\nimport numpy as"
  },
  {
    "path": "vpg/README.md",
    "chars": 2928,
    "preview": "# Vanilla Policy Gradients\n\nThis is the standard vanilla policy gradients with stochastic policies, either\ncontinuous or"
  },
  {
    "path": "vpg/bash_scripts/CartPole-v0.sh",
    "chars": 542,
    "preview": "python main.py CartPole-v0 --vf_type linear --seed 0 --initial_stepsize 0.01 --n_iter 100\npython main.py CartPole-v0 --v"
  },
  {
    "path": "vpg/bash_scripts/Pendulum-v0.sh",
    "chars": 624,
    "preview": "python main.py Pendulum-v0 --vf_type linear --seed 0 --desired_kl 2e-3 --use_kl_heuristic --n_iter 400 \npython main.py P"
  },
  {
    "path": "vpg/bash_scripts/halfcheetah.sh",
    "chars": 434,
    "preview": "#!/bin/bash\npython main.py HalfCheetah-v1 --vf_type linear --seed 4 --n_iter 3000\npython main.py HalfCheetah-v1 --vf_typ"
  },
  {
    "path": "vpg/bash_scripts/hopper.sh",
    "chars": 404,
    "preview": "#!/bin/bash\npython main.py Hopper-v1 --vf_type linear --seed 4 --n_iter 3000\npython main.py Hopper-v1 --vf_type nn     -"
  },
  {
    "path": "vpg/bash_scripts/walker.sh",
    "chars": 420,
    "preview": "#!/bin/bash\npython main.py Walker2d-v1 --vf_type linear --seed 4 --n_iter 3000\npython main.py Walker2d-v1 --vf_type nn  "
  },
  {
    "path": "vpg/main.py",
    "chars": 8664,
    "preview": "\"\"\"\nVanilla Policy Gradients, aka REINFORCE, aka Monte Carlo Policy Gradients.\n\nTo quickly test you can do:\n\n    python "
  },
  {
    "path": "vpg/plot_learning_curves.py",
    "chars": 4205,
    "preview": "\"\"\" \nTo plot, you need to provide the experiment directory. \n\npython plot_learning_curves.py outputs/Pendulum-v0 --out f"
  }
]

// ... and 6 more files (download for full content)

About this extraction

This page contains the full source code of the DanielTakeshi/rl_algorithms GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 89 files (2.9 MB), approximately 750.9k tokens, and a symbol index with 305 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
