Repository: PacktPublishing/Reinforcement-Learning-Algorithms-with-Python
Branch: master
Commit: d144d314b3b5
Files: 26
Total size: 205.3 KB

Directory structure:
gitextract_o5sg2_86/

├── Chapter02/
│   └── Code.ipynb
├── Chapter03/
│   ├── frozenlake8x8_policyiteration.py
│   └── frozenlake8x8_valueiteration.py
├── Chapter04/
│   └── SARSA Q_learning Taxi-v2.py
├── Chapter05/
│   ├── .ipynb_checkpoints/
│   │   └── Untitled-checkpoint.ipynb
│   ├── DQN_Atari.py
│   ├── DQN_variations_Atari.py
│   ├── Untitled.ipynb
│   ├── atari_wrappers.py
│   └── untitled
├── Chapter06/
│   ├── AC.py
│   ├── REINFORCE.py
│   └── REINFORCE_baseline.py
├── Chapter07/
│   ├── PPO.py
│   └── TRPO.py
├── Chapter08/
│   ├── DDPG.py
│   └── TD3.py
├── Chapter09/
│   └── ME-TRPO.py
├── Chapter10/
│   ├── DAgger.py
│   └── expert/
│       ├── checkpoint
│       ├── model.ckpt.data-00000-of-00001
│       ├── model.ckpt.index
│       └── model.ckpt.meta
├── Chapter11/
│   └── ES.py
├── Chapter12/
│   └── ESBAS.py
└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: Chapter02/Code.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TensorFlow installation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pip3 install tensorflow`\n",
    "\n",
    "or\n",
    "\n",
    "`pip3 install tensorflow-gpu`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### OpenAI Gym installation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On OSX: \n",
    "\n",
    "`brew install cmake boost boost-python sdl2 swig wget`\n",
    " \n",
    "On Ubuntu 16.04:\n",
    "\n",
    "`apt-get install -y python-pyglet python3-opengl zlib1g-dev libjpeg-dev patchelf cmake swig libboost-all-dev libsdl2-dev libosmesa6-dev xvfb ffmpeg`\n",
    "\n",
    "On Ubuntu 18.04\n",
    "\n",
    "`sudo apt install -y python3-dev zlib1g-dev libjpeg-dev cmake swig python-pyglet python3-opengl libboost-all-dev libsdl2-dev libosmesa6-dev patchelf ffmpeg xvfb `\n",
    "\n",
    "Then:\n",
    "\n",
    "```\n",
    "git clone https://github.com/openai/gym.git \n",
    "\n",
    "cd gym\n",
    "\n",
    "pip install -e '.[all]'\n",
    "```\n",
    "\n",
    "PyBox2D:\n",
    "\n",
    "```\n",
    "git clone https://github.com/pybox2d/pybox2d\n",
    "cd pybox2d\n",
    "pip3 install -e .\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Duckietown installation\n",
    "\n",
    "```\n",
    "git clone https://github.com/duckietown/gym-duckietown.git\n",
    "cd gym-duckietown\n",
    "pip3 install -e .\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Roboschool installation\n",
    "\n",
    "```\n",
    "git clone https://github.com/openai/roboschool\n",
    "cd roboschool\n",
    "ROBOSCHOOL_PATH=`pwd`\n",
    "git clone https://github.com/olegklimov/bullet3 -b roboschool_self_collision\n",
    "mkdir bullet3/build\n",
    "cd    bullet3/build\n",
    "cmake -DBUILD_SHARED_LIBS=ON -DUSE_DOUBLE_PRECISION=1 -DCMAKE_INSTALL_PREFIX:PATH=$ROBOSCHOOL_PATH/roboschool/cpp-household/bullet_local_install -DBUILD_CPU_DEMOS=OFF -DBUILD_BULLET2_DEMOS=OFF -DBUILD_EXTRAS=OFF  -DBUILD_UNIT_TESTS=OFF -DBUILD_CLSOCKET=OFF -DBUILD_ENET=OFF -DBUILD_OPENGL3_DEMOS=OFF ..\n",
    "\n",
    "make -j4\n",
    "make install\n",
    "cd ../..\n",
    "pip3 install -e $ROBOSCHOOL_PATH\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## RL cycle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "import gym\n",
    "\n",
    "# create the environment \n",
    "env = gym.make(\"CartPole-v1\")\n",
    "# reset the environment before starting\n",
    "env.reset()\n",
    "\n",
    "# loop 10 times\n",
    "for i in range(10):\n",
    "    # take a random action\n",
    "    env.step(env.action_space.sample())\n",
    "    # render the game\n",
    "    env.render()\n",
    "\n",
    "# close the environment\n",
    "env.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.\u001b[0m\n",
      "Episode 0 finished, reward:15\n",
      "Episode 1 finished, reward:13\n",
      "Episode 2 finished, reward:20\n",
      "Episode 3 finished, reward:22\n",
      "Episode 4 finished, reward:13\n",
      "Episode 5 finished, reward:18\n",
      "Episode 6 finished, reward:15\n",
      "Episode 7 finished, reward:12\n",
      "Episode 8 finished, reward:58\n",
      "Episode 9 finished, reward:15\n"
     ]
    }
   ],
   "source": [
    "import gym\n",
    "\n",
    "# create and initialize the environment\n",
    "env = gym.make(\"CartPole-v1\")\n",
    "env.reset()\n",
    "\n",
    "# play 10 games\n",
    "for i in range(10):\n",
    "    # initialize the variables\n",
    "    done = False\n",
    "    game_rew = 0\n",
    "\n",
    "    while not done:\n",
    "        # choose a random action\n",
    "        action = env.action_space.sample()\n",
    "        # take a step in the environment\n",
    "        new_obs, rew, done, info = env.step(action)\n",
    "        game_rew += rew\n",
    "    \n",
    "        # when is done, print the cumulative reward of the game and reset the environment\n",
    "        if done:\n",
    "            print('Episode %d finished, reward:%d' % (i, game_rew))\n",
    "            env.reset()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.\u001b[0m\n",
      "Box(4,)\n"
     ]
    }
   ],
   "source": [
    "import gym\n",
    "\n",
    "env = gym.make('CartPole-v1')\n",
    "print(env.observation_space)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Discrete(2)\n"
     ]
    }
   ],
   "source": [
    "print(env.action_space)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "0\n",
      "0\n"
     ]
    }
   ],
   "source": [
    "print(env.action_space.sample())\n",
    "print(env.action_space.sample())\n",
    "print(env.action_space.sample())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]\n"
     ]
    }
   ],
   "source": [
    "print(env.observation_space.low)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]\n"
     ]
    }
   ],
   "source": [
    "print(env.observation_space.high)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TensorFlow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\users\\andrea\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\h5py\\__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tensor(\"add:0\", shape=(), dtype=int32)\n",
      "7\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "# create two constants: a and b\n",
    "a = tf.constant(4)\n",
    "b = tf.constant(3)\n",
    "\n",
    "# perform a computation\n",
    "c = a + b\n",
    "print(c) # print the shape of c\n",
    "\n",
    "# create a session\n",
    "session = tf.Session()\n",
    "# run the session. It compute the sum\n",
    "res = session.run(c)\n",
    "print(res) # print the actual result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reset the graph\n",
    "tf.reset_default_graph()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tensor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "()\n"
     ]
    }
   ],
   "source": [
    "a = tf.constant(1)\n",
    "print(a.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(5,)\n"
     ]
    }
   ],
   "source": [
    "# array of five elements\n",
    "b = tf.constant([1,2,3,4,5])\n",
    "print(b.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1 2 3]\n"
     ]
    }
   ],
   "source": [
    "#NB: a can be of any type of tensor\n",
    "a = tf.constant([1,2,3,4,5])\n",
    "first_three_elem = a[:3]\n",
    "fourth_elem = a[3]\n",
    "\n",
    "sess = tf.Session()\n",
    "print(sess.run(first_three_elem))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4\n"
     ]
    }
   ],
   "source": [
    "print(sess.run(fourth_elem))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Constant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tensor(\"a_const:0\", shape=(4,), dtype=float32)\n"
     ]
    }
   ],
   "source": [
    "a = tf.constant([1.0, 1.1, 2.1, 3.1], dtype=tf.float32, name='a_const')\n",
    "print(a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Placeholder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[10.1 10.2 10.3]]\n"
     ]
    }
   ],
   "source": [
    "a = tf.placeholder(shape=(1,3), dtype=tf.float32)\n",
    "b = tf.constant([[10,10,10]], dtype=tf.float32)\n",
    "\n",
    "c = a + b\n",
    "\n",
    "sess = tf.Session()\n",
    "res = sess.run(c, feed_dict={a:[[0.1,0.2,0.3]]})\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf.reset_default_graph()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tensor(\"Placeholder:0\", shape=(?, 3), dtype=float32)\n",
      "[[10.1 10.2 10.3]]\n",
      "[[7. 7. 7.]\n",
      " [7. 7. 7.]]\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "# NB: the fist dimension is 'None', meaning that it can be of any lenght\n",
    "a = tf.placeholder(shape=(None,3), dtype=tf.float32)\n",
    "b = tf.placeholder(shape=(None,3), dtype=tf.float32)\n",
    "\n",
    "c = a + b\n",
    "\n",
    "print(a)\n",
    "\n",
    "sess = tf.Session()\n",
    "print(sess.run(c, feed_dict={a:[[0.1,0.2,0.3]], b:[[10,10,10]]}))\n",
    "\n",
    "v_a = np.array([[1,2,3],[4,5,6]])\n",
    "v_b = np.array([[6,5,4],[3,2,1]])\n",
    "print(sess.run(c, feed_dict={a:v_a, b:v_b}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[10.1 10.2 10.3]]\n"
     ]
    }
   ],
   "source": [
    "sess = tf.Session()\n",
    "print(sess.run(c, feed_dict={a:[[0.1,0.2,0.3]], b:[[10,10,10]]}))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.4478302  0.7014905  0.36300516]]\n",
      "[[4 5]]\n"
     ]
    }
   ],
   "source": [
    "tf.reset_default_graph()\n",
    "\n",
    "# variable initialized using the glorot uniform initializer\n",
    "var = tf.get_variable(\"first_variable\", shape=[1,3], dtype=tf.float32, initializer=tf.glorot_uniform_initializer)\n",
    "\n",
    "# variable initialized with constant values\n",
    "init_val = np.array([4,5])\n",
    "var2 = tf.get_variable(\"second_variable\", shape=[1,2], dtype=tf.int32, initializer=tf.constant_initializer(init_val))\n",
    "\n",
    "# create the session\n",
    "sess = tf.Session()\n",
    "# initialize all the variables\n",
    "sess.run(tf.global_variables_initializer())\n",
    "\n",
    "print(sess.run(var))\n",
    "\n",
    "print(sess.run(var2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# not trainable variable\n",
    "var2 = tf.get_variable(\"variable\", shape=[1,2], trainable=False, dtype=tf.int32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<tf.Variable 'first_variable:0' shape=(1, 3) dtype=float32_ref>, <tf.Variable 'second_variable:0' shape=(1, 2) dtype=int32_ref>, <tf.Variable 'variable:0' shape=(1, 2) dtype=int32_ref>]\n"
     ]
    }
   ],
   "source": [
    "print(tf.global_variables())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-0.015899599"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.reset_default_graph()\n",
    "\n",
    "const1 = tf.constant(3.0, name='constant1')\n",
    "\n",
    "var = tf.get_variable(\"variable1\", shape=[1,2], dtype=tf.float32)\n",
    "var2 = tf.get_variable(\"variable2\", shape=[1,2], trainable=False, dtype=tf.float32)\n",
    "\n",
    "op1 = const1 * var\n",
    "op2 = op1 + var2\n",
    "op3 = tf.reduce_mean(op2)\n",
    "\n",
    "sess = tf.Session()\n",
    "sess.run(tf.global_variables_initializer())\n",
    "sess.run(op3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simple Linear Regression Example\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch:   0, MSE: 4617.4390, W: 1.295, b: -0.407\n",
      "Epoch:  40, MSE: 5.3334, W: 0.496, b: -0.727\n",
      "Epoch:  80, MSE: 4.5894, W: 0.529, b: -0.012\n",
      "Epoch: 120, MSE: 4.1029, W: 0.512, b: 0.608\n",
      "Epoch: 160, MSE: 3.8552, W: 0.506, b: 1.092\n",
      "Epoch: 200, MSE: 3.7597, W: 0.501, b: 1.418\n",
      "Final weight: 0.500, bias: 1.473\n"
     ]
    }
   ],
   "source": [
    "tf.reset_default_graph()\n",
    "\n",
    "np.random.seed(10)\n",
    "tf.set_random_seed(10)\n",
    "\n",
    "W, b = 0.5, 1.4\n",
    "# create a dataset of 100 examples\n",
    "X = np.linspace(0,100, num=100)\n",
    "# add random noise to the y labels\n",
    "y = np.random.normal(loc=W * X + b, scale=2.0, size=len(X))\n",
    "\n",
    "# create the placeholders\n",
    "x_ph = tf.placeholder(shape=[None,], dtype=tf.float32)\n",
    "y_ph = tf.placeholder(shape=[None,], dtype=tf.float32)\n",
    "\n",
    "# create the variables.\n",
    "v_weight = tf.get_variable(\"weight\", shape=[1], dtype=tf.float32)\n",
    "v_bias = tf.get_variable(\"bias\", shape=[1], dtype=tf.float32)\n",
    "\n",
    "# linear computation\n",
    "out = v_weight * x_ph + v_bias\n",
    "\n",
    "# compute the Mean Squared Error\n",
    "loss = tf.reduce_mean((out - y_ph)**2)\n",
    "\n",
    "# optimizer\n",
    "opt = tf.train.AdamOptimizer(0.4).minimize(loss)\n",
    "\n",
    "# create the session\n",
    "session = tf.Session()\n",
    "session.run(tf.global_variables_initializer())\n",
    "\n",
    "# loop to train the parameters\n",
    "for ep in range(210):\n",
    "    # run the optimizer and get the loss\n",
    "    train_loss, _ = session.run([loss, opt], feed_dict={x_ph:X, y_ph:y})\n",
    " \n",
    "    # print epoch number and loss\n",
    "    if ep % 40 == 0:\n",
    "        print('Epoch: %3d, MSE: %.4f, W: %.3f, b: %.3f' % (ep, train_loss, session.run(v_weight), session.run(v_bias)))\n",
    "        \n",
    "print('Final weight: %.3f, bias: %.3f' % (session.run(v_weight), session.run(v_bias)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### .. with TensorBoard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch:   0, MSE: 4617.4390, W: 1.295, b: -0.407\n",
      "Epoch:  40, MSE: 5.3334, W: 0.496, b: -0.727\n",
      "Epoch:  80, MSE: 4.5894, W: 0.529, b: -0.012\n",
      "Epoch: 120, MSE: 4.1029, W: 0.512, b: 0.608\n",
      "Epoch: 160, MSE: 3.8552, W: 0.506, b: 1.092\n",
      "Epoch: 200, MSE: 3.7597, W: 0.501, b: 1.418\n",
      "Final weight: 0.500, bias: 1.473\n"
     ]
    }
   ],
   "source": [
    "from datetime import datetime\n",
    "\n",
    "tf.reset_default_graph()\n",
    "\n",
    "np.random.seed(10)\n",
    "tf.set_random_seed(10)\n",
    "\n",
    "W, b = 0.5, 1.4\n",
    "# create a dataset of 100 examples\n",
    "X = np.linspace(0,100, num=100)\n",
    "# add random noise to the y labels\n",
    "y = np.random.normal(loc=W * X + b, scale=2.0, size=len(X))\n",
    "\n",
    "# create the placeholders\n",
    "x_ph = tf.placeholder(shape=[None,], dtype=tf.float32)\n",
    "y_ph = tf.placeholder(shape=[None,], dtype=tf.float32)\n",
    "\n",
    "# create the variables.\n",
    "v_weight = tf.get_variable(\"weight\", shape=[1], dtype=tf.float32)\n",
    "v_bias = tf.get_variable(\"bias\", shape=[1], dtype=tf.float32)\n",
    "\n",
    "# linear computation\n",
    "out = v_weight * x_ph + v_bias\n",
    "\n",
    "# compute the Mean Squared Error\n",
    "loss = tf.reduce_mean((out - y_ph)**2)\n",
    "\n",
    "# optimizer\n",
    "opt = tf.train.AdamOptimizer(0.4).minimize(loss)\n",
    "\n",
    "\n",
    "tf.summary.scalar('MSEloss', loss)\n",
    "tf.summary.histogram('model_weight', v_weight)\n",
    "tf.summary.histogram('model_bias', v_bias)\n",
    "all_summary = tf.summary.merge_all()\n",
    "\n",
    "now = datetime.now()\n",
    "clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, now.second)\n",
    "file_writer = tf.summary.FileWriter('log_dir/'+clock_time, tf.get_default_graph())\n",
    "\n",
    "\n",
    "# create the session\n",
    "session = tf.Session()\n",
    "session.run(tf.global_variables_initializer())\n",
    "\n",
    "# loop to train the parameters\n",
    "for ep in range(210):\n",
    "    # run the optimizer and get the loss\n",
    "    train_loss, _, train_summary = session.run([loss, opt, all_summary], feed_dict={x_ph:X, y_ph:y})\n",
    "    file_writer.add_summary(train_summary, ep)\n",
    " \n",
    "    # print epoch number and loss\n",
    "    if ep % 40 == 0:\n",
    "        print('Epoch: %3d, MSE: %.4f, W: %.3f, b: %.3f' % (ep, train_loss, session.run(v_weight), session.run(v_bias)))\n",
    "        \n",
    "print('Final weight: %.3f, bias: %.3f' % (session.run(v_weight), session.run(v_bias)))\n",
    "file_writer.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: Chapter03/frozenlake8x8_policyiteration.py
================================================
import numpy as np
import gym

def eval_state_action(V, s, a, gamma=0.99):
    return np.sum([p * (rew + gamma*V[next_s]) for p, next_s, rew, _ in env.P[s][a]])

def policy_evaluation(V, policy, eps=0.0001):
    '''
    Policy evaluation. Update the value function until it reach a steady state
    '''
    while True:
        delta = 0
        # loop over all states
        for s in range(nS):
            old_v = V[s]
            # update V[s] using the Bellman equation
            V[s] = eval_state_action(V, s, policy[s])
            delta = max(delta, np.abs(old_v - V[s]))

        if delta < eps:
            break

def policy_improvement(V, policy):
    '''
    Policy improvement. Update the policy based on the value function
    '''
    policy_stable = True
    for s in range(nS):
        old_a = policy[s]
        # update the policy with the action that bring to the highest state value
        policy[s] = np.argmax([eval_state_action(V, s, a) for a in range(nA)])
        if old_a != policy[s]: 
            policy_stable = False

    return policy_stable


def run_episodes(env, policy, num_games=100):
    '''
    Run some games to test a policy
    '''
    tot_rew = 0
    state = env.reset()

    for _ in range(num_games):
        done = False
        while not done:
            # select the action accordingly to the policy
            next_state, reward, done, _ = env.step(policy[state])
                
            state = next_state
            tot_rew += reward 
            if done:
                state = env.reset()

    print('Won %i of %i games!'%(tot_rew, num_games))

            
if __name__ == '__main__':
    # create the environment
    env = gym.make('FrozenLake-v0')
    # enwrap it to have additional information from it
    env = env.unwrapped

    # spaces dimension
    nA = env.action_space.n
    nS = env.observation_space.n
    
    # initializing value function and policy
    V = np.zeros(nS)
    policy = np.zeros(nS)

    # some useful variable
    policy_stable = False
    it = 0

    while not policy_stable:
        policy_evaluation(V, policy)
        policy_stable = policy_improvement(V, policy)
        it += 1

    print('Converged after %i policy iterations'%(it))
    run_episodes(env, policy)
    print(V.reshape((4,4)))
    print(policy.reshape((4,4)))

================================================
FILE: Chapter03/frozenlake8x8_valueiteration.py
================================================
import numpy as np
import gym

def eval_state_action(V, s, a, gamma=0.99):
    return np.sum([p * (rew + gamma*V[next_s]) for p, next_s, rew, _ in env.P[s][a]])

def value_iteration(eps=0.0001):
    '''
    Value iteration algorithm
    '''
    V = np.zeros(nS)
    it = 0

    while True:
        delta = 0
        # update the value of each state using as "policy" the max operator
        for s in range(nS):
            old_v = V[s]
            V[s] = np.max([eval_state_action(V, s, a) for a in range(nA)])
            delta = max(delta, np.abs(old_v - V[s]))

        if delta < eps:
            break
        else:
            print('Iter:', it, ' delta:', np.round(delta, 5))
        it += 1

    return V

def run_episodes(env, V, num_games=100):
    '''
    Run some test games
    '''
    tot_rew = 0
    state = env.reset()

    for _ in range(num_games):
        done = False
        while not done:
            action = np.argmax([eval_state_action(V, state, a) for a in range(nA)])
            next_state, reward, done, _ = env.step(action)

            state = next_state
            tot_rew += reward 
            if done:
                state = env.reset()

    print('Won %i of %i games!'%(tot_rew, num_games))

            
if __name__ == '__main__':
    # create the environment
    env = gym.make('FrozenLake-v0')
    # enwrap it to have additional information from it
    env = env.unwrapped

    # spaces dimension
    nA = env.action_space.n
    nS = env.observation_space.n

    # Value iteration
    V = value_iteration(eps=0.0001)
    # test the value function on 100 games
    run_episodes(env, V, 100)
    # print the state values
    print(V.reshape((4,4)))


================================================
FILE: Chapter04/SARSA Q_learning Taxi-v2.py
================================================
import numpy as np 
import gym


def eps_greedy(Q, s, eps=0.1):
    '''
    Epsilon greedy policy
    '''
    if np.random.uniform(0,1) < eps:
        # Choose a random action
        return np.random.randint(Q.shape[1])
    else:
        # Choose the action of a greedy policy
        return greedy(Q, s)


def greedy(Q, s):
    '''
    Greedy policy

    return the index corresponding to the maximum action-state value
    '''
    return np.argmax(Q[s])


def run_episodes(env, Q, num_episodes=100, to_print=False):
    '''
    Run some episodes to test the policy
    '''
    tot_rew = []
    state = env.reset()

    for _ in range(num_episodes):
        done = False
        game_rew = 0

        while not done:
            # select a greedy action
            next_state, rew, done, _ = env.step(greedy(Q, state))

            state = next_state
            game_rew += rew 
            if done:
                state = env.reset()
                tot_rew.append(game_rew)

    if to_print:
        print('Mean score: %.3f of %i games!'%(np.mean(tot_rew), num_episodes))

    return np.mean(tot_rew)

def Q_learning(env, lr=0.01, num_episodes=10000, eps=0.3, gamma=0.95, eps_decay=0.00005):
    nA = env.action_space.n
    nS = env.observation_space.n

    # Initialize the Q matrix
    # Q: matrix nS*nA where each row represent a state and each colums represent a different action
    Q = np.zeros((nS, nA))
    games_reward = []
    test_rewards = []

    for ep in range(num_episodes):
        state = env.reset()
        done = False
        tot_rew = 0
        
        # decay the epsilon value until it reaches the threshold of 0.01
        if eps > 0.01:
            eps -= eps_decay

        # loop the main body until the environment stops
        while not done:
            # select an action following the eps-greedy policy
            action = eps_greedy(Q, state, eps)

            next_state, rew, done, _ = env.step(action) # Take one step in the environment

            # Q-learning update the state-action value (get the max Q value for the next state)
            Q[state][action] = Q[state][action] + lr*(rew + gamma*np.max(Q[next_state]) - Q[state][action])

            state = next_state
            tot_rew += rew
            if done:
                games_reward.append(tot_rew)

        # Test the policy every 300 episodes and print the results
        if (ep % 300) == 0:
            test_rew = run_episodes(env, Q, 1000)
            print("Episode:{:5d}  Eps:{:2.4f}  Rew:{:2.4f}".format(ep, eps, test_rew))
            test_rewards.append(test_rew)
            
    return Q


def SARSA(env, lr=0.01, num_episodes=10000, eps=0.3, gamma=0.95, eps_decay=0.00005):
    nA = env.action_space.n
    nS = env.observation_space.n

    # Initialize the Q matrix
    # Q: matrix nS*nA where each row represent a state and each colums represent a different action
    Q = np.zeros((nS, nA))
    games_reward = []
    test_rewards = []

    for ep in range(num_episodes):
        state = env.reset()
        done = False
        tot_rew = 0

        # decay the epsilon value until it reaches the threshold of 0.01
        if eps > 0.01:
            eps -= eps_decay


        action = eps_greedy(Q, state, eps) 

        # loop the main body until the environment stops
        while not done:
            next_state, rew, done, _ = env.step(action) # Take one step in the environment

            # choose the next action (needed for the SARSA update)
            next_action = eps_greedy(Q, next_state, eps) 
            # SARSA update
            Q[state][action] = Q[state][action] + lr*(rew + gamma*Q[next_state][next_action] - Q[state][action])

            state = next_state
            action = next_action
            tot_rew += rew
            if done:
                games_reward.append(tot_rew)

        # Test the policy every 300 episodes and print the results
        if (ep % 300) == 0:
            test_rew = run_episodes(env, Q, 1000)
            print("Episode:{:5d}  Eps:{:2.4f}  Rew:{:2.4f}".format(ep, eps, test_rew))
            test_rewards.append(test_rew)

    return Q


if __name__ == '__main__':
    env = gym.make('Taxi-v2')
    
    Q_qlearning = Q_learning(env, lr=.1, num_episodes=5000, eps=0.4, gamma=0.95, eps_decay=0.001)

    Q_sarsa = SARSA(env, lr=.1, num_episodes=5000, eps=0.4, gamma=0.95, eps_decay=0.001)

================================================
FILE: Chapter05/.ipynb_checkpoints/Untitled-checkpoint.ipynb
================================================
{
 "cells": [],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: Chapter05/DQN_Atari.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
from collections import deque
import time
import sys

from atari_wrappers import make_env


gym.logger.set_level(40)

current_milli_time = lambda: int(round(time.time() * 1000))

def cnn(x):
    '''
    Convolutional neural network
    '''
    x = tf.layers.conv2d(x, filters=16, kernel_size=8, strides=4, padding='valid', activation='relu') 
    x = tf.layers.conv2d(x, filters=32, kernel_size=4, strides=2, padding='valid', activation='relu') 
    return tf.layers.conv2d(x, filters=32, kernel_size=3, strides=1, padding='valid', activation='relu') 
    

def fnn(x, hidden_layers, output_layer, activation=tf.nn.relu, last_activation=None):
    '''
    Feed-forward neural network
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_layer, activation=last_activation)

def qnet(x, hidden_layers, output_size, fnn_activation=tf.nn.relu, last_activation=None):
    '''
    Deep Q network: CNN followed by FNN
    '''
    x = cnn(x)
    x = tf.layers.flatten(x)

    return fnn(x, hidden_layers, output_size, fnn_activation, last_activation)


class ExperienceBuffer():
    '''
    Experience Replay Buffer
    '''
    def __init__(self, buffer_size):
        self.obs_buf = deque(maxlen=buffer_size)
        self.rew_buf = deque(maxlen=buffer_size)
        self.act_buf = deque(maxlen=buffer_size)
        self.obs2_buf = deque(maxlen=buffer_size)
        self.done_buf = deque(maxlen=buffer_size)


    def add(self, obs, rew, act, obs2, done):
        # Add a new transition to the buffers
        self.obs_buf.append(obs)
        self.rew_buf.append(rew)
        self.act_buf.append(act)
        self.obs2_buf.append(obs2)
        self.done_buf.append(done)
        

    def sample_minibatch(self, batch_size):
        # Sample a minibatch of size batch_size
        mb_indices = np.random.randint(len(self.obs_buf), size=batch_size)

        mb_obs = scale_frames([self.obs_buf[i] for i in mb_indices])
        mb_rew = [self.rew_buf[i] for i in mb_indices]
        mb_act = [self.act_buf[i] for i in mb_indices]
        mb_obs2 = scale_frames([self.obs2_buf[i] for i in mb_indices])
        mb_done = [self.done_buf[i] for i in mb_indices]

        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done

    def __len__(self):
        return len(self.obs_buf)


def q_target_values(mini_batch_rw, mini_batch_done, av, discounted_value):   
    '''
    Calculate the target value y for each transition
    '''
    max_av = np.max(av, axis=1)
    
    # if episode terminate, y take value r
    # otherwise, q-learning step
    
    ys = []
    for r, d, av in zip(mini_batch_rw, mini_batch_done, max_av):
        if d:
            ys.append(r)
        else:
            q_step = r + discounted_value * av
            ys.append(q_step)
    
    assert len(ys) == len(mini_batch_rw)
    return ys

def greedy(action_values):
    '''
    Greedy policy
    '''
    return np.argmax(action_values)

def eps_greedy(action_values, eps=0.1):
    '''
    Eps-greedy policy
    '''
    if np.random.uniform(0,1) < eps:
        # Choose a uniform random action
        return np.random.randint(len(action_values))
    else:
        # Choose the greedy action
        return np.argmax(action_values)

def test_agent(env_test, agent_op, num_games=20):
    '''
    Test an agent
    '''
    games_r = []

    for _ in range(num_games):
        d = False
        game_r = 0
        o = env_test.reset()

        while not d:
            # Use an eps-greedy policy with eps=0.05 (to add stochasticity to the policy)
            # Needed because Atari envs are deterministic
            # If you would use a greedy policy, the results will be always the same
            a = eps_greedy(np.squeeze(agent_op(o)), eps=0.05)
            o, r, d, _ = env_test.step(a)

            game_r += r

        games_r.append(game_r)

    return games_r

def scale_frames(frames):
    '''
    Scale the frame with number between 0 and 1
    '''
    return np.array(frames, dtype=np.float32) / 255.0

def DQN(env_name, hidden_sizes=[32], lr=1e-2, num_epochs=2000, buffer_size=100000, discount=0.99, render_cycle=100, update_target_net=1000, 
        batch_size=64, update_freq=4, frames_num=2, min_buffer_size=5000, test_frequency=20, start_explor=1, end_explor=0.1, explor_steps=100000):

    # Create the environment both for train and test
    env = make_env(env_name, frames_num=frames_num, skip_frames=True, noop_num=20)
    env_test = make_env(env_name, frames_num=frames_num, skip_frames=True, noop_num=20)
    # Add a monitor to the test env to store the videos
    env_test = gym.wrappers.Monitor(env_test, "VIDEOS/TEST_VIDEOS"+env_name+str(current_milli_time()),force=True, video_callable=lambda x: x%20==0)

    tf.reset_default_graph()

    obs_dim = env.observation_space.shape
    act_dim = env.action_space.n 

    # Create all the placeholders
    obs_ph = tf.placeholder(shape=(None, obs_dim[0], obs_dim[1], obs_dim[2]), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')
    y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')

    # Create the target network
    with tf.variable_scope('target_network'):
        target_qv = qnet(obs_ph, hidden_sizes, act_dim)
    target_vars = tf.trainable_variables()

    # Create the online network (i.e. the behavior policy)
    with tf.variable_scope('online_network'):
        online_qv = qnet(obs_ph, hidden_sizes, act_dim)
    train_vars = tf.trainable_variables()

    # Update the target network by assigning to it the variables of the online network
    # Note that the target network and the online network have the same exact architecture
    update_target = [train_vars[i].assign(train_vars[i+len(target_vars)]) for i in range(len(train_vars) - len(target_vars))]
    update_target_op = tf.group(*update_target)

    # One hot encoding of the action
    act_onehot = tf.one_hot(act_ph, depth=act_dim)
    # We are interested only in the Q-values of those actions
    q_values = tf.reduce_sum(act_onehot * online_qv, axis=1)
    
    # MSE loss function
    v_loss = tf.reduce_mean((y_ph - q_values)**2)
    # Adam optimize that minimize the loss v_loss
    v_opt = tf.train.AdamOptimizer(lr).minimize(v_loss)

    def agent_op(o):
        '''
        Forward pass to obtain the Q-values from the online network of a single observation
        '''
        # Scale the frames
        o = scale_frames(o)
        return sess.run(online_qv, feed_dict={obs_ph:[o]})

    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, int(now.second))
    print('Time:', clock_time)

    mr_v = tf.Variable(0.0)
    ml_v = tf.Variable(0.0)


    # TensorBoard summaries
    tf.summary.scalar('v_loss', v_loss)
    tf.summary.scalar('Q-value', tf.reduce_mean(q_values))
    tf.summary.histogram('Q-values', q_values)

    scalar_summary = tf.summary.merge_all()
    reward_summary = tf.summary.scalar('test_rew', mr_v) 
    mean_loss_summary = tf.summary.scalar('mean_loss', ml_v)

    LOG_DIR = 'log_dir/'+env_name
    hyp_str = "-lr_{}-upTN_{}-upF_{}-frms_{}" .format(lr, update_target_net, update_freq, frames_num)

    # initialize the File Writer for writing TensorBoard summaries
    file_writer = tf.summary.FileWriter(LOG_DIR+'/DQN_'+clock_time+'_'+hyp_str, tf.get_default_graph())

    # open a session
    sess = tf.Session()
    # and initialize all the variables
    sess.run(tf.global_variables_initializer())
    
    render_the_game = False
    step_count = 0
    last_update_loss = []
    ep_time = current_milli_time()
    batch_rew = []
    old_step_count = 0

    obs = env.reset()

    # Initialize the experience buffer
    buffer = ExperienceBuffer(buffer_size)
    
    # Copy the online network in the target network
    sess.run(update_target_op)

    ########## EXPLORATION INITIALIZATION ######
    eps = start_explor
    eps_decay = (start_explor - end_explor) / explor_steps

    for ep in range(num_epochs):
        g_rew = 0
        done = False

        # Until the environment does not end..
        while not done:
                
            # Epsilon decay
            if eps > end_explor:
                eps -= eps_decay

            # Choose an eps-greedy action 
            act = eps_greedy(np.squeeze(agent_op(obs)), eps=eps)

            # execute the action in the environment
            obs2, rew, done, _ = env.step(act)

            # Render the game if you want to
            if render_the_game:
                env.render()

            # Add the transition to the replay buffer
            buffer.add(obs, rew, act, obs2, done)

            obs = obs2
            g_rew += rew
            step_count += 1

            ################ TRAINING ###############
            # If it's time to train the network:
            if len(buffer) > min_buffer_size and (step_count % update_freq == 0):
                
                # sample a minibatch from the buffer
                mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)

 
                mb_trg_qv = sess.run(target_qv, feed_dict={obs_ph:mb_obs2})
                y_r = q_target_values(mb_rew, mb_done, mb_trg_qv, discount)

                # TRAINING STEP
                # optimize, compute the loss and return the TB summary
                train_summary, train_loss, _ = sess.run([scalar_summary, v_loss, v_opt], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph: mb_act})

                # Add the train summary to the file_writer
                file_writer.add_summary(train_summary, step_count)
                last_update_loss.append(train_loss)

            # Every update_target_net steps, update the target network
            if (len(buffer) > min_buffer_size) and (step_count % update_target_net == 0):

                # run the session to update the target network and get the mean loss sumamry 
                _, train_summary = sess.run([update_target_op, mean_loss_summary], feed_dict={ml_v:np.mean(last_update_loss)})
                file_writer.add_summary(train_summary, step_count)
                last_update_loss = []


            # If the environment is ended, reset it and initialize the variables
            if done:
                obs = env.reset()
                batch_rew.append(g_rew)
                g_rew, render_the_game = 0, False

        # every test_frequency episodes, test the agent and write some stats in TensorBoard
        if ep % test_frequency == 0:
            # Test the agent to 10 games
            test_rw = test_agent(env_test, agent_op, num_games=10)

            # Run the test stats and add them to the file_writer
            test_summary = sess.run(reward_summary, feed_dict={mr_v: np.mean(test_rw)})
            file_writer.add_summary(test_summary, step_count)

            # Print some useful stats
            ep_sec_time = int((current_milli_time()-ep_time) / 1000)
            print('Ep:%4d Rew:%4.2f, Eps:%2.2f -- Step:%5d -- Test:%4.2f %4.2f -- Time:%d -- Ep_Steps:%d' %
                        (ep,np.mean(batch_rew), eps, step_count, np.mean(test_rw), np.std(test_rw), ep_sec_time, (step_count-old_step_count)/test_frequency))

            ep_time = current_milli_time()
            batch_rew = []
            old_step_count = step_count
                            
        if ep % render_cycle == 0:
            render_the_game = True

    file_writer.close()
    env.close()


if __name__ == '__main__':

    DQN('PongNoFrameskip-v4', hidden_sizes=[128], lr=2e-4, buffer_size=100000, update_target_net=1000, batch_size=32, 
        update_freq=2, frames_num=2, min_buffer_size=10000, render_cycle=10000)

================================================
FILE: Chapter05/DQN_variations_Atari.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
from collections import deque
import time
import sys

from atari_wrappers import make_env


gym.logger.set_level(40)

current_milli_time = lambda: int(round(time.time() * 1000))


def cnn(x):
    '''
    Convolutional neural network
    '''
    x = tf.layers.conv2d(x, filters=16, kernel_size=8, strides=4, padding='valid', activation='relu') 
    x = tf.layers.conv2d(x, filters=32, kernel_size=4, strides=2, padding='valid', activation='relu') 
    return tf.layers.conv2d(x, filters=32, kernel_size=3, strides=1, padding='valid', activation='relu') 
    

def fnn(x, hidden_layers, output_layer, activation=tf.nn.relu, last_activation=None):
    '''
    Feed-forward neural network
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_layer, activation=last_activation)

def qnet(x, hidden_layers, output_size, fnn_activation=tf.nn.relu, last_activation=None):
    '''
    Deep Q network: CNN followed by FNN
    '''
    x = cnn(x)
    x = tf.layers.flatten(x)

    return fnn(x, hidden_layers, output_size, fnn_activation, last_activation)

def greedy(action_values):
    '''
    Greedy policy
    '''
    return np.argmax(action_values)

def eps_greedy(action_values, eps=0.1):
    '''
    Eps-greedy policy
    '''
    if np.random.uniform(0,1) < eps:
        # Choose a uniform random action
        return np.random.randint(len(action_values))
    else:
        # Choose the greedy action
        return np.argmax(action_values)

def q_target_values(mini_batch_rw, mini_batch_done, av, discounted_value):   
    '''
    Calculate the target value y for each transition
    '''
    max_av = np.max(av, axis=1)
    
    # if episode terminate, y take value r
    # otherwise, q-learning step
    
    ys = []
    for r, d, av in zip(mini_batch_rw, mini_batch_done, max_av):
        if d:
            ys.append(r)
        else:
            q_step = r + discounted_value * av
            ys.append(q_step)
    
    assert len(ys) == len(mini_batch_rw)
    return ys

def test_agent(env_test, agent_op, num_games=20):
    '''
    Test an agent
    '''
    games_r = []

    for _ in range(num_games):
        d = False
        game_r = 0
        o = env_test.reset()

        while not d:
            # Use an eps-greedy policy with eps=0.05 (to add stochasticity to the policy)
            # Needed because Atari envs are deterministic
            # If you would use a greedy policy, the results will be always the same
            a = eps_greedy(np.squeeze(agent_op(o)), eps=0.05)
            o, r, d, _ = env_test.step(a)

            game_r += r

        games_r.append(game_r)

    return games_r

def scale_frames(frames):
    '''
    Scale the frame with number between 0 and 1
    '''
    return np.array(frames, dtype=np.float32) / 255.0


def dueling_qnet(x, hidden_layers, output_size, fnn_activation=tf.nn.relu, last_activation=None):
    '''
    Dueling neural network
    '''
    x = cnn(x)
    x = tf.layers.flatten(x)

    qf = fnn(x, hidden_layers, 1, fnn_activation, last_activation)
    aaqf = fnn(x, hidden_layers, output_size, fnn_activation, last_activation)

    return qf + aaqf - tf.reduce_mean(aaqf)

def double_q_target_values(mini_batch_rw, mini_batch_done, target_qv, online_qv, discounted_value):   ## IS THE NAME CORRECT???
    '''
    Calculate the target value y following the double Q-learning update
    '''
    argmax_online_qv = np.argmax(online_qv, axis=1)
    
    # if episode terminate, y take value r
    # otherwise, q-learning step
    
    ys = []
    assert len(mini_batch_rw) == len(mini_batch_done) == len(target_qv) == len(argmax_online_qv)
    for r, d, t_av, arg_a in zip(mini_batch_rw, mini_batch_done, target_qv, argmax_online_qv):
        if d:
            ys.append(r)
        else:
            q_value = r + discounted_value * t_av[arg_a]
            ys.append(q_value)
    
    assert len(ys) == len(mini_batch_rw)

    return ys

class MultiStepExperienceBuffer():
    '''
    Experience Replay Buffer for multi-step learning
    '''
    def __init__(self, buffer_size, n_step, gamma):
        self.obs_buf = deque(maxlen=buffer_size)
        self.act_buf = deque(maxlen=buffer_size)

        self.n_obs_buf = deque(maxlen=buffer_size)
        self.n_done_buf = deque(maxlen=buffer_size)
        self.n_rew_buf = deque(maxlen=buffer_size)

        self.n_step = n_step
        self.last_rews = deque(maxlen=self.n_step+1)
        self.gamma = gamma


    def add(self, obs, rew, act, obs2, done):
        self.obs_buf.append(obs)
        self.act_buf.append(act)
        # the following buffers will be updated in the next n_step steps
        # their values are not known, yet
        self.n_obs_buf.append(None)
        self.n_rew_buf.append(None)
        self.n_done_buf.append(None)

        self.last_rews.append(rew)

        ln = len(self.obs_buf)
        len_rews = len(self.last_rews)

        # Update the indices of the buffer that are n_steps old
        if done:
            # In case it's the last step, update up to the n_steps indices fo the buffer
            # it cannot update more than len(last_rews), otherwise will update the previous traj
            for i in range(len_rews):
                self.n_obs_buf[ln-(len_rews-i-1)-1] = obs2
                self.n_done_buf[ln-(len_rews-i-1)-1] = done
                rgt = np.sum([(self.gamma**k)*r for k,r in enumerate(np.array(self.last_rews)[i:len_rews])])
                self.n_rew_buf[ln-(len_rews-i-1)-1] = rgt

            # reset the reward deque
            self.last_rews = deque(maxlen=self.n_step+1)
        else:
            # Update the elements of the buffer that has been added n_step steps ago
            # Add only if the multi-step values are updated
            if len(self.last_rews) >= (self.n_step+1):
                self.n_obs_buf[ln-self.n_step-1] = obs2
                self.n_done_buf[ln-self.n_step-1] = done
                rgt = np.sum([(self.gamma**k)*r for k,r in enumerate(np.array(self.last_rews)[:len_rews])])
                self.n_rew_buf[ln-self.n_step-1] = rgt
        

    def sample_minibatch(self, batch_size):
        # Sample a minibatch of size batch_size
        # Note: the samples should be at least of n_step steps ago
        mb_indices = np.random.randint(len(self.obs_buf)-self.n_step, size=batch_size)

        mb_obs = scale_frames([self.obs_buf[i] for i in mb_indices])
        mb_rew = [self.n_rew_buf[i] for i in mb_indices]
        mb_act = [self.act_buf[i] for i in mb_indices]
        mb_obs2 = scale_frames([self.n_obs_buf[i] for i in mb_indices])
        mb_done = [self.n_done_buf[i] for i in mb_indices]

        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done

    def __len__(self):
        return len(self.obs_buf)

def DQN_with_variations(env_name, extensions_hyp, hidden_sizes=[32], lr=1e-2, num_epochs=2000, buffer_size=100000, discount=0.99, render_cycle=100, update_target_net=1000, 
        batch_size=64, update_freq=4, frames_num=2, min_buffer_size=5000, test_frequency=20, start_explor=1, end_explor=0.1, explor_steps=100000):

    # Create the environment both for train and test
    env = make_env(env_name, frames_num=frames_num, skip_frames=True, noop_num=20)
    env_test = make_env(env_name, frames_num=frames_num, skip_frames=True, noop_num=20)
    # Add a monitor to the test env to store the videos
    env_test = gym.wrappers.Monitor(env_test, "VIDEOS/TEST_VIDEOS"+env_name+str(current_milli_time()),force=True, video_callable=lambda x: x%20==0)

    tf.reset_default_graph()

    obs_dim = env.observation_space.shape
    act_dim = env.action_space.n 

    # Create all the placeholders
    obs_ph = tf.placeholder(shape=(None, obs_dim[0], obs_dim[1], obs_dim[2]), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')
    y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')

    # Create the target network
    with tf.variable_scope('target_network'):
        if extensions_hyp['dueling']:
            target_qv = dueling_qnet(obs_ph, hidden_sizes, act_dim)
        else:
            target_qv = qnet(obs_ph, hidden_sizes, act_dim)
    target_vars = tf.trainable_variables()

    # Create the online network (i.e. the behavior policy)
    with tf.variable_scope('online_network'):
        if extensions_hyp['dueling']:
            online_qv = dueling_qnet(obs_ph, hidden_sizes, act_dim)
        else:
            online_qv = qnet(obs_ph, hidden_sizes, act_dim)
    train_vars = tf.trainable_variables()

    # Update the target network by assigning to it the variables of the online network
    # Note that the target network and the online network have the same exact architecture
    update_target = [train_vars[i].assign(train_vars[i+len(target_vars)]) for i in range(len(train_vars) - len(target_vars))]
    update_target_op = tf.group(*update_target)

    # One hot encoding of the action
    act_onehot = tf.one_hot(act_ph, depth=act_dim)
    # We are interested only in the Q-values of those actions
    q_values = tf.reduce_sum(act_onehot * online_qv, axis=1)
    
    # MSE loss function
    v_loss = tf.reduce_mean((y_ph - q_values)**2)
    # Adam optimize that minimize the loss v_loss
    v_opt = tf.train.AdamOptimizer(lr).minimize(v_loss)

    def agent_op(o):
        '''
        Forward pass to obtain the Q-values from the online network of a single observation
        '''
        # Scale the frames
        o = scale_frames(o)
        return sess.run(online_qv, feed_dict={obs_ph:[o]})

    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, int(now.second))
    print('Time:', clock_time)

    mr_v = tf.Variable(0.0)
    ml_v = tf.Variable(0.0)


    # TensorBoard summaries
    tf.summary.scalar('v_loss', v_loss)
    tf.summary.scalar('Q-value', tf.reduce_mean(q_values))
    tf.summary.histogram('Q-values', q_values)

    scalar_summary = tf.summary.merge_all()
    reward_summary = tf.summary.scalar('test_rew', mr_v)
    mean_loss_summary = tf.summary.scalar('mean_loss', ml_v)

    LOG_DIR = 'log_dir/'+env_name
    hyp_str = "-lr_{}-upTN_{}-upF_{}-frms_{}-ddqn_{}-duel_{}-nstep_{}" \
                .format(lr, update_target_net, update_freq, frames_num, extensions_hyp['DDQN'], extensions_hyp['dueling'], extensions_hyp['multi_step'])

    # initialize the File Writer for writing TensorBoard summaries
    file_writer = tf.summary.FileWriter(LOG_DIR+'/DQN_'+clock_time+'_'+hyp_str, tf.get_default_graph())

    # open a session
    sess = tf.Session()
    # and initialize all the variables
    sess.run(tf.global_variables_initializer())
    
    render_the_game = False
    step_count = 0
    last_update_loss = []
    ep_time = current_milli_time()
    batch_rew = []
    old_step_count = 0

    obs = env.reset()

    # Initialize the experience buffer
    #buffer = ExperienceBuffer(buffer_size)
    buffer = MultiStepExperienceBuffer(buffer_size, extensions_hyp['multi_step'], discount)
    
    # Copy the online network in the target network
    sess.run(update_target_op)

    ########## EXPLORATION INITIALIZATION ######
    eps = start_explor
    eps_decay = (start_explor - end_explor) / explor_steps

    for ep in range(num_epochs):
        g_rew = 0
        done = False

        # Until the environment does not end..
        while not done:
                
            # Epsilon decay
            if eps > end_explor:
                eps -= eps_decay

            # Choose an eps-greedy action 
            act = eps_greedy(np.squeeze(agent_op(obs)), eps=eps)

            # execute the action in the environment
            obs2, rew, done, _ = env.step(act)

            # Render the game if you want to
            if render_the_game:
                env.render()

            # Add the transition to the replay buffer
            buffer.add(obs, rew, act, obs2, done)

            obs = obs2
            g_rew += rew
            step_count += 1

            ################ TRAINING ###############
            # If it's time to train the network:
            if len(buffer) > min_buffer_size and (step_count % update_freq == 0):
                
                # sample a minibatch from the buffer
                mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)

                if extensions_hyp['DDQN']:
                    mb_onl_qv, mb_trg_qv = sess.run([online_qv,target_qv], feed_dict={obs_ph:mb_obs2})
                    y_r = double_q_target_values(mb_rew, mb_done, mb_trg_qv, mb_onl_qv, discount)
                else:
                    mb_trg_qv = sess.run(target_qv, feed_dict={obs_ph:mb_obs2})
                    y_r = q_target_values(mb_rew, mb_done, mb_trg_qv, discount)

                # optimize, compute the loss and return the TB summary
                train_summary, train_loss, _ = sess.run([scalar_summary, v_loss, v_opt], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph: mb_act})

                # Add the train summary to the file_writer
                file_writer.add_summary(train_summary, step_count)
                last_update_loss.append(train_loss)

            # Every update_target_net steps, update the target network
            if (len(buffer) > min_buffer_size) and (step_count % update_target_net == 0):

                # run the session to update the target network and get the mean loss sumamry 
                _, train_summary = sess.run([update_target_op, mean_loss_summary], feed_dict={ml_v:np.mean(last_update_loss)})
                file_writer.add_summary(train_summary, step_count)
                last_update_loss = []


            # If the environment is ended, reset it and initialize the variables
            if done:
                obs = env.reset()
                batch_rew.append(g_rew)
                g_rew, render_the_game = 0, False

        # every test_frequency episodes, test the agent and write some stats in TensorBoard
        if ep % test_frequency == 0:
            # Test the agent to 10 games
            test_rw = test_agent(env_test, agent_op, num_games=10)

            # Run the test stats and add them to the file_writer
            test_summary = sess.run(reward_summary, feed_dict={mr_v: np.mean(test_rw)})
            file_writer.add_summary(test_summary, step_count)

            # Print some useful stats
            ep_sec_time = int((current_milli_time()-ep_time) / 1000)
            print('Ep:%4d Rew:%4.2f, Eps:%2.2f -- Step:%5d -- Test:%4.2f %4.2f -- Time:%d -- Ep_Steps:%d' %
                        (ep,np.mean(batch_rew), eps, step_count, np.mean(test_rw), np.std(test_rw), ep_sec_time, (step_count-old_step_count)/test_frequency))

            ep_time = current_milli_time()
            batch_rew = []
            old_step_count = step_count
                            
        if ep % render_cycle == 0:
            render_the_game = True

    file_writer.close()
    env.close()


if __name__ == '__main__':

    extensions_hyp={
        'DDQN':False,
        'dueling':False,
        'multi_step':1
    }
    DQN_with_variations('PongNoFrameskip-v4', extensions_hyp, hidden_sizes=[128], lr=2e-4, buffer_size=100000, update_target_net=1000, batch_size=32, 
        update_freq=2, frames_num=2, min_buffer_size=10000, render_cycle=10000)

================================================
FILE: Chapter05/Untitled.ipynb
================================================
{
 "cells": [],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: Chapter05/atari_wrappers.py
================================================
import numpy as np
import os
from collections import deque
import gym
from gym import spaces
import cv2

''' 
Atari Wrapper copied from https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py
'''

class NoopResetEnv(gym.Wrapper):
    def __init__(self, env, noop_max=30):
        """Sample initial states by taking random number of no-ops on reset.
        No-op is assumed to be action 0.
        """
        gym.Wrapper.__init__(self, env)
        self.noop_max = noop_max
        self.override_num_noops = None
        self.noop_action = 0
        assert env.unwrapped.get_action_meanings()[0] == 'NOOP'

    def reset(self, **kwargs):
        """ Do no-op action for a number of steps in [1, noop_max]."""
        self.env.reset(**kwargs)
        if self.override_num_noops is not None:
            noops = self.override_num_noops
        else:
            noops = self.unwrapped.np_random.randint(1, self.noop_max + 1) #pylint: disable=E1101
        assert noops > 0
        obs = None
        for _ in range(noops):
            obs, _, done, _ = self.env.step(self.noop_action)
            if done:
                obs = self.env.reset(**kwargs)
        return obs

    def step(self, ac):
        return self.env.step(ac)

class LazyFrames(object):
    def __init__(self, frames):
        """This object ensures that common frames between the observations are only stored once.
        It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
        buffers.
        This object should only be converted to numpy array before being passed to the model.
        You'd not believe how complex the previous solution was."""
        self._frames = frames
        self._out = None

    def _force(self):
        if self._out is None:
            self._out = np.concatenate(self._frames, axis=2)
            self._frames = None
        return self._out

    def __array__(self, dtype=None):
        out = self._force()
        if dtype is not None:
            out = out.astype(dtype)
        return out

    def __len__(self):
        return len(self._force())

    def __getitem__(self, i):
        return self._force()[i]

class FireResetEnv(gym.Wrapper):
    def __init__(self, env):
        """Take action on reset for environments that are fixed until firing."""
        gym.Wrapper.__init__(self, env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def reset(self, **kwargs):
        self.env.reset(**kwargs)
        obs, _, done, _ = self.env.step(1)
        if done:
            self.env.reset(**kwargs)
        obs, _, done, _ = self.env.step(2)
        if done:
            self.env.reset(**kwargs)
        return obs

    def step(self, ac):
        return self.env.step(ac)


class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env, skip=4):
        """Return only every `skip`-th frame"""
        gym.Wrapper.__init__(self, env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
        self._skip       = skip

    def step(self, action):
        """Repeat action, sum reward, and max over last observations."""
        total_reward = 0.0
        done = None
        for i in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            if i == self._skip - 2: self._obs_buffer[0] = obs
            if i == self._skip - 1: self._obs_buffer[1] = obs
            total_reward += reward
            if done:
                break
        # Note that the observation on the done=True frame
        # doesn't matter
        max_frame = self._obs_buffer.max(axis=0)

        return max_frame, total_reward, done, info

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)


class WarpFrame(gym.ObservationWrapper):
    def __init__(self, env):
        """Warp frames to 84x84 as done in the Nature paper and later work."""
        gym.ObservationWrapper.__init__(self, env)
        self.width = 84
        self.height = 84
        self.observation_space = spaces.Box(low=0, high=255,
            shape=(self.height, self.width, 1), dtype=np.uint8)

    def observation(self, frame):
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
        return frame[:, :, None]


class FrameStack(gym.Wrapper):
    def __init__(self, env, k):
        """Stack k last frames.
        Returns lazy array, which is much more memory efficient.
        See Also
        baselines.common.atari_wrappers.LazyFrames
        """
        gym.Wrapper.__init__(self, env)
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
        self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=env.observation_space.dtype)

    def reset(self):
        ob = self.env.reset()
        for _ in range(self.k):
            self.frames.append(ob)
        return self._get_ob()

    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.frames.append(ob)
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        assert len(self.frames) == self.k
        return LazyFrames(list(self.frames))


class ScaledFloatFrame(gym.ObservationWrapper):
    def __init__(self, env):
        gym.ObservationWrapper.__init__(self, env)
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)

    def observation(self, observation):
        # careful! This undoes the memory optimization, use
        # with smaller replay buffers only.
        return np.array(observation).astype(np.float32) / 255.0


def make_env(env_name, fire=True, frames_num=2, noop_num=30, skip_frames=True):
    env = gym.make(env_name)
    
    if skip_frames:
        env = MaxAndSkipEnv(env) ## Return only every `skip`-th frame
    if fire:
       env = FireResetEnv(env) ## Fire at the beginning
    env = NoopResetEnv(env, noop_max=noop_num)
    env = WarpFrame(env) ## Reshape image
    env = FrameStack(env, frames_num) ## Stack last 4 frames
    #env = ScaledFloatFrame(env) ## Scale frames
    return env

================================================
FILE: Chapter05/untitled
================================================


================================================
FILE: Chapter06/AC.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
import time


def mlp(x, hidden_layers, output_size, activation=tf.nn.relu, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_size, activation=last_activation)

def softmax_entropy(logits):
    '''
    Softmax Entropy
    '''
    return tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)

def discounted_rewards(rews, last_sv, gamma):
    '''
    Discounted reward to go 

    Parameters:
    ----------
    rews: list of rewards
    last_sv: value of the last state
    gamma: discount value 
    '''
    rtg = np.zeros_like(rews, dtype=np.float32)
    rtg[-1] = rews[-1] + gamma*last_sv
    for i in reversed(range(len(rews)-1)):
        rtg[i] = rews[i] + gamma*rtg[i+1]
    return rtg

class Buffer():
    '''
    Buffer class to store the experience from a unique policy
    '''
    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.obs = []
        self.act = []
        self.ret = []
        self.rtg = []

    def store(self, temp_traj, last_sv):
        '''
        Add temp_traj values to the buffers and compute the advantage and reward to go

        Parameters:
        -----------
        temp_traj: list where each element is a list that contains: observation, reward, action, state-value
        last_sv: value of the last state (Used to Bootstrap)
        '''
        # store only if the temp_traj list is not empty
        if len(temp_traj) > 0:
            self.obs.extend(temp_traj[:,0])
            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)
            self.ret.extend(rtg - temp_traj[:,3])
            self.rtg.extend(rtg)
            self.act.extend(temp_traj[:,2])

    def get_batch(self):
        return self.obs, self.act, self.ret, self.rtg

    def __len__(self):
        assert(len(self.obs) == len(self.act) == len(self.ret) == len(self.rtg))
        return len(self.obs)
    
def AC(env_name, hidden_sizes=[32], ac_lr=5e-3, cr_lr=8e-3, num_epochs=50, gamma=0.99, steps_per_epoch=100, steps_to_print=100):
    '''
    Actor-Critic Algorithm
s
    Parameters:
    -----------
    env_name: Name of the environment
    hidden_size: list of the number of hidden units for each layer
    ac_lr: actor learning rate
    cr_lr: critic learning rate
    num_epochs: number of training epochs
    gamma: discount factor
    steps_per_epoch: number of steps per epoch
    '''
    tf.reset_default_graph()

    env = gym.make(env_name)    

    
    obs_dim = env.observation_space.shape
    act_dim = env.action_space.n 

    # Placeholders
    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')
    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')
    rtg_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='rtg')

    #####################################################
    ########### COMPUTE THE PG LOSS FUNCTIONS ###########
    #####################################################

    # policy
    p_logits = mlp(obs_ph, hidden_sizes, act_dim, activation=tf.tanh)

    act_multn = tf.squeeze(tf.random.multinomial(p_logits, 1))
    actions_mask = tf.one_hot(act_ph, depth=act_dim)
    p_log = tf.reduce_sum(actions_mask * tf.nn.log_softmax(p_logits), axis=1)
    # entropy useful to study the algorithms
    entropy = -tf.reduce_mean(softmax_entropy(p_logits))
    p_loss = -tf.reduce_mean(p_log*ret_ph)

    # policy optimization
    p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss)

    #######################################
    ###########  VALUE FUNCTION ###########
    #######################################
    
    # value function
    s_values = tf.squeeze(mlp(obs_ph, hidden_sizes, 1, activation=tf.tanh))
    # MSE loss function
    v_loss = tf.reduce_mean((rtg_ph - s_values)**2)
    # value function optimization
    v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)

    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, now.second)
    print('Time:', clock_time)


    # Set scalars and hisograms for TensorBoard
    tf.summary.scalar('p_loss', p_loss, collections=['train'])
    tf.summary.scalar('v_loss', v_loss, collections=['train'])
    tf.summary.scalar('entropy', entropy, collections=['train'])
    tf.summary.scalar('s_values', tf.reduce_mean(s_values), collections=['train'])
    tf.summary.histogram('p_soft', tf.nn.softmax(p_logits), collections=['train'])
    tf.summary.histogram('p_log', p_log, collections=['train'])
    tf.summary.histogram('act_multn', act_multn, collections=['train'])
    tf.summary.histogram('p_logits', p_logits, collections=['train'])
    tf.summary.histogram('ret_ph', ret_ph, collections=['train'])
    tf.summary.histogram('rtg_ph', rtg_ph, collections=['train'])
    tf.summary.histogram('s_values', s_values, collections=['train'])
    train_summary = tf.summary.merge_all('train')

    tf.summary.scalar('old_v_loss', v_loss, collections=['pre_train'])
    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])
    pre_scalar_summary = tf.summary.merge_all('pre_train')

    hyp_str = '-steps_{}-aclr_{}-crlr_{}'.format(steps_per_epoch, ac_lr, cr_lr)
    file_writer = tf.summary.FileWriter('log_dir/{}/AC_{}_{}'.format(env_name, clock_time, hyp_str), tf.get_default_graph())
    
    # create a session
    sess = tf.Session()
    # initialize the variables
    sess.run(tf.global_variables_initializer())

    # few variables
    step_count = 0
    train_rewards = []
    train_ep_len = []
    timer = time.time()
    last_print_step = 0

    #Reset the environment at the beginning of the cycle
    obs = env.reset()
    ep_rews = []

    # main cycle
    for ep in range(num_epochs):

        # intiaizlie buffer and other variables for the new epochs
        buffer = Buffer(gamma)
        env_buf = []
        
        #iterate always over a fixed number of iterations
        for _ in range(steps_per_epoch):

            # run the policy
            act, val = sess.run([act_multn, s_values], feed_dict={obs_ph:[obs]})
            # take a step in the environment
            obs2, rew, done, _ = env.step(np.squeeze(act))

            # add the new transition
            env_buf.append([obs.copy(), rew, act, np.squeeze(val)])

            obs = obs2.copy()

            step_count += 1
            last_print_step += 1
            ep_rews.append(rew)

            if done:
                # store the trajectory just completed
                # Changed from REINFORCE! The second parameter is the estimated value of the next state. Because the environment is done. 
                # we pass a value of 0
                buffer.store(np.array(env_buf), 0)
                env_buf = []
                # store additionl information about the episode
                train_rewards.append(np.sum(ep_rews))
                train_ep_len.append(len(ep_rews))
                # reset the environment
                obs = env.reset()
                ep_rews = []

        # Bootstrap with the estimated state value of the next state!
        if len(env_buf) > 0:
            last_sv = sess.run(s_values, feed_dict={obs_ph:[obs]})
            buffer.store(np.array(env_buf), last_sv)

        # collect the episodes' information
        obs_batch, act_batch, ret_batch, rtg_batch = buffer.get_batch()
        
        # run pre_scalar_summary before the optimization phase
        old_p_loss, old_v_loss, epochs_summary = sess.run([p_loss, v_loss, pre_scalar_summary], feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})
        file_writer.add_summary(epochs_summary, step_count)

        # Optimize the actor and the critic
        sess.run([p_opt, v_opt], feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})

        # run train_summary to save the summary after the optimization
        new_p_loss, new_v_loss, train_summary_run = sess.run([p_loss, v_loss, train_summary], feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})
        file_writer.add_summary(train_summary_run, step_count)
        summary = tf.Summary()
        summary.value.add(tag='diff/p_loss', simple_value=(old_p_loss - new_p_loss))
        summary.value.add(tag='diff/v_loss', simple_value=(old_v_loss - new_v_loss))
        file_writer.add_summary(summary, step_count)
        file_writer.flush()

        # it's time to print some useful information
        if last_print_step > steps_to_print:
            print('Ep:%d MnRew:%.2f MxRew:%.1f EpLen:%.1f Buffer:%d -- Step:%d -- Time:%d' % (ep, np.mean(train_rewards), np.max(train_rewards), np.mean(train_ep_len), len(buffer), step_count,time.time()-timer))

            summary = tf.Summary()
            summary.value.add(tag='supplementary/len', simple_value=np.mean(train_ep_len))
            summary.value.add(tag='supplementary/train_rew', simple_value=np.mean(train_rewards))
            file_writer.add_summary(summary, step_count)
            file_writer.flush()

            timer = time.time()
            train_rewards = []
            train_ep_len = []
            last_print_step = 0

    env.close()
    file_writer.close()


if __name__ == '__main__':
    AC('LunarLander-v2', hidden_sizes=[64], ac_lr=4e-3, cr_lr=1.5e-2, gamma=0.99, steps_per_epoch=100, steps_to_print=5000, num_epochs=8000)


================================================
FILE: Chapter06/REINFORCE.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
import time


def mlp(x, hidden_layers, output_size, activation=tf.nn.relu, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_size, activation=last_activation)

def softmax_entropy(logits):
    '''
    Softmax Entropy
    '''
    return tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)


def discounted_rewards(rews, gamma):
    '''
    Discounted reward to go 

    Parameters:
    ----------
    rews: list of rewards
    gamma: discount value 
    '''
    rtg = np.zeros_like(rews, dtype=np.float32)
    rtg[-1] = rews[-1]
    for i in reversed(range(len(rews)-1)):
        rtg[i] = rews[i] + gamma*rtg[i+1]
    return rtg

class Buffer():
    '''
    Buffer class to store the experience from a unique policy
    '''
    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.obs = []
        self.act = []
        self.ret = []

    def store(self, temp_traj):
        '''
        Add temp_traj values to the buffers and compute the advantage and reward to go

        Parameters:
        -----------
        temp_traj: list where each element is a list that contains: observation, reward, action, state-value
        '''
        # store only if the temp_traj list is not empty
        if len(temp_traj) > 0:
            self.obs.extend(temp_traj[:,0])
            rtg = discounted_rewards(temp_traj[:,1], self.gamma)
            self.ret.extend(rtg)
            self.act.extend(temp_traj[:,2])

    def get_batch(self):
        b_ret = self.ret
        return self.obs, self.act, b_ret

    def __len__(self):
        assert(len(self.obs) == len(self.act) == len(self.ret))
        return len(self.obs)
    

def REINFORCE(env_name, hidden_sizes=[32], lr=5e-3, num_epochs=50, gamma=0.99, steps_per_epoch=100):
    '''
    REINFORCE Algorithm

    Parameters:
    -----------
    env_name: Name of the environment
    hidden_size: list of the number of hidden units for each layer
    lr: policy learning rate
    gamma: discount factor
    steps_per_epoch: number of steps per epoch
    num_epochs: number train epochs (Note: they aren't properly epochs)
    '''
    tf.reset_default_graph()

    env = gym.make(env_name)    

    
    obs_dim = env.observation_space.shape
    act_dim = env.action_space.n 

    # Placeholders
    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')
    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')

    ##################################################
    ########### COMPUTE THE LOSS FUNCTIONS ###########
    ##################################################


    # policy
    p_logits = mlp(obs_ph, hidden_sizes, act_dim, activation=tf.tanh)


    act_multn = tf.squeeze(tf.random.multinomial(p_logits, 1))
    actions_mask = tf.one_hot(act_ph, depth=act_dim)

    p_log = tf.reduce_sum(actions_mask * tf.nn.log_softmax(p_logits), axis=1)

    # entropy useful to study the algorithms
    entropy = -tf.reduce_mean(softmax_entropy(p_logits))
    p_loss = -tf.reduce_mean(p_log*ret_ph)

    # policy optimization
    p_opt = tf.train.AdamOptimizer(lr).minimize(p_loss)

    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, now.second)
    print('Time:', clock_time)


    # Set scalars and hisograms for TensorBoard
    tf.summary.scalar('p_loss', p_loss, collections=['train'])
    tf.summary.scalar('entropy', entropy, collections=['train'])
    tf.summary.histogram('p_soft', tf.nn.softmax(p_logits), collections=['train'])
    tf.summary.histogram('p_log', p_log, collections=['train'])
    tf.summary.histogram('act_multn', act_multn, collections=['train'])
    tf.summary.histogram('p_logits', p_logits, collections=['train'])
    tf.summary.histogram('ret_ph', ret_ph, collections=['train'])
    train_summary = tf.summary.merge_all('train')

    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])
    pre_scalar_summary = tf.summary.merge_all('pre_train')

    hyp_str = '-steps_{}-aclr_{}'.format(steps_per_epoch, lr)
    file_writer = tf.summary.FileWriter('log_dir/{}/REINFORCE_{}_{}'.format(env_name, clock_time, hyp_str), tf.get_default_graph())
    
    # create a session
    sess = tf.Session()
    # initialize the variables
    sess.run(tf.global_variables_initializer())

    # few variables
    step_count = 0
    train_rewards = []
    train_ep_len = []
    timer = time.time()

    # main cycle
    for ep in range(num_epochs):

        # initialize environment for the new epochs
        obs = env.reset()

        # intiaizlie buffer and other variables for the new epochs
        buffer = Buffer(gamma)
        env_buf = []
        ep_rews = []
        
        while len(buffer) < steps_per_epoch:

            # run the policy
            act = sess.run(act_multn, feed_dict={obs_ph:[obs]})
            # take a step in the environment
            obs2, rew, done, _ = env.step(np.squeeze(act))

            # add the new transition
            env_buf.append([obs.copy(), rew, act])

            obs = obs2.copy()

            step_count += 1
            ep_rews.append(rew)

            if done:
                # store the trajectory just completed
                buffer.store(np.array(env_buf))
                env_buf = []
                # store additionl information about the episode
                train_rewards.append(np.sum(ep_rews))
                train_ep_len.append(len(ep_rews))
                # reset the environment
                obs = env.reset()
                ep_rews = []

        # collect the episodes' information
        obs_batch, act_batch, ret_batch = buffer.get_batch()
        
        # run pre_scalar_summary before the optimization phase
        epochs_summary = sess.run(pre_scalar_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch})
        file_writer.add_summary(epochs_summary, step_count)

        # Optimize the policy
        sess.run(p_opt, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch})

        # run train_summary to save the summary after the optimization
        train_summary_run = sess.run(train_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch})
        file_writer.add_summary(train_summary_run, step_count)

        # it's time to print some useful information
        if ep % 10 == 0:
            print('Ep:%d MnRew:%.2f MxRew:%.1f EpLen:%.1f Buffer:%d -- Step:%d -- Time:%d' % (ep, np.mean(train_rewards), np.max(train_rewards), np.mean(train_ep_len), len(buffer), step_count,time.time()-timer))

            summary = tf.Summary()
            summary.value.add(tag='supplementary/len', simple_value=np.mean(train_ep_len))
            summary.value.add(tag='supplementary/train_rew', simple_value=np.mean(train_rewards))
            file_writer.add_summary(summary, step_count)
            file_writer.flush()

            timer = time.time()
            train_rewards = []
            train_ep_len = []


    env.close()
    file_writer.close()


if __name__ == '__main__':
    REINFORCE('LunarLander-v2', hidden_sizes=[64], lr=8e-3, gamma=0.99, num_epochs=1000, steps_per_epoch=1000)

================================================
FILE: Chapter06/REINFORCE_baseline.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
import time


def mlp(x, hidden_layers, output_size, activation=tf.nn.relu, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_size, activation=last_activation)

def softmax_entropy(logits):
    '''
    Softmax Entropy
    '''
    return tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)


def discounted_rewards(rews, gamma):
    '''
    Discounted reward to go 

    Parameters:
    ----------
    rews: list of rewards
    gamma: discount value 
    '''
    rtg = np.zeros_like(rews, dtype=np.float32)
    rtg[-1] = rews[-1]
    for i in reversed(range(len(rews)-1)):
        rtg[i] = rews[i] + gamma*rtg[i+1]
    return rtg

class Buffer():
    '''
    Buffer class to store the experience from a unique policy
    '''
    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.obs = []
        self.act = []
        self.ret = []
        self.rtg = []

    def store(self, temp_traj):
        '''
        Add temp_traj values to the buffers and compute the advantage and reward to go

        Parameters:
        -----------
        temp_traj: list where each element is a list that contains: observation, reward, action, state-value
        '''
        # store only if the temp_traj list is not empty
        if len(temp_traj) > 0:
            self.obs.extend(temp_traj[:,0])
            rtg = discounted_rewards(temp_traj[:,1], self.gamma)
            # NEW
            self.ret.extend(rtg - temp_traj[:,3])
            self.rtg.extend(rtg)
            self.act.extend(temp_traj[:,2])

    def get_batch(self):
        # MODIFIED
        return self.obs, self.act, self.ret, self.rtg

    def __len__(self):
        assert(len(self.obs) == len(self.act) == len(self.ret) == len(self.rtg))
        return len(self.obs)


def REINFORCE_baseline(env_name, hidden_sizes=[32], p_lr=5e-3, vf_lr=8e-3, gamma=0.99, steps_per_epoch=100, num_epochs=1000):
    '''
    REINFORCE with baseline Algorithm

    Parameters:
    -----------
    env_name: Name of the environment
    hidden_size: list of the number of hidden units for each layer
    p_lr: policy learning rate
    vf_lr: value function learning rate
    gamma: discount factor
    steps_per_epoch: number of steps per epoch
    num_epochs: number train epochs (Note: they aren't properly epochs)
    '''
    tf.reset_default_graph()

    env = gym.make(env_name)    
    
    obs_dim = env.observation_space.shape
    act_dim = env.action_space.n 

    # Placeholders
    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')
    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')
    rtg_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='rtg')

    #####################################################
    ########### COMPUTE THE PG LOSS FUNCTIONS ###########
    #####################################################

    # policy
    p_logits = mlp(obs_ph, hidden_sizes, act_dim, activation=tf.tanh)

    act_multn = tf.squeeze(tf.random.multinomial(p_logits, 1))
    actions_mask = tf.one_hot(act_ph, depth=act_dim)
    p_log = tf.reduce_sum(actions_mask * tf.nn.log_softmax(p_logits), axis=1)
    # entropy useful to study the algorithms
    entropy = -tf.reduce_mean(softmax_entropy(p_logits))
    p_loss = -tf.reduce_mean(p_log*ret_ph)

    # policy optimization
    p_opt = tf.train.AdamOptimizer(p_lr).minimize(p_loss)

    #######################################
    ###########  VALUE FUNCTION ###########
    #######################################
    
    ########### NEW ###########
    # value function
    s_values = tf.squeeze(mlp(obs_ph, hidden_sizes, 1, activation=tf.tanh))

    # MSE loss function
    v_loss = tf.reduce_mean((rtg_ph - s_values)**2)

    # value function optimization
    v_opt = tf.train.AdamOptimizer(vf_lr).minimize(v_loss)

    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, now.second)
    print('Time:', clock_time)


    # Set scalars and hisograms for TensorBoard
    tf.summary.scalar('p_loss', p_loss, collections=['train'])
    tf.summary.scalar('v_loss', v_loss, collections=['train'])
    tf.summary.scalar('entropy', entropy, collections=['train'])
    tf.summary.scalar('s_values', tf.reduce_mean(s_values), collections=['train'])
    tf.summary.histogram('p_soft', tf.nn.softmax(p_logits), collections=['train'])
    tf.summary.histogram('p_log', p_log, collections=['train'])
    tf.summary.histogram('act_multn', act_multn, collections=['train'])
    tf.summary.histogram('p_logits', p_logits, collections=['train'])
    tf.summary.histogram('ret_ph', ret_ph, collections=['train'])
    tf.summary.histogram('rtg_ph', rtg_ph, collections=['train'])
    tf.summary.histogram('s_values', s_values, collections=['train'])
    train_summary = tf.summary.merge_all('train')

    tf.summary.scalar('old_v_loss', v_loss, collections=['pre_train'])
    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])
    pre_scalar_summary = tf.summary.merge_all('pre_train')

    hyp_str = '-steps_{}-plr_{}-vflr_{}'.format(steps_per_epoch, p_lr, vf_lr)
    file_writer = tf.summary.FileWriter('log_dir/{}/REINFORCE_basel_{}_{}'.format(env_name, clock_time, hyp_str), tf.get_default_graph())
    
    # create a session
    sess = tf.Session()
    # initialize the variables
    sess.run(tf.global_variables_initializer())

    # few variables
    step_count = 0
    train_rewards = []
    train_ep_len = []
    timer = time.time()

    # main cycle
    for ep in range(num_epochs):

        # initialize environment for the new epochs
        obs = env.reset()

        # intiaizlie buffer and other variables for the new epochs
        buffer = Buffer(gamma)
        env_buf = []
        ep_rews = []
        
        while len(buffer) < steps_per_epoch:

            # run the policy
            act, val = sess.run([act_multn, s_values], feed_dict={obs_ph:[obs]})
            # take a step in the environment
            obs2, rew, done, _ = env.step(np.squeeze(act))

            # add the new transition
            env_buf.append([obs.copy(), rew, act, np.squeeze(val)])

            obs = obs2.copy()

            step_count += 1
            ep_rews.append(rew)

            if done:
                # store the trajectory just completed
                buffer.store(np.array(env_buf))
                env_buf = []
                # store additionl information about the episode
                train_rewards.append(np.sum(ep_rews))
                train_ep_len.append(len(ep_rews))
                # reset the environment
                obs = env.reset()
                ep_rews = []

        # collect the episodes' information
        obs_batch, act_batch, ret_batch, rtg_batch = buffer.get_batch()
        
        # run pre_scalar_summary before the optimization phase
        epochs_summary = sess.run(pre_scalar_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})
        file_writer.add_summary(epochs_summary, step_count)

        # Optimize the NN policy and the NN value function
        sess.run([p_opt, v_opt], feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})

        # run train_summary to save the summary after the optimization
        train_summary_run = sess.run(train_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})
        file_writer.add_summary(train_summary_run, step_count)

        # it's time to print some useful information
        if ep % 10 == 0:
            print('Ep:%d MnRew:%.2f MxRew:%.1f EpLen:%.1f Buffer:%d -- Step:%d -- Time:%d' % (ep, np.mean(train_rewards), np.max(train_rewards), np.mean(train_ep_len), len(buffer), step_count,time.time()-timer))

            summary = tf.Summary()
            summary.value.add(tag='supplementary/len', simple_value=np.mean(train_ep_len))
            summary.value.add(tag='supplementary/train_rew', simple_value=np.mean(train_rewards))
            file_writer.add_summary(summary, step_count)
            file_writer.flush()

            timer = time.time()
            train_rewards = []
            train_ep_len = []


    env.close()
    file_writer.close()


if __name__ == '__main__':
    REINFORCE_baseline('LunarLander-v2', hidden_sizes=[64], p_lr=8e-3, vf_lr=7e-3, gamma=0.99, steps_per_epoch=1000, num_epochs=1000)

================================================
FILE: Chapter07/PPO.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
import time
import roboschool

def mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_layer, activation=last_activation)

def softmax_entropy(logits):
    '''
    Softmax Entropy
    '''
    return -tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)

def clipped_surrogate_obj(new_p, old_p, adv, eps):
    '''
    Clipped surrogate objective function
    '''
    rt = tf.exp(new_p - old_p) # i.e. pi / old_pi
    return -tf.reduce_mean(tf.minimum(rt*adv, tf.clip_by_value(rt, 1-eps, 1+eps)*adv))

def GAE(rews, v, v_last, gamma=0.99, lam=0.95):
    '''
    Generalized Advantage Estimation
    '''
    assert len(rews) == len(v)
    vs = np.append(v, v_last)
    delta = np.array(rews) + gamma*vs[1:] - vs[:-1]
    gae_advantage = discounted_rewards(delta, 0, gamma*lam)
    return gae_advantage

def discounted_rewards(rews, last_sv, gamma):
    '''
    Discounted reward to go 

    Parameters:
    ----------
    rews: list of rewards
    last_sv: value of the last state
    gamma: discount value 
    '''
    rtg = np.zeros_like(rews, dtype=np.float32)
    rtg[-1] = rews[-1] + gamma*last_sv
    for i in reversed(range(len(rews)-1)):
        rtg[i] = rews[i] + gamma*rtg[i+1]
    return rtg


class StructEnv(gym.Wrapper):
    '''
    Gym Wrapper to store information like number of steps and total reward of the last espisode.
    '''
    def __init__(self, env):
        gym.Wrapper.__init__(self, env)
        self.n_obs = self.env.reset()
        self.rew_episode = 0
        self.len_episode = 0

    def reset(self, **kwargs):
        self.n_obs = self.env.reset(**kwargs)
        self.rew_episode = 0
        self.len_episode = 0
        return self.n_obs.copy()
        
    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.rew_episode += reward
        self.len_episode += 1
        return ob, reward, done, info

    def get_episode_reward(self):
        return self.rew_episode

    def get_episode_length(self):
        return self.len_episode

class Buffer():
    '''
    Class to store the experience from a unique policy
    '''
    def __init__(self, gamma=0.99, lam=0.95):
        self.gamma = gamma
        self.lam = lam
        self.adv = []
        self.ob = []
        self.ac = []
        self.rtg = []

    def store(self, temp_traj, last_sv):
        '''
        Add temp_traj values to the buffers and compute the advantage and reward to go

        Parameters:
        -----------
        temp_traj: list where each element is a list that contains: observation, reward, action, state-value
        last_sv: value of the last state (Used to Bootstrap)
        '''
        # store only if there are temporary trajectories
        if len(temp_traj) > 0:
            self.ob.extend(temp_traj[:,0])
            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)
            self.adv.extend(GAE(temp_traj[:,1], temp_traj[:,3], last_sv, self.gamma, self.lam))
            self.rtg.extend(rtg)
            self.ac.extend(temp_traj[:,2])

    def get_batch(self):
        # standardize the advantage values
        norm_adv = (self.adv - np.mean(self.adv)) / (np.std(self.adv) + 1e-10)
        return np.array(self.ob), np.array(self.ac), np.array(norm_adv), np.array(self.rtg)

    def __len__(self):
        assert(len(self.adv) == len(self.ob) == len(self.ac) == len(self.rtg))
        return len(self.ob)
    
def gaussian_log_likelihood(x, mean, log_std):
    '''
    Gaussian Log Likelihood 
    '''
    log_p = -0.5 *((x-mean)**2 / (tf.exp(log_std)**2+1e-9) + 2*log_std + np.log(2*np.pi))
    return tf.reduce_sum(log_p, axis=-1)

def PPO(env_name, hidden_sizes=[32], cr_lr=5e-3, ac_lr=5e-3, num_epochs=50, minibatch_size=5000, gamma=0.99, lam=0.95, number_envs=1, eps=0.1, 
        actor_iter=5, critic_iter=10, steps_per_env=100, action_type='Discrete'):
    '''
    Proximal Policy Optimization

    Parameters:
    -----------
    env_name: Name of the environment
    hidden_size: list of the number of hidden units for each layer
    ac_lr: actor learning rate
    cr_lr: critic learning rate
    num_epochs: number of training epochs
    minibatch_size: Batch size used to train the critic and actor
    gamma: discount factor
    lam: lambda parameter for computing the GAE
    number_envs: number of parallel synchronous environments
        # NB: it isn't distributed across multiple CPUs
    eps: Clip threshold. Max deviation from previous policy.
    actor_iter: Number of SGD iterations on the actor per epoch
    critic_iter: NUmber of SGD iterations on the critic per epoch
    steps_per_env: number of steps per environment
            # NB: the total number of steps per epoch will be: steps_per_env*number_envs
    action_type: class name of the action space: Either "Discrete' or "Box"
    '''

    tf.reset_default_graph()

    # Create some environments to collect the trajectories
    envs = [StructEnv(gym.make(env_name)) for _ in range(number_envs)]
    
    obs_dim = envs[0].observation_space.shape

    # Placeholders
    if action_type == 'Discrete':
        act_dim = envs[0].action_space.n 
        act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')

    elif action_type == 'Box':
        low_action_space = envs[0].action_space.low
        high_action_space = envs[0].action_space.high
        act_dim = envs[0].action_space.shape[0]
        act_ph = tf.placeholder(shape=(None,act_dim), dtype=tf.float32, name='act')

    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')
    adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='adv')
    old_p_log_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_p_log')

    # Computational graph for the policy in case of a continuous action space
    if action_type == 'Discrete':
        with tf.variable_scope('actor_nn'):
            p_logits = mlp(obs_ph, hidden_sizes, act_dim, tf.nn.relu, last_activation=tf.tanh)

        act_smp = tf.squeeze(tf.random.multinomial(p_logits, 1))
        act_onehot = tf.one_hot(act_ph, depth=act_dim)
        p_log = tf.reduce_sum(act_onehot * tf.nn.log_softmax(p_logits), axis=-1)
        
    # Computational graph for the policy in case of a continuous action space
    else:
        with tf.variable_scope('actor_nn'):
            p_logits = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)
            log_std = tf.get_variable(name='log_std', initializer=np.zeros(act_dim, dtype=np.float32)-0.5)
        
        # Add noise to the mean values predicted
        # The noise is proportional to the standard deviation
        p_noisy = p_logits + tf.random_normal(tf.shape(p_logits), 0, 1) * tf.exp(log_std)
        # Clip the noisy actions
        act_smp = tf.clip_by_value(p_noisy, low_action_space, high_action_space)
        # Compute the gaussian log likelihood
        p_log = gaussian_log_likelihood(act_ph, p_logits, log_std)

    # Nerual nework value function approximizer
    with tf.variable_scope('critic_nn'):
        s_values = mlp(obs_ph, hidden_sizes, 1, tf.tanh, last_activation=None)
        s_values = tf.squeeze(s_values)
            
    # PPO loss function
    p_loss = clipped_surrogate_obj(p_log, old_p_log_ph, adv_ph, eps)
    # MSE loss function
    v_loss = tf.reduce_mean((ret_ph - s_values)**2)

    # policy optimizer
    p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss)
    # value function optimizer
    v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)

    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, now.second)
    print('Time:', clock_time)

    # Set scalars and hisograms for TensorBoard
    tf.summary.scalar('p_loss', p_loss, collections=['train'])
    tf.summary.scalar('v_loss', v_loss, collections=['train'])
    tf.summary.scalar('s_values_m', tf.reduce_mean(s_values), collections=['train'])

    if action_type == 'Box':
        tf.summary.scalar('p_std', tf.reduce_mean(tf.exp(log_std)), collections=['train'])
        tf.summary.histogram('log_std',log_std, collections=['train'])
    tf.summary.histogram('p_log', p_log, collections=['train'])
    tf.summary.histogram('p_logits', p_logits, collections=['train'])
    tf.summary.histogram('s_values', s_values, collections=['train'])
    tf.summary.histogram('adv_ph',adv_ph, collections=['train'])
    scalar_summary = tf.summary.merge_all('train')

    # .. summary to run before the optimization steps
    tf.summary.scalar('old_v_loss', v_loss, collections=['pre_train'])
    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])
    pre_scalar_summary = tf.summary.merge_all('pre_train')

    hyp_str = '-bs_'+str(minibatch_size)+'-envs_'+str(number_envs)+'-ac_lr_'+str(ac_lr)+'-cr_lr'+str(cr_lr)+'-act_it_'+str(actor_iter)+'-crit_it_'+str(critic_iter)
    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/PPO_'+clock_time+'_'+hyp_str, tf.get_default_graph())
    
    # create a session
    sess = tf.Session()
    # initialize the variables
    sess.run(tf.global_variables_initializer())
    
    # variable to store the total number of steps
    step_count = 0
    
    print('Env batch size:',steps_per_env, ' Batch size:',steps_per_env*number_envs)

    for ep in range(num_epochs):
        # Create the buffer that will contain the trajectories (full or partial) 
        # run with the last policy
        buffer = Buffer(gamma, lam)
        # lists to store rewards and length of the trajectories completed
        batch_rew = []
        batch_len = []

        # Execute in serial the environments, storing temporarily the trajectories. 
        for env in envs:
            temp_buf = []

            #iterate over a fixed number of steps
            for _ in range(steps_per_env):

                # run the policy
                act, val = sess.run([act_smp, s_values], feed_dict={obs_ph:[env.n_obs]})
                act = np.squeeze(act)

                # take a step in the environment
                obs2, rew, done, _ = env.step(act)
                
                # add the new transition to the temporary buffer
                temp_buf.append([env.n_obs.copy(), rew, act, np.squeeze(val)])

                env.n_obs = obs2.copy()
                step_count += 1

                if done:
                    # Store the full trajectory in the buffer 
                    # (the value of the last state is 0 as the trajectory is completed)
                    buffer.store(np.array(temp_buf), 0)

                    # Empty temporary buffer
                    temp_buf = []
                    
                    batch_rew.append(env.get_episode_reward())
                    batch_len.append(env.get_episode_length())
                    
                    # reset the environment
                    env.reset()                 

            # Bootstrap with the estimated state value of the next state!
            last_v = sess.run(s_values, feed_dict={obs_ph:[env.n_obs]})
            buffer.store(np.array(temp_buf), np.squeeze(last_v))


        # Gather the entire batch from the buffer
        # NB: all the batch is used and deleted after the optimization. That is because PPO is on-policy
        obs_batch, act_batch, adv_batch, rtg_batch = buffer.get_batch()

        old_p_log = sess.run(p_log, feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})
        old_p_batch = np.array(old_p_log)

        summary = sess.run(pre_scalar_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_batch})
        file_writer.add_summary(summary, step_count)

        lb = len(buffer)
        shuffled_batch = np.arange(lb)    
        
        # Policy optimization steps
        for _ in range(actor_iter):
            # shuffle the batch on every iteration
            np.random.shuffle(shuffled_batch)
            for idx in range(0,lb, minibatch_size):
                minib = shuffled_batch[idx:min(idx+minibatch_size,lb)]
                sess.run(p_opt, feed_dict={obs_ph:obs_batch[minib], act_ph:act_batch[minib], adv_ph:adv_batch[minib], old_p_log_ph:old_p_batch[minib]})

        # Value function optimization steps
        for _ in range(critic_iter):
            # shuffle the batch on every iteration
            np.random.shuffle(shuffled_batch)
            for idx in range(0,lb, minibatch_size):
                minib = shuffled_batch[idx:min(idx+minibatch_size,lb)]
                sess.run(v_opt, feed_dict={obs_ph:obs_batch[minib], ret_ph:rtg_batch[minib]})
                

        # print some statistics and run the summary for visualizing it on TB
        if len(batch_rew) > 0:           
            train_summary = sess.run(scalar_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, 
                                                                old_p_log_ph:old_p_batch, ret_ph:rtg_batch})
            file_writer.add_summary(train_summary, step_count)

            summary = tf.Summary()
            summary.value.add(tag='supplementary/performance', simple_value=np.mean(batch_rew))
            summary.value.add(tag='supplementary/len', simple_value=np.mean(batch_len))
            file_writer.add_summary(summary, step_count)
            file_writer.flush()
        
            print('Ep:%d Rew:%.2f -- Step:%d' % (ep, np.mean(batch_rew), step_count))

    # closing environments..
    for env in envs:
        env.close()

    # Close the writer
    file_writer.close()


if __name__ == '__main__':
    PPO('RoboschoolWalker2d-v1', hidden_sizes=[64,64], cr_lr=5e-4, ac_lr=2e-4, gamma=0.99, lam=0.95, steps_per_env=5000, 
        number_envs=1, eps=0.15, actor_iter=6, critic_iter=10, action_type='Box', num_epochs=5000, minibatch_size=256)
      

================================================
FILE: Chapter07/TRPO.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
import roboschool

def mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_layer, activation=last_activation)

def softmax_entropy(logits):
    '''
    Softmax Entropy
    '''
    return -tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)


def gaussian_log_likelihood(ac, mean, log_std):
    '''
    Gaussian Log Likelihood 
    '''
    log_p = ((ac-mean)**2 / (tf.exp(log_std)**2+1e-9) + 2*log_std) + np.log(2*np.pi)
    return -0.5 * tf.reduce_sum(log_p, axis=-1)


def conjugate_gradient(A, b, x=None, iters=10):
    '''
    Conjugate gradient method: approximate the solution of Ax=b
    It solve Ax=b without forming the full matrix, just compute the matrix-vector product (The Fisher-vector product)
    
    NB: A is not the full matrix but is a useful matrix-vector product between the averaged Fisher information matrix and arbitrary vectors 
    Descibed in Appendix C.1 of the TRPO paper
    '''
    if x is None:
        x = np.zeros_like(b)
        
    r = A(x) - b
    p = -r
    for _ in range(iters):
        a = np.dot(r, r) / (np.dot(p, A(p))+1e-8)
        x += a*p
        r_n = r + a*A(p)
        b = np.dot(r_n, r_n) / (np.dot(r, r)+1e-8)
        p = -r_n + b*p
        r = r_n
    return x

def gaussian_DKL(mu_q, log_std_q, mu_p, log_std_p):
    '''
    Gaussian KL divergence in case of a diagonal covariance matrix
    '''
    return tf.reduce_mean(tf.reduce_sum(0.5 * (log_std_p - log_std_q + tf.exp(log_std_q - log_std_p) + (mu_q - mu_p)**2 / tf.exp(log_std_p) - 1), axis=1))


def backtracking_line_search(Dkl, delta, old_loss, p=0.8):
    '''
    Backtracking line searc. It look for a coefficient s.t. the constraint on the DKL is satisfied
    It has both to
     - improve the non-linear objective
     - satisfy the constraint

    '''
    ## Explained in Appendix C of the TRPO paper
    a = 1
    it = 0
 
    new_dkl, new_loss = Dkl(a) 
    while (new_dkl > delta) or (new_loss > old_loss):
        a *= p
        it += 1
        new_dkl, new_loss = Dkl(a)

    return a


def GAE(rews, v, v_last, gamma=0.99, lam=0.95):
    '''
    Generalized Advantage Estimation
    '''
    assert len(rews) == len(v)
    vs = np.append(v, v_last)
    d = np.array(rews) + gamma*vs[1:] - vs[:-1]
    gae_advantage = discounted_rewards(d, 0, gamma*lam)
    return gae_advantage

def discounted_rewards(rews, last_sv, gamma):
    '''
    Discounted reward to go 

    Parameters:
    ----------
    rews: list of rewards
    last_sv: value of the last state
    gamma: discount value 
    '''
    rtg = np.zeros_like(rews, dtype=np.float32)
    rtg[-1] = rews[-1] + gamma*last_sv
    for i in reversed(range(len(rews)-1)):
        rtg[i] = rews[i] + gamma*rtg[i+1]
    return rtg

class Buffer():
    '''
    Class to store the experience from a unique policy
    '''
    def __init__(self, gamma=0.99, lam=0.95):
        self.gamma = gamma
        self.lam = lam
        self.adv = []
        self.ob = []
        self.ac = []
        self.rtg = []

    def store(self, temp_traj, last_sv):
        '''
        Add temp_traj values to the buffers and compute the advantage and reward to go

        Parameters:
        -----------
        temp_traj: list where each element is a list that contains: observation, reward, action, state-value
        last_sv: value of the last state (Used to Bootstrap)
        '''
        # store only if there are temporary trajectories
        if len(temp_traj) > 0:
            self.ob.extend(temp_traj[:,0])
            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)
            self.adv.extend(GAE(temp_traj[:,1], temp_traj[:,3], last_sv, self.gamma, self.lam))
            self.rtg.extend(rtg)
            self.ac.extend(temp_traj[:,2])

    def get_batch(self):
        # standardize the advantage values
        norm_adv = (self.adv - np.mean(self.adv)) / (np.std(self.adv) + 1e-10)
        return np.array(self.ob), np.array(self.ac), np.array(norm_adv), np.array(self.rtg)

    def __len__(self):
        assert(len(self.adv) == len(self.ob) == len(self.ac) == len(self.rtg))
        return len(self.ob)

def flatten_list(tensor_list):
    '''
    Flatten a list of tensors
    '''
    return tf.concat([flatten(t) for t in tensor_list], axis=0)

def flatten(tensor):
    '''
    Flatten a tensor
    '''
    return tf.reshape(tensor, shape=(-1,))


class StructEnv(gym.Wrapper):
    '''
    Gym Wrapper to store information like number of steps and total reward of the last espisode.
    '''
    def __init__(self, env):
        gym.Wrapper.__init__(self, env)
        self.n_obs = self.env.reset()
        self.total_rew = 0
        self.len_episode = 0

    def reset(self, **kwargs):
        self.n_obs = self.env.reset(**kwargs)
        self.total_rew = 0
        self.len_episode = 0
        return self.n_obs.copy()
        
    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.total_rew += reward
        self.len_episode += 1
        return ob, reward, done, info

    def get_episode_reward(self):
        return self.total_rew

    def get_episode_length(self):
        return self.len_episode


def TRPO(env_name, hidden_sizes=[32], cr_lr=5e-3, num_epochs=50, gamma=0.99, lam=0.95, number_envs=1, 
        critic_iter=10, steps_per_env=100, delta=0.002, algorithm='TRPO', conj_iters=10, minibatch_size=1000):
    '''
    Trust Region Policy Optimization

    Parameters:
    -----------
    env_name: Name of the environment
    hidden_sizes: list of the number of hidden units for each layer
    cr_lr: critic learning rate
    num_epochs: number of training epochs
    gamma: discount factor
    lam: lambda parameter for computing the GAE
    number_envs: number of "parallel" synchronous environments
        # NB: it isn't distributed across multiple CPUs
    critic_iter: NUmber of SGD iterations on the critic per epoch
    steps_per_env: number of steps per environment
            # NB: the total number of steps per epoch will be: steps_per_env*number_envs
    delta: Maximum KL divergence between two policies. Scalar value
    algorithm: type of algorithm. Either 'TRPO' or 'NPO'
    conj_iters: number of conjugate gradient iterations
    minibatch_size: Batch size used to train the critic
    '''

    tf.reset_default_graph()

    # Create a few environments to collect the trajectories
    envs = [StructEnv(gym.make(env_name)) for _ in range(number_envs)]

    low_action_space = envs[0].action_space.low
    high_action_space = envs[0].action_space.high

    obs_dim = envs[0].observation_space.shape
    act_dim = envs[0].action_space.shape[0]

    # Placeholders
    act_ph = tf.placeholder(shape=(None,act_dim), dtype=tf.float32, name='act')
    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')
    adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='adv')
    old_p_log_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_p_log')
    old_mu_ph = tf.placeholder(shape=(None, act_dim), dtype=tf.float32, name='old_mu')
    old_log_std_ph = tf.placeholder(shape=(act_dim), dtype=tf.float32, name='old_log_std')
    p_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='p_ph')
    # result of the conjugate gradient algorithm
    cg_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='cg')
        
    # Neural network that represent the policy
    with tf.variable_scope('actor_nn'):
        p_means = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)
        log_std = tf.get_variable(name='log_std', initializer=np.zeros(act_dim, dtype=np.float32) - 0.5)

    # Neural network that represent the value function
    with tf.variable_scope('critic_nn'):
        s_values = mlp(obs_ph, hidden_sizes, 1, tf.tanh, last_activation=None)
        s_values = tf.squeeze(s_values)    

    # Add "noise" to the predicted mean following the Guassian distribution with standard deviation e^(log_std)
    p_noisy = p_means + tf.random_normal(tf.shape(p_means), 0, 1) * tf.exp(log_std)
    # Clip the noisy actions
    a_sampl = tf.clip_by_value(p_noisy, low_action_space, high_action_space)
    # Compute the gaussian log likelihood
    p_log = gaussian_log_likelihood(act_ph, p_means, log_std)

    # Measure the divergence
    diverg = tf.reduce_mean(tf.exp(old_p_log_ph - p_log))
    
    # ratio
    ratio_new_old = tf.exp(p_log - old_p_log_ph)
    # TRPO surrogate loss function
    p_loss = - tf.reduce_mean(ratio_new_old * adv_ph)

    # MSE loss function
    v_loss = tf.reduce_mean((ret_ph - s_values)**2)
    # Critic optimization
    v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)

    def variables_in_scope(scope):
        # get all trainable variables in 'scope'
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
    
    # Gather and flatten the actor parameters
    p_variables = variables_in_scope('actor_nn')
    p_var_flatten = flatten_list(p_variables)

    # Gradient of the policy loss with respect to the actor parameters
    p_grads = tf.gradients(p_loss, p_variables)
    p_grads_flatten = flatten_list(p_grads)

    ########### RESTORE ACTOR PARAMETERS ###########
    p_old_variables = tf.placeholder(shape=(None,), dtype=tf.float32, name='p_old_variables')
    # variable used as index for restoring the actor's parameters
    it_v1 = tf.Variable(0, trainable=False)
    restore_params = []

    for p_v in p_variables:
        upd_rsh = tf.reshape(p_old_variables[it_v1 : it_v1+tf.reduce_prod(p_v.shape)], shape=p_v.shape)
        restore_params.append(p_v.assign(upd_rsh)) 
        it_v1 += tf.reduce_prod(p_v.shape)

    restore_params = tf.group(*restore_params)

    # gaussian KL divergence of the two policies 
    dkl_diverg = gaussian_DKL(old_mu_ph, old_log_std_ph, p_means, log_std) 

    # Jacobian of the KL divergence (Needed for the Fisher matrix-vector product)
    dkl_diverg_grad = tf.gradients(dkl_diverg, p_variables) 

    dkl_matrix_product = tf.reduce_sum(flatten_list(dkl_diverg_grad) * p_ph)
    print('dkl_matrix_product', dkl_matrix_product.shape)
    # Fisher vector product
    # The Fisher-vector product is a way to compute the A matrix without the need of the full A
    Fx = flatten_list(tf.gradients(dkl_matrix_product, p_variables))

    ## Step length
    beta_ph = tf.placeholder(shape=(), dtype=tf.float32, name='beta')
    # NPG update
    npg_update = beta_ph * cg_ph
    
    ## alpha is found through line search
    alpha = tf.Variable(1., trainable=False)
    # TRPO update
    trpo_update = alpha * npg_update

    ####################   POLICY UPDATE  ###################
    # variable used as an index
    it_v = tf.Variable(0, trainable=False)
    p_opt = []
    # Apply the updates to the policy
    for p_v in p_variables:
        upd_rsh = tf.reshape(trpo_update[it_v : it_v+tf.reduce_prod(p_v.shape)], shape=p_v.shape)
        p_opt.append(p_v.assign_sub(upd_rsh))
        it_v += tf.reduce_prod(p_v.shape)

    p_opt = tf.group(*p_opt)
        
    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, now.second)
    print('Time:', clock_time)


    # Set scalars and hisograms for TensorBoard
    tf.summary.scalar('p_loss', p_loss, collections=['train'])
    tf.summary.scalar('v_loss', v_loss, collections=['train'])
    tf.summary.scalar('p_divergence', diverg, collections=['train'])
    tf.summary.scalar('ratio_new_old',tf.reduce_mean(ratio_new_old), collections=['train'])
    tf.summary.scalar('dkl_diverg', dkl_diverg, collections=['train'])
    tf.summary.scalar('alpha', alpha, collections=['train'])
    tf.summary.scalar('beta', beta_ph, collections=['train'])
    tf.summary.scalar('p_std_mn', tf.reduce_mean(tf.exp(log_std)), collections=['train'])
    tf.summary.scalar('s_values_mn', tf.reduce_mean(s_values), collections=['train'])
    tf.summary.histogram('p_log', p_log, collections=['train'])
    tf.summary.histogram('p_means', p_means, collections=['train'])
    tf.summary.histogram('s_values', s_values, collections=['train'])
    tf.summary.histogram('adv_ph',adv_ph, collections=['train'])
    tf.summary.histogram('log_std',log_std, collections=['train'])
    scalar_summary = tf.summary.merge_all('train')

    tf.summary.scalar('old_v_loss', v_loss, collections=['pre_train'])
    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])
    pre_scalar_summary = tf.summary.merge_all('pre_train')

    hyp_str = '-spe_'+str(steps_per_env)+'-envs_'+str(number_envs)+'-cr_lr'+str(cr_lr)+'-crit_it_'+str(critic_iter)+'-delta_'+str(delta)+'-conj_iters_'+str(conj_iters)
    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/'+algorithm+'_'+clock_time+'_'+hyp_str, tf.get_default_graph())
    
    # create a session
    sess = tf.Session()
    # initialize the variables
    sess.run(tf.global_variables_initializer())
    
    # variable to store the total number of steps
    step_count = 0
    
    print('Env batch size:',steps_per_env, ' Batch size:',steps_per_env*number_envs)

    for ep in range(num_epochs):
        # Create the buffer that will contain the trajectories (full or partial) 
        # run with the last policy
        buffer = Buffer(gamma, lam)
        # lists to store rewards and length of the trajectories completed
        batch_rew = []
        batch_len = []

        # Execute in serial the environment, storing temporarily the trajectories.
        for env in envs:
            temp_buf = []

            # iterate over a fixed number of steps
            for _ in range(steps_per_env):
                # run the policy
                act, val = sess.run([a_sampl, s_values], feed_dict={obs_ph:[env.n_obs]})
                act = np.squeeze(act)

                # take a step in the environment
                obs2, rew, done, _ = env.step(act)

                # add the new transition to the temporary buffer
                temp_buf.append([env.n_obs.copy(), rew, act, np.squeeze(val)])

                env.n_obs = obs2.copy()
                step_count += 1

                if done:
                    # Store the full trajectory in the buffer 
                    # (the value of the last state is 0 as the trajectory is completed)
                    buffer.store(np.array(temp_buf), 0)
                    # Empty temporary buffer
                    temp_buf = []

                    batch_rew.append(env.get_episode_reward())
                    batch_len.append(env.get_episode_length())

                    env.reset()
                    
            # Bootstrap with the estimated state value of the next state!
            lsv = sess.run(s_values, feed_dict={obs_ph:[env.n_obs]})
            buffer.store(np.array(temp_buf), np.squeeze(lsv))


        # Get the entire batch from the buffer
        # NB: all the batch is used and deleted after the optimization. This is because PPO is on-policy
        obs_batch, act_batch, adv_batch, rtg_batch = buffer.get_batch()

        # log probabilities, logits and log std of the "old" policy
        # "old" policy refer to the policy to optimize and that has been used to sample from the environment
        old_p_log, old_p_means, old_log_std = sess.run([p_log, p_means, log_std], feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})
        # get also the "old" parameters
        old_actor_params = sess.run(p_var_flatten)

        # old_p_loss is later used in the line search
        # run pre_scalar_summary for a summary before the optimization
        old_p_loss, summary = sess.run([p_loss,pre_scalar_summary], feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})
        file_writer.add_summary(summary, step_count)

        def H_f(p):
            '''
            Run the Fisher-Vector product on 'p' to approximate the Hessian of the DKL
            '''
            return sess.run(Fx, feed_dict={old_mu_ph:old_p_means, old_log_std_ph:old_log_std, p_ph:p, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})

        g_f = sess.run(p_grads_flatten, feed_dict={old_mu_ph:old_p_means,obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})
        ## Compute the Conjugate Gradient so to obtain an approximation of H^(-1)*g
        # Where H in reality isn't the true Hessian of the KL divergence but an approximation of it computed via Fisher-Vector Product (F)
        conj_grad = conjugate_gradient(H_f, g_f, iters=conj_iters)

        # Compute the step length
        beta_np = np.sqrt(2*delta / np.sum(conj_grad * H_f(conj_grad)))
        
        def DKL(alpha_v):
            '''
            Compute the KL divergence.
            It optimize the function to compute the DKL. Afterwards it restore the old parameters.
            '''
            sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:alpha_v, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})
            a_res = sess.run([dkl_diverg, p_loss], feed_dict={old_mu_ph:old_p_means, old_log_std_ph:old_log_std, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})
            sess.run(restore_params, feed_dict={p_old_variables: old_actor_params})
            return a_res

        # Actor optimization step
        # Different for TRPO or NPG
        if algorithm=='TRPO':
            # Backtracing line search to find the maximum alpha coefficient s.t. the constraint is valid
            best_alpha = backtracking_line_search(DKL, delta, old_p_loss, p=0.8)
            sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:best_alpha, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})
        elif algorithm=='NPG':
            # In case of NPG, no line search
            sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:1, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})


        lb = len(buffer)
        shuffled_batch = np.arange(lb)
        np.random.shuffle(shuffled_batch)

        # Value function optimization steps
        for _ in range(critic_iter):
            # shuffle the batch on every iteration
            np.random.shuffle(shuffled_batch)
            for idx in range(0,lb, minibatch_size):
                minib = shuffled_batch[idx:min(idx+minibatch_size,lb)]
                sess.run(v_opt, feed_dict={obs_ph:obs_batch[minib], ret_ph:rtg_batch[minib]})

        # print some statistics and run the summary for visualizing it on TB
        if len(batch_rew) > 0:
            train_summary = sess.run(scalar_summary, feed_dict={beta_ph:beta_np, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, cg_ph:conj_grad,
                                                                old_p_log_ph:old_p_log, ret_ph:rtg_batch, old_mu_ph:old_p_means, old_log_std_ph:old_log_std})
            file_writer.add_summary(train_summary, step_count)
            
            summary = tf.Summary()
            summary.value.add(tag='supplementary/performance', simple_value=np.mean(batch_rew))
            summary.value.add(tag='supplementary/len', simple_value=np.mean(batch_len))
            file_writer.add_summary(summary, step_count)
            file_writer.flush()

            print('Ep:%d Rew:%.2f -- Step:%d' % (ep, np.mean(batch_rew), step_count))

    # closing environments..
    for env in envs:
        env.close()

    file_writer.close()

if __name__ == '__main__':
    TRPO('RoboschoolWalker2d-v1', hidden_sizes=[64,64], cr_lr=2e-3, gamma=0.99, lam=0.95, num_epochs=1000, steps_per_env=6000, 
         number_envs=1, critic_iter=10, delta=0.01, algorithm='TRPO', conj_iters=10, minibatch_size=1000)


================================================
FILE: Chapter08/DDPG.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
from collections import deque
import time

current_milli_time = lambda: int(round(time.time() * 1000))

def mlp(x, hidden_layers, output_layer, activation=tf.nn.relu, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_layer, activation=last_activation)

def deterministic_actor_critic(x, a, hidden_sizes, act_dim, max_act):
    '''
    Deterministic Actor-Critic
    '''
    # Actor
    with tf.variable_scope('p_mlp'):
        p_means = max_act * mlp(x, hidden_sizes, act_dim, last_activation=tf.tanh) 
    
    # Critic with as input the deterministic action of the actor
    with tf.variable_scope('q_mlp'):
        q_d = mlp(tf.concat([x,p_means], axis=-1), hidden_sizes, 1, last_activation=None) 
    
    # Critic with as input an arbirtary action
    with tf.variable_scope('q_mlp', reuse=True): # Use the weights of the mlp just defined
        q_a = mlp(tf.concat([x,a], axis=-1), hidden_sizes, 1, last_activation=None)

    return p_means, tf.squeeze(q_d), tf.squeeze(q_a)

class ExperiencedBuffer():
    '''
    Experienced buffer
    '''
    def __init__(self, buffer_size):
        # Contains up to 'buffer_size' experience
        self.obs_buf = deque(maxlen=buffer_size)
        self.rew_buf = deque(maxlen=buffer_size)
        self.act_buf = deque(maxlen=buffer_size)
        self.obs2_buf = deque(maxlen=buffer_size)
        self.done_buf = deque(maxlen=buffer_size)


    def add(self, obs, rew, act, obs2, done):
        '''
        Add a new transition to the buffers
        '''
        self.obs_buf.append(obs)
        self.rew_buf.append(rew)
        self.act_buf.append(act)
        self.obs2_buf.append(obs2)
        self.done_buf.append(done)
        

    def sample_minibatch(self, batch_size):
        '''
        Sample a mini-batch of size 'batch_size'
        '''
        mb_indices = np.random.randint(len(self.obs_buf), size=batch_size)

        mb_obs = [self.obs_buf[i] for i in mb_indices]
        mb_rew = [self.rew_buf[i] for i in mb_indices]
        mb_act = [self.act_buf[i] for i in mb_indices]
        mb_obs2 = [self.obs2_buf[i] for i in mb_indices]
        mb_done = [self.done_buf[i] for i in mb_indices]

        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done

    def __len__(self):
        return len(self.obs_buf)

def test_agent(env_test, agent_op, num_games=10):
    '''
    Test an agent 'agent_op', 'num_games' times
    Return mean and std
    '''
    games_r = []
    for _ in range(num_games):
        d = False
        game_r = 0
        o = env_test.reset()

        while not d:
            a_s = agent_op(o)
            o, r, d, _ = env_test.step(a_s)
            game_r += r

        games_r.append(game_r)
    return np.mean(games_r), np.std(games_r)


def DDPG(env_name, hidden_sizes=[32], ac_lr=1e-2, cr_lr=1e-2, num_epochs=2000, buffer_size=5000, discount=0.99, render_cycle=100, mean_summaries_steps=1000, 
        batch_size=128, min_buffer_size=5000, tau=0.005):

    # Create an environment for training
    env = gym.make(env_name)
    # Create an environment for testing the actor
    env_test = gym.make(env_name)

    tf.reset_default_graph()

    obs_dim = env.observation_space.shape
    act_dim = env.action_space.shape
    print('-- Observation space:', obs_dim, ' Action space:', act_dim, '--')

    # Create some placeholders
    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None, act_dim[0]), dtype=tf.float32, name='act')
    y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')

    # Create an online deterministic actor-critic 
    with tf.variable_scope('online'):
        p_onl, qd_onl, qa_onl = deterministic_actor_critic(obs_ph, act_ph, hidden_sizes, act_dim[0], np.max(env.action_space.high))
    # and a target one
    with tf.variable_scope('target'):
        _, qd_tar, _ = deterministic_actor_critic(obs_ph, act_ph, hidden_sizes, act_dim[0], np.max(env.action_space.high))

    def variables_in_scope(scope):
        '''
        Retrieve all the variables in the scope 'scope'
        '''
        return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope)

    # Copy all the online variables to the target networks i.e. target = online
    # Needed only at the beginning
    init_target = [target_var.assign(online_var) for target_var, online_var in zip(variables_in_scope('target'), variables_in_scope('online'))]
    init_target_op = tf.group(*init_target)

    # Soft update
    update_target = [target_var.assign(tau*online_var + (1-tau)*target_var) for target_var, online_var in zip(variables_in_scope('target'), variables_in_scope('online'))]
    update_target_op = tf.group(*update_target)

    # Critic loss (MSE)
    q_loss = tf.reduce_mean((qa_onl - y_ph)**2) 
    # Actor loss
    p_loss = -tf.reduce_mean(qd_onl)

    # Optimize the critic
    q_opt = tf.train.AdamOptimizer(cr_lr).minimize(q_loss)
    # Optimize the actor
    p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss, var_list=variables_in_scope('online/p_mlp'))


    def agent_op(o):
        a = np.squeeze(sess.run(p_onl, feed_dict={obs_ph:[o]}))
        return np.clip(a, env.action_space.low, env.action_space.high)

    def agent_noisy_op(o, scale):
        action = agent_op(o)
        noisy_action = action + np.random.normal(loc=0.0, scale=scale, size=action.shape)
        return np.clip(noisy_action, env.action_space.low, env.action_space.high)


    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, int(now.second))
    print('Time:', clock_time)


    # Set TensorBoard
    tf.summary.scalar('loss/q', q_loss)
    tf.summary.scalar('loss/p', p_loss)
    scalar_summary = tf.summary.merge_all()

    hyp_str = '-aclr_'+str(ac_lr)+'-crlr_'+str(cr_lr)+'-tau_'+str(tau)

    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/DDPG_'+clock_time+'_'+hyp_str, tf.get_default_graph())

    # Create a session and initialize the variables
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    sess.run(init_target_op)
    
    # Some useful variables..
    render_the_game = False
    step_count = 0
    last_q_update_loss = []
    last_p_update_loss = []
    ep_time = current_milli_time()
    batch_rew = []

    # Reset the environment
    obs = env.reset()
    # Initialize the buffer
    buffer = ExperiencedBuffer(buffer_size)


    for ep in range(num_epochs):
        g_rew = 0
        done = False

        while not done:
            # If not gathered enough experience yet, act randomly
            if len(buffer) < min_buffer_size:
                act = env.action_space.sample()
            else:
                act = agent_noisy_op(obs, 0.1)

            # Take a step in the environment
            obs2, rew, done, _ = env.step(act)

            if render_the_game:
                env.render()

            # Add the transition in the buffer
            buffer.add(obs.copy(), rew, act, obs2.copy(), done)

            obs = obs2
            g_rew += rew
            step_count += 1

            if len(buffer) > min_buffer_size:
                # sample a mini batch from the buffer
                mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)

                # Compute the target values
                q_target_mb = sess.run(qd_tar, feed_dict={obs_ph:mb_obs2})
                y_r = np.array(mb_rew) + discount*(1-np.array(mb_done))*q_target_mb

                # optimize the critic
                train_summary, _, q_train_loss = sess.run([scalar_summary, q_opt, q_loss], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph: mb_act})
                
                # optimize the actor
                _, p_train_loss = sess.run([p_opt, p_loss], feed_dict={obs_ph:mb_obs})

                # summaries..
                file_writer.add_summary(train_summary, step_count)
                last_q_update_loss.append(q_train_loss)
                last_p_update_loss.append(p_train_loss)

                # Soft update of the target networks
                sess.run(update_target_op)

                # some 'mean' summaries to plot more smooth functions
                if step_count % mean_summaries_steps == 0:
                    summary = tf.Summary()
                    summary.value.add(tag='loss/mean_q', simple_value=np.mean(last_q_update_loss))
                    summary.value.add(tag='loss/mean_p', simple_value=np.mean(last_p_update_loss))
                    file_writer.add_summary(summary, step_count)
                    file_writer.flush()

                    last_q_update_loss = []
                    last_p_update_loss = []


            if done:
                obs = env.reset()
                batch_rew.append(g_rew)
                g_rew, render_the_game = 0, False

        # Test the actor every 10 epochs
        if ep % 10 == 0:
            test_mn_rw, test_std_rw = test_agent(env_test, agent_op)

            summary = tf.Summary()
            summary.value.add(tag='test/reward', simple_value=test_mn_rw)
            file_writer.add_summary(summary, step_count)
            file_writer.flush()

            ep_sec_time = int((current_milli_time()-ep_time) / 1000)
            print('Ep:%4d Rew:%4.2f -- Step:%5d -- Test:%4.2f %4.2f -- Time:%d' %  (ep,np.mean(batch_rew), step_count, test_mn_rw, test_std_rw, ep_sec_time))

            ep_time = current_milli_time()
            batch_rew = []
                
        if ep % render_cycle == 0:
            render_the_game = True

    # close everything
    file_writer.close()
    env.close()
    env_test.close()


if __name__ == '__main__':
    DDPG('BipedalWalker-v2', hidden_sizes=[64,64], ac_lr=3e-4, cr_lr=4e-4, buffer_size=200000, mean_summaries_steps=100, batch_size=64, 
        min_buffer_size=10000, tau=0.003)
    

================================================
FILE: Chapter08/TD3.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
from collections import deque
import time

current_milli_time = lambda: int(round(time.time() * 1000))

def mlp(x, hidden_layers, output_layer, activation=tf.nn.relu, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_layer, activation=last_activation)

# CHANGED FROM DDPG!
def deterministic_actor_double_critic(x, a, hidden_sizes, act_dim, max_act=1):
    '''
    Deterministic Actor-Critic
    '''
    # Actor
    with tf.variable_scope('p_mlp'):
        p_means = max_act * mlp(x, hidden_sizes, act_dim, last_activation=tf.tanh) 
    
    # First critic
    with tf.variable_scope('q1_mlp'):
        q1_d = mlp(tf.concat([x,p_means], axis=-1), hidden_sizes, 1, last_activation=None) 
    
    with tf.variable_scope('q1_mlp', reuse=True): # Use the weights of the mlp just defined
        q1_a = mlp(tf.concat([x,a], axis=-1), hidden_sizes, 1, last_activation=None)

    # Second critic
    with tf.variable_scope('q2_mlp'):
        q2_d = mlp(tf.concat([x,p_means], axis=-1), hidden_sizes, 1, last_activation=None)
    with tf.variable_scope('q2_mlp', reuse=True):
        q2_a = mlp(tf.concat([x,a], axis=-1), hidden_sizes, 1, last_activation=None)

    return p_means, tf.squeeze(q1_d), tf.squeeze(q1_a), tf.squeeze(q2_d), tf.squeeze(q2_a)

class ExperiencedBuffer():
    '''
    Experienced buffer
    '''
    def __init__(self, buffer_size):
        # Contains up to 'buffer_size' experience
        self.obs_buf = deque(maxlen=buffer_size)
        self.rew_buf = deque(maxlen=buffer_size)
        self.act_buf = deque(maxlen=buffer_size)
        self.obs2_buf = deque(maxlen=buffer_size)
        self.done_buf = deque(maxlen=buffer_size)


    def add(self, obs, rew, act, obs2, done):
        '''
        Add a new transition to the buffers
        '''
        self.obs_buf.append(obs)
        self.rew_buf.append(rew)
        self.act_buf.append(act)
        self.obs2_buf.append(obs2)
        self.done_buf.append(done)
        

    def sample_minibatch(self, batch_size):
        '''
        Sample a mini-batch of size 'batch_size'
        '''
        mb_indices = np.random.randint(len(self.obs_buf), size=batch_size)

        mb_obs = [self.obs_buf[i] for i in mb_indices]
        mb_rew = [self.rew_buf[i] for i in mb_indices]
        mb_act = [self.act_buf[i] for i in mb_indices]
        mb_obs2 = [self.obs2_buf[i] for i in mb_indices]
        mb_done = [self.done_buf[i] for i in mb_indices]

        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done

    def __len__(self):
        return len(self.obs_buf)

def test_agent(env_test, agent_op, num_games=10):
    '''
    Test an agent 'agent_op', 'num_games' times
    Return mean and std
    '''
    games_r = []

    for _ in range(num_games):
        d = False
        game_r = 0
        o = env_test.reset()

        while not d:
            a_s = agent_op(o)
            o, r, d, _ = env_test.step(a_s)

            game_r += r

        games_r.append(game_r)

    return np.mean(games_r), np.std(games_r)


def TD3(env_name, hidden_sizes=[32], ac_lr=1e-2, cr_lr=1e-2, num_epochs=2000, buffer_size=5000, discount=0.99, render_cycle=10000, mean_summaries_steps=1000, 
        batch_size=128, min_buffer_size=5000, tau=0.005, target_noise=0.2, expl_noise=0.1, policy_update_freq=2):

    # Create an environment for training
    env = gym.make(env_name)
    # Create an environment for testing the actor
    env_test = gym.make(env_name)

    tf.reset_default_graph()

    obs_dim = env.observation_space.shape
    act_dim = env.action_space.shape
    print('-- Observation space:', obs_dim, ' Action space:', act_dim, '--')

    # Create some placeholders
    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None, act_dim[0]), dtype=tf.float32, name='act')
    y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')

    # Create an online deterministic actor and a double critic 
    with tf.variable_scope('online'):
        p_onl, qd1_onl, qa1_onl, _, qa2_onl = deterministic_actor_double_critic(obs_ph, act_ph, hidden_sizes, act_dim[0], np.max(env.action_space.high))

    # and a target actor and double critic
    with tf.variable_scope('target'):
        p_tar, _, qa1_tar, _, qa2_tar = deterministic_actor_double_critic(obs_ph, act_ph, hidden_sizes, act_dim[0], np.max(env.action_space.high))

    def variables_in_scope(scope):
        '''
        Retrieve all the variables in the scope 'scope'
        '''
        return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope)

    # Copy all the online variables to the target networks i.e. target = online
    # Needed only at the beginning
    init_target = [target_var.assign(online_var) for target_var, online_var in zip(variables_in_scope('target'), variables_in_scope('online'))]
    init_target_op = tf.group(*init_target)

    # Soft update
    update_target = [target_var.assign(tau*online_var + (1-tau)*target_var) for target_var, online_var in zip(variables_in_scope('target'), variables_in_scope('online'))]
    update_target_op = tf.group(*update_target)

    # Critics loss (MSE)
    q1_loss = tf.reduce_mean((qa1_onl - y_ph)**2) 
    q2_loss = tf.reduce_mean((qa2_onl - y_ph)**2)

    # Actor loss
    p_loss = -tf.reduce_mean(qd1_onl)
    
    # Optimize the critics
    q1_opt = tf.train.AdamOptimizer(cr_lr).minimize(q1_loss)
    q2_opt = tf.train.AdamOptimizer(cr_lr).minimize(q2_loss)

    # Optimize the actor
    p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss, var_list=variables_in_scope('online/p_mlp'))


    def add_normal_noise(x, scale, low_lim=-0.5, high_lim=0.5):
        return x + np.clip(np.random.normal(loc=0.0, scale=scale, size=x.shape), low_lim, high_lim)

    def agent_op(o):
        ac = np.squeeze(sess.run(p_onl, feed_dict={obs_ph:[o]}))
        return np.clip(ac, env.action_space.low, env.action_space.high)

    def agent_noisy_op(o, scale):
        ac = agent_op(o)
        return np.clip(add_normal_noise(ac, scale, env.action_space.low, env.action_space.high), env.action_space.low, env.action_space.high)


    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, int(now.second))
    print('Time:', clock_time)

    # Set TensorBoard
    tf.summary.scalar('loss/q1', q1_loss)
    tf.summary.scalar('loss/q2', q2_loss)
    tf.summary.scalar('loss/p', p_loss)
    scalar_summary = tf.summary.merge_all()

    hyp_str = '-aclr_'+str(ac_lr)+'-crlr_'+str(cr_lr)+'-tau_'+str(tau)

    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/TD3_'+clock_time+'_'+hyp_str, tf.get_default_graph())

    # Create a session and initialize the variables
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    sess.run(init_target_op)
    
    # Some useful variables..
    render_the_game = False
    step_count = 0
    last_q1_update_loss = []
    last_q2_update_loss = []
    last_p_update_loss = []
    ep_time = current_milli_time()
    batch_rew = []

    # Reset the environment
    obs = env.reset()
    # Initialize the buffer
    buffer = ExperiencedBuffer(buffer_size)


    for ep in range(num_epochs):
        g_rew = 0
        done = False

        while not done:
            # If not gathered enough experience yet, act randomly
            if len(buffer) < min_buffer_size:
                act = env.action_space.sample()
            else:
                act = agent_noisy_op(obs, expl_noise)

            # Take a step in the environment
            obs2, rew, done, _ = env.step(act)

            if render_the_game:
                env.render()

            # Add the transition in the buffer
            buffer.add(obs.copy(), rew, act, obs2.copy(), done)

            obs = obs2
            g_rew += rew
            step_count += 1

            if len(buffer) > min_buffer_size:
                # sample a mini batch from the buffer
                mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)


                double_actions = sess.run(p_tar, feed_dict={obs_ph:mb_obs2})
                # Target regularization
                double_noisy_actions = np.clip(add_normal_noise(double_actions, target_noise), env.action_space.low, env.action_space.high)

                # Clipped Double Q-learning
                q1_target_mb, q2_target_mb = sess.run([qa1_tar,qa2_tar], feed_dict={obs_ph:mb_obs2, act_ph:double_noisy_actions})
                q_target_mb = np.min([q1_target_mb, q2_target_mb], axis=0) 
                assert(len(q1_target_mb) == len(q_target_mb))

                # Compute the target values
                y_r = np.array(mb_rew) + discount*(1-np.array(mb_done))*q_target_mb

                # Optimize the critics
                train_summary, _, q1_train_loss, _, q2_train_loss = sess.run([scalar_summary, q1_opt, q1_loss, q2_opt, q2_loss], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph: mb_act})

                # Delayed policy update
                if step_count % policy_update_freq == 0:
                    # Optimize the policy
                    _, p_train_loss = sess.run([p_opt, p_loss], feed_dict={obs_ph:mb_obs})

                    # Soft update of the target networks
                    sess.run(update_target_op)

                    file_writer.add_summary(train_summary, step_count)
                    last_q1_update_loss.append(q1_train_loss)
                    last_q2_update_loss.append(q2_train_loss)
                    last_p_update_loss.append(p_train_loss)

                
                # some 'mean' summaries to plot more smooth functions
                if step_count % mean_summaries_steps == 0:
                    summary = tf.Summary()
                    summary.value.add(tag='loss/mean_q1', simple_value=np.mean(last_q1_update_loss))
                    summary.value.add(tag='loss/mean_q2', simple_value=np.mean(last_q2_update_loss))
                    summary.value.add(tag='loss/mean_p', simple_value=np.mean(last_p_update_loss))
                    file_writer.add_summary(summary, step_count)
                    file_writer.flush()

                    last_q1_update_loss = []
                    last_q2_update_loss = []
                    last_p_update_loss = []


            if done:
                obs = env.reset()
                batch_rew.append(g_rew)
                g_rew, render_the_game = 0, False

        # Test the actor every 10 epochs
        if ep % 10 == 0:
            test_mn_rw, test_std_rw = test_agent(env_test, agent_op)
            summary = tf.Summary()
            summary.value.add(tag='test/reward', simple_value=test_mn_rw)
            file_writer.add_summary(summary, step_count)
            file_writer.flush()

            ep_sec_time = int((current_milli_time()-ep_time) / 1000)
            print('Ep:%4d Rew:%4.2f -- Step:%5d -- Test:%4.2f %4.2f -- Time:%d' %  (ep,np.mean(batch_rew), step_count, test_mn_rw, test_std_rw, ep_sec_time))

            ep_time = current_milli_time()
            batch_rew = []
                
        if ep % render_cycle == 0:
            render_the_game = True

    # close everything
    file_writer.close()
    env.close()
    env_test.close()


if __name__ == '__main__':
    TD3('BipedalWalker-v2', hidden_sizes=[64,64], ac_lr=4e-4, cr_lr=4e-4, buffer_size=200000, mean_summaries_steps=100, batch_size=64, 
        min_buffer_size=10000, tau=0.005, policy_update_freq=2, target_noise=0.1)

================================================
FILE: Chapter09/ME-TRPO.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
import roboschool


def mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_layer, activation=last_activation)

def softmax_entropy(logits):
    '''
    Softmax Entropy
    '''
    return -tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)

def gaussian_log_likelihood(ac, mean, log_std):
    '''
    Gaussian Log Likelihood 
    '''
    log_p = ((ac-mean)**2 / (tf.exp(log_std)**2+1e-9) + 2*log_std) + np.log(2*np.pi)
    return -0.5 * tf.reduce_sum(log_p, axis=-1)

def conjugate_gradient(A, b, x=None, iters=10):
    '''
    Conjugate gradient method: approximate the solution of Ax=b
    It solve Ax=b without forming the full matrix, just compute the matrix-vector product (The Fisher-vector product)
    
    NB: A is not the full matrix but is a useful matrix-vector product between the averaged Fisher information matrix and arbitrary vectors 
    Descibed in Appendix C.1 of the TRPO paper
    '''
    if x is None:
        x = np.zeros_like(b)
        
    r = A(x) - b
    p = -r
    for _ in range(iters):
        a = np.dot(r, r) / (np.dot(p, A(p))+1e-8)
        x += a*p
        r_n = r + a*A(p)
        b = np.dot(r_n, r_n) / (np.dot(r, r)+1e-8)
        p = -r_n + b*p
        r = r_n
    return x

def gaussian_DKL(mu_q, log_std_q, mu_p, log_std_p):
    '''
    Gaussian KL divergence in case of a diagonal covariance matrix
    '''
    return tf.reduce_mean(tf.reduce_sum(0.5 * (log_std_p - log_std_q + tf.exp(log_std_q - log_std_p) + (mu_q - mu_p)**2 / tf.exp(log_std_p) - 1), axis=1))

def backtracking_line_search(Dkl, delta, old_loss, p=0.8):
    '''
    Backtracking line searc. It look for a coefficient s.t. the constraint on the DKL is satisfied
    It has both to
     - improve the non-linear objective
     - satisfy the constraint

    '''
    ## Explained in Appendix C of the TRPO paper
    a = 1
    it = 0
    
    new_dkl, new_loss = Dkl(a) 
    while (new_dkl > delta) or (new_loss > old_loss):
        a *= p
        it += 1
        new_dkl, new_loss = Dkl(a)

    return a

def GAE(rews, v, v_last, gamma=0.99, lam=0.95):
    '''
    Generalized Advantage Estimation
    '''
    assert len(rews) == len(v)
    vs = np.append(v, v_last)
    d = np.array(rews) + gamma*vs[1:] - vs[:-1]
    gae_advantage = discounted_rewards(d, 0, gamma*lam)
    return gae_advantage

def discounted_rewards(rews, last_sv, gamma):
    '''
    Discounted reward to go 

    Parameters:
    ----------
    rews: list of rewards
    last_sv: value of the last state
    gamma: discount value 
    '''
    rtg = np.zeros_like(rews, dtype=np.float32)
    rtg[-1] = rews[-1] + gamma*last_sv
    for i in reversed(range(len(rews)-1)):
        rtg[i] = rews[i] + gamma*rtg[i+1]
    return rtg

def flatten_list(tensor_list):
    '''
    Flatten a list of tensors
    '''
    return tf.concat([flatten(t) for t in tensor_list], axis=0)

def flatten(tensor):
    '''
    Flatten a tensor
    '''
    return tf.reshape(tensor, shape=(-1,))

  
def test_agent(env_test, agent_op, num_games=10):
    '''
    Test an agent 'agent_op', 'num_games' times
    Return mean and std
    '''
    games_r = []
    for _ in range(num_games):
        d = False
        game_r = 0
        o = env_test.reset()

        while not d:
            a_s, _ = agent_op([o])
            o, r, d, _ = env_test.step(a_s)
            game_r += r

        games_r.append(game_r)
    return np.mean(games_r), np.std(games_r)

class Buffer():
    '''
    Class to store the experience from a unique policy
    '''
    def __init__(self, gamma=0.99, lam=0.95):
        self.gamma = gamma
        self.lam = lam
        self.adv = []
        self.ob = []
        self.ac = []
        self.rtg = []

    def store(self, temp_traj, last_sv):
        '''
        Add temp_traj values to the buffers and compute the advantage and reward to go

        Parameters:
        -----------
        temp_traj: list where each element is a list that contains: observation, reward, action, state-value
        last_sv: value of the last state (Used to Bootstrap)
        '''
        # store only if there are temporary trajectories
        if len(temp_traj) > 0:
            self.ob.extend(temp_traj[:,0])
            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)
            self.adv.extend(GAE(temp_traj[:,1], temp_traj[:,3], last_sv, self.gamma, self.lam))
            self.rtg.extend(rtg)
            self.ac.extend(temp_traj[:,2])

    def get_batch(self):
        # standardize the advantage values
        norm_adv = (self.adv - np.mean(self.adv)) / (np.std(self.adv) + 1e-10)
        return np.array(self.ob), np.array(np.expand_dims(self.ac,-1)), np.array(norm_adv), np.array(self.rtg)

    def __len__(self):
        assert(len(self.adv) == len(self.ob) == len(self.ac) == len(self.rtg))
        return len(self.ob)


class FullBuffer():
    def __init__(self):
        self.rew = []
        self.obs = []
        self.act = []
        self.nxt_obs = []
        self.done = []
        
        self.train_idx = []
        self.valid_idx = []
        self.idx = 0

        
    def store(self, obs, act, rew, nxt_obs, done):
        self.rew.append(rew)
        self.obs.append(obs)
        self.act.append(act)
        self.nxt_obs.append(nxt_obs)
        self.done.append(done)
          
        self.idx += 1

    def generate_random_dataset(self):
        rnd = np.arange(len(self.obs))
        np.random.shuffle(rnd)
        self.valid_idx = rnd[ : int(len(self.obs)/3)]
        self.train_idx = rnd[int(len(self.obs)/3) : ]
        print('Train set:', len(self.train_idx), 'Valid set:', len(self.valid_idx))  
      
    def get_training_batch(self):
        return np.array(self.obs)[self.train_idx], np.array(np.expand_dims(self.act,-1))[self.train_idx], np.array(self.rew)[self.train_idx], np.array(self.nxt_obs)[self.train_idx], np.array(self.done)[self.train_idx]
      
      
    def get_valid_batch(self):
        return np.array(self.obs)[self.valid_idx], np.array(np.expand_dims(self.act,-1))[self.valid_idx], np.array(self.rew)[self.valid_idx], np.array(self.nxt_obs)[self.valid_idx], np.array(self.done)[self.valid_idx]
      
    def __len__(self):
        assert(len(self.rew) == len(self.obs) == len(self.act) == len(self.nxt_obs) == len(self.done))
        return len(self.obs)

      
def simulate_environment(env, policy, simulated_steps):

    buffer = Buffer(0.99, 0.95)
    # lists to store rewards and length of the trajectories completed
    steps = 0
    number_episodes = 0

    while steps < simulated_steps:
        temp_buf = []
        obs = env.reset()
        number_episodes += 1
        done = False

        while not done:
            act, val = policy([obs])

            obs2, rew, done, _ = env.step([act])
          
            temp_buf.append([obs.copy(), rew, np.squeeze(act), np.squeeze(val)])

            obs = obs2.copy()
            steps += 1
                
            if done:
                buffer.store(np.array(temp_buf), 0)
                temp_buf = []

            if steps == simulated_steps:
                break

        buffer.store(np.array(temp_buf), np.squeeze(policy([obs])[1]))
        
    print('Sim ep:',number_episodes, end=' ')
    
    return buffer.get_batch()


class NetworkEnv(gym.Wrapper):
    def __init__(self, env, model_func, reward_func, done_func, number_models):
        gym.Wrapper.__init__(self, env)
        self.model_func = model_func
        self.reward_func = reward_func
        self.done_func = done_func
        self.number_models = number_models
        self.len_episode = 0

    def reset(self, **kwargs):
        self.len_episode = 0
        self.obs = self.env.reset(**kwargs)
          
        return self.obs
    
    def step(self, action):
        # predict the next state on a random model
        obs = self.model_func(self.obs, [np.squeeze(action)], np.random.randint(0,self.number_models))
        rew = self.reward_func(self.obs, [np.squeeze(action)])
        done = self.done_func(obs)
        
        self.len_episode += 1

        if self.len_episode >= 990:
          done = True
        
        self.obs = obs
        
        return self.obs, rew, done, ""

class StructEnv(gym.Wrapper):
    '''
    Gym Wrapper to store information like number of steps and total reward of the last espisode.
    '''
    def __init__(self, env):
        gym.Wrapper.__init__(self, env)
        self.n_obs = self.env.reset()
        self.total_rew = 0
        self.len_episode = 0

    def reset(self, **kwargs):
        self.n_obs = self.env.reset(**kwargs)
        self.total_rew = 0
        self.len_episode = 0
        return self.n_obs.copy()
        
    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.total_rew += reward
        self.len_episode += 1
        return ob, reward, done, info

    def get_episode_reward(self):
        return self.total_rew

    def get_episode_length(self):
        return self.len_episode

def pendulum_done(ob):
  return np.abs(np.arcsin(np.squeeze(ob[3]))) > .2

def pendulum_reward(ob, ac):
  return 1


def restore_model(old_model_variables, m_variables):    
    # variable used as index for restoring the actor's parameters
    it_v2 = tf.Variable(0, trainable=False)
    restore_m_params = []
    
    for m_v in m_variables:
        upd_m_rsh = tf.reshape(old_model_variables[it_v2 : it_v2+tf.reduce_prod(m_v.shape)], shape=m_v.shape)
        restore_m_params.append(m_v.assign(upd_m_rsh)) 
        it_v2 += tf.reduce_prod(m_v.shape)
        
    return tf.group(*restore_m_params)
      
      
def METRPO(env_name, hidden_sizes=[32], cr_lr=5e-3, num_epochs=50, gamma=0.99, lam=0.95, number_envs=1, 
        critic_iter=10, steps_per_env=100, delta=0.002, algorithm='TRPO', conj_iters=10, minibatch_size=1000,
          mb_lr=0.0001, model_batch_size=512, simulated_steps=300, num_ensemble_models=2, model_iter=15):
    '''
    Model Ensemble Trust Region Policy Optimization

    Parameters:
    -----------
    env_name: Name of the environment
    hidden_sizes: list of the number of hidden units for each layer
    cr_lr: critic learning rate
    num_epochs: number of training epochs
    gamma: discount factor
    lam: lambda parameter for computing the GAE
    number_envs: number of "parallel" synchronous environments
        # NB: it isn't distributed across multiple CPUs
    critic_iter: NUmber of SGD iterations on the critic per epoch
    steps_per_env: number of steps per environment
            # NB: the total number of steps per epoch will be: steps_per_env*number_envs
    delta: Maximum KL divergence between two policies. Scalar value
    algorithm: type of algorithm. Either 'TRPO' or 'NPO'
    conj_iters: number of conjugate gradient iterations
    minibatch_size: Batch size used to train the critic
    mb_lr: learning rate of the environment model
    model_batch_size: batch size of the environment model
    simulated_steps: number of simulated steps for each policy update
    num_ensemble_models: number of models
    model_iter: number of iterations without improvement before stopping training the model
    '''
    # TODO: add ME-TRPO hyperparameters

    tf.reset_default_graph()

    # Create a few environments to collect the trajectories
    envs = [StructEnv(gym.make(env_name)) for _ in range(number_envs)]
    env_test = gym.make(env_name)
    #env_test = gym.wrappers.Monitor(env_test, "VIDEOS/", force=True, video_callable=lambda x: x%10 == 0)

    low_action_space = envs[0].action_space.low
    high_action_space = envs[0].action_space.high

    obs_dim = envs[0].observation_space.shape
    act_dim = envs[0].action_space.shape[0]
    
    print(envs[0].action_space, envs[0].observation_space)

    # Placeholders
    act_ph = tf.placeholder(shape=(None,act_dim), dtype=tf.float32, name='act')
    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
    # NEW
    nobs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='nobs')
    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')
    adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='adv')
    old_p_log_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_p_log')
    old_mu_ph = tf.placeholder(shape=(None, act_dim), dtype=tf.float32, name='old_mu')
    old_log_std_ph = tf.placeholder(shape=(act_dim), dtype=tf.float32, name='old_log_std')
    p_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='p_ph')

    # result of the conjugate gradient algorithm
    cg_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='cg')
    
    #########################################################
    ######################## POLICY #########################
    #########################################################

    old_model_variables = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_model_variables')
        
    # Neural network that represent the policy
    with tf.variable_scope('actor_nn'):
        p_means = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)
        p_means = tf.clip_by_value(p_means, low_action_space, high_action_space)
        log_std = tf.get_variable(name='log_std', initializer=np.ones(act_dim, dtype=np.float32))

    # Neural network that represent the value function
    with tf.variable_scope('critic_nn'):
        s_values = mlp(obs_ph, hidden_sizes, 1, tf.tanh, last_activation=None)
        s_values = tf.squeeze(s_values)    


    # Add "noise" to the predicted mean following the Gaussian distribution with standard deviation e^(log_std)
    p_noisy = p_means + tf.random_normal(tf.shape(p_means), 0, 1) * tf.exp(log_std)
    # Clip the noisy actions
    a_sampl = tf.clip_by_value(p_noisy, low_action_space, high_action_space)
    # Compute the gaussian log likelihood
    p_log = gaussian_log_likelihood(act_ph, p_means, log_std)

    # Measure the divergence
    diverg = tf.reduce_mean(tf.exp(old_p_log_ph - p_log))
    
    # ratio
    ratio_new_old = tf.exp(p_log - old_p_log_ph)
    # TRPO surrogate loss function
    p_loss = - tf.reduce_mean(ratio_new_old * adv_ph)

    # MSE loss function
    v_loss = tf.reduce_mean((ret_ph - s_values)**2)
    # Critic optimization
    v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)

    def variables_in_scope(scope):
        # get all trainable variables in 'scope'
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
    
    # Gather and flatten the actor parameters
    p_variables = variables_in_scope('actor_nn')
    p_var_flatten = flatten_list(p_variables)

    # Gradient of the policy loss with respect to the actor parameters
    p_grads = tf.gradients(p_loss, p_variables)
    p_grads_flatten = flatten_list(p_grads)

    ########### RESTORE ACTOR PARAMETERS ###########
    p_old_variables = tf.placeholder(shape=(None,), dtype=tf.float32, name='p_old_variables')
    # variable used as index for restoring the actor's parameters
    it_v1 = tf.Variable(0, trainable=False)
    restore_params = []

    for p_v in p_variables:
        upd_rsh = tf.reshape(p_old_variables[it_v1 : it_v1+tf.reduce_prod(p_v.shape)], shape=p_v.shape)
        restore_params.append(p_v.assign(upd_rsh)) 
        it_v1 += tf.reduce_prod(p_v.shape)

    restore_params = tf.group(*restore_params)

    # gaussian KL divergence of the two policies 
    dkl_diverg = gaussian_DKL(old_mu_ph, old_log_std_ph, p_means, log_std) 

    # Jacobian of the KL divergence (Needed for the Fisher matrix-vector product)
    dkl_diverg_grad = tf.gradients(dkl_diverg, p_variables) 

    dkl_matrix_product = tf.reduce_sum(flatten_list(dkl_diverg_grad) * p_ph)
    print('dkl_matrix_product', dkl_matrix_product.shape)
    # Fisher vector product
    # The Fisher-vector product is a way to compute the A matrix without the need of the full A
    Fx = flatten_list(tf.gradients(dkl_matrix_product, p_variables))

    ## Step length
    beta_ph = tf.placeholder(shape=(), dtype=tf.float32, name='beta')
    # NPG update
    npg_update = beta_ph * cg_ph
    
    ## alpha is found through line search
    alpha = tf.Variable(1., trainable=False)
    # TRPO update
    trpo_update = alpha * npg_update

    ####################   POLICY UPDATE  ###################
    # variable used as an index
    it_v = tf.Variable(0, trainable=False)
    p_opt = []
    # Apply the updates to the policy
    for p_v in p_variables:
        upd_rsh = tf.reshape(trpo_update[it_v : it_v+tf.reduce_prod(p_v.shape)], shape=p_v.shape)
        p_opt.append(p_v.assign_sub(upd_rsh))
        it_v += tf.reduce_prod(p_v.shape)

    p_opt = tf.group(*p_opt)
        

    #########################################################
    ######################### MODEL #########################
    #########################################################
     
    m_opts = []
    m_losses = []
    
    nobs_pred_m = []
    act_obs = tf.concat([obs_ph, act_ph], 1)
    # computational graph of N models
    for i in range(num_ensemble_models):
        with tf.variable_scope('model_'+str(i)+'_nn'):
            nobs_pred = mlp(act_obs, [64, 64], obs_dim[0], tf.nn.relu, last_activation=None)
            nobs_pred_m.append(nobs_pred)
        
        m_loss = tf.reduce_mean((nobs_ph - nobs_pred)**2)
        m_losses.append(m_loss)

        m_opts.append(tf.train.AdamOptimizer(mb_lr).minimize(m_loss))

      
    ##################### RESTORE MODEL ######################
    initialize_models = []
    models_variables = []
    for i in range(num_ensemble_models):
      m_variables = variables_in_scope('model_'+str(i)+'_nn')
      initialize_models.append(restore_model(old_model_variables, m_variables))

      models_variables.append(flatten_list(m_variables))

    
    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, now.second)
    print('Time:', clock_time)


    # Set scalars and hisograms for TensorBoard
    tf.summary.scalar('p_loss', p_loss, collections=['train'])
    tf.summary.scalar('v_loss', v_loss, collections=['train'])
    tf.summary.scalar('p_divergence', diverg, collections=['train'])
    tf.summary.scalar('ratio_new_old',tf.reduce_mean(ratio_new_old), collections=['train'])
    tf.summary.scalar('dkl_diverg', dkl_diverg, collections=['train'])
    tf.summary.scalar('alpha', alpha, collections=['train'])
    tf.summary.scalar('beta', beta_ph, collections=['train'])
    tf.summary.scalar('p_std_mn', tf.reduce_mean(tf.exp(log_std)), collections=['train'])
    tf.summary.scalar('s_values_mn', tf.reduce_mean(s_values), collections=['train'])
    tf.summary.histogram('p_log', p_log, collections=['train'])
    tf.summary.histogram('p_means', p_means, collections=['train'])
    tf.summary.histogram('s_values', s_values, collections=['train'])
    tf.summary.histogram('adv_ph',adv_ph, collections=['train'])
    tf.summary.histogram('log_std',log_std, collections=['train'])
    scalar_summary = tf.summary.merge_all('train')

    tf.summary.scalar('old_v_loss', v_loss, collections=['pre_train'])
    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])
    pre_scalar_summary = tf.summary.merge_all('pre_train')

    hyp_str = '-spe_'+str(steps_per_env)+'-envs_'+str(number_envs)+'-cr_lr'+str(cr_lr)+'-crit_it_'+str(critic_iter)+'-delta_'+str(delta)+'-conj_iters_'+str(conj_iters)
    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/'+algorithm+'_'+clock_time+'_'+hyp_str, tf.get_default_graph())
    
    # create a session
    sess = tf.Session()
    # initialize the variables
    sess.run(tf.global_variables_initializer())
    
    def action_op(o):
        return sess.run([p_means, s_values], feed_dict={obs_ph:o})

    def action_op_noise(o):
        return sess.run([a_sampl, s_values], feed_dict={obs_ph:o})

    def model_op(o, a, md_idx):
        mo = sess.run(nobs_pred_m[md_idx], feed_dict={obs_ph:[o], act_ph:[a]})
        return np.squeeze(mo)
      
    def run_model_loss(model_idx, r_obs, r_act, r_nxt_obs):
        return sess.run(m_losses[model_idx], feed_dict={obs_ph:r_obs, act_ph:r_act, nobs_ph:r_nxt_obs})
      
    def run_model_opt_loss(model_idx, r_obs, r_act, r_nxt_obs):
        return sess.run([m_opts[model_idx], m_losses[model_idx]], feed_dict={obs_ph:r_obs, act_ph:r_act, nobs_ph:r_nxt_obs})      
      
    def model_assign(i, model_variables_to_assign):
        '''
        Update the i-th model's parameters
        '''
        return sess.run(initialize_models[i], feed_dict={old_model_variables:model_variables_to_assign})
    
    def policy_update(obs_batch, act_batch, adv_batch, rtg_batch):
        # log probabilities, logits and log std of the "old" policy
        # "old" policy refer to the policy to optimize and that has been used to sample from the environment

        old_p_log, old_p_means, old_log_std = sess.run([p_log, p_means, log_std], feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})
        # get also the "old" parameters
        old_actor_params = sess.run(p_var_flatten)

        # old_p_loss is later used in the line search
        # run pre_scalar_summary for a summary before the optimization
        old_p_loss, summary = sess.run([p_loss,pre_scalar_summary], feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})
        file_writer.add_summary(summary, step_count)

        def H_f(p):
            '''
            Run the Fisher-Vector product on 'p' to approximate the Hessian of the DKL
            '''
            return sess.run(Fx, feed_dict={old_mu_ph:old_p_means, old_log_std_ph:old_log_std, p_ph:p, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})

        g_f = sess.run(p_grads_flatten, feed_dict={old_mu_ph:old_p_means,obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})
        ## Compute the Conjugate Gradient so to obtain an approximation of H^(-1)*g
        # Where H in reality isn't the true Hessian of the KL divergence but an approximation of it computed via Fisher-Vector Product (F)
        conj_grad = conjugate_gradient(H_f, g_f, iters=conj_iters)

        # Compute the step length
        beta_np = np.sqrt(2*delta / (1e-10 + np.sum(conj_grad * H_f(conj_grad))))
        
        def DKL(alpha_v):
            '''
            Compute the KL divergence.
            It optimize the function to compute the DKL. Afterwards it restore the old parameters.
            '''
            sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:alpha_v, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})
            a_res = sess.run([dkl_diverg, p_loss], feed_dict={old_mu_ph:old_p_means, old_log_std_ph:old_log_std, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})
            sess.run(restore_params, feed_dict={p_old_variables: old_actor_params})
            return a_res

        # Actor optimization step
        # Different for TRPO or NPG
        # Backtracing line search to find the maximum alpha coefficient s.t. the constraint is valid
        best_alpha = backtracking_line_search(DKL, delta, old_p_loss, p=0.8)
        sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:best_alpha, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})

        lb = len(obs_batch)
        shuffled_batch = np.arange(lb)
        np.random.shuffle(shuffled_batch)

        # Value function optimization steps
        for _ in range(critic_iter):
            # shuffle the batch on every iteration
            np.random.shuffle(shuffled_batch)
            for idx in range(0,lb, minibatch_size):
                minib = shuffled_batch[idx:min(idx+minibatch_size,lb)]
                sess.run(v_opt, feed_dict={obs_ph:obs_batch[minib], ret_ph:rtg_batch[minib]})


    def train_model(tr_obs, tr_act, tr_nxt_obs, v_obs, v_act, v_nxt_obs, step_count, model_idx):

        # Get validation loss on the old model
        mb_valid_loss1 = run_model_loss(model_idx, v_obs, v_act, v_nxt_obs)

        # Restore the random weights to have a new, clean neural network
        model_assign(model_idx, initial_variables_models[model_idx])

        mb_valid_loss = run_model_loss(model_idx, v_obs, v_act, v_nxt_obs)

        acc_m_losses = []
        last_m_losses = []
        md_params = sess.run(models_variables[model_idx])
        best_mb = {'iter':0, 'loss':mb_valid_loss, 'params':md_params}
        it = 0


        lb = len(tr_obs)
        shuffled_batch = np.arange(lb)
        np.random.shuffle(shuffled_batch)

        while best_mb['iter'] > it - model_iter:
            
            # update the model on each mini-batch
            last_m_losses = []
            for idx in range(0, lb, model_batch_size):
                minib = shuffled_batch[idx:min(idx+minibatch_size,lb)]
                
                if len(minib) != minibatch_size:
                  _, ml = run_model_opt_loss(model_idx, tr_obs[minib], tr_act[minib], tr_nxt_obs[minib])
                  acc_m_losses.append(ml)
                  last_m_losses.append(ml)
                else:
                  print('Warning!')

            # Check if the loss on the validation set has improved
            mb_valid_loss = run_model_loss(model_idx, v_obs, v_act, v_nxt_obs)
            if mb_valid_loss < best_mb['loss']:
                best_mb['loss'] = mb_valid_loss
                best_mb['iter'] = it
                best_mb['params'] = sess.run(models_variables[model_idx])

            it += 1

        # Restore the model with the lower validation loss
        model_assign(model_idx, best_mb['params'])

        print('Model:{}, iter:{} -- Old Val loss:{:.6f}  New Val loss:{:.6f} -- New Train loss:{:.6f}'.format(model_idx, it, mb_valid_loss1, best_mb['loss'], np.mean(last_m_losses)))
        summary = tf.Summary()
        summary.value.add(tag='supplementary/m_loss', simple_value=np.mean(acc_m_losses))
        summary.value.add(tag='supplementary/iterations', simple_value=it)
        file_writer.add_summary(summary, step_count)
        file_writer.flush()
    
    # variable to store the total number of steps
    step_count = 0
    model_buffer = FullBuffer()
    print('Env batch size:',steps_per_env, ' Batch size:',steps_per_env*number_envs)

    # Create a simulated environment
    sim_env = NetworkEnv(gym.make(env_name), model_op, pendulum_reward, pendulum_done, num_ensemble_models)
    
    # Get the initial parameters of each model
    # These are used in later epochs when we aim to re-train the models anew with the new dataset
    initial_variables_models = []
    for model_var in models_variables:
        initial_variables_models.append(sess.run(model_var))

    for ep in range(num_epochs):
        # lists to store rewards and length of the trajectories completed
        batch_rew = []
        batch_len = []
        print('============================', ep, '============================')
        # Execute in serial the environment, storing temporarily the trajectories.
        for env in envs:
            init_log_std = np.ones(act_dim) * np.log(np.random.rand()*1)
            env.reset()
            
            # iterate over a fixed number of steps
            for _ in range(steps_per_env):
                # run the policy
                
                if ep == 0:
                    # Sample random action during the first epoch
                    act = env.action_space.sample()
                else:
                    act = sess.run(a_sampl, feed_dict={obs_ph:[env.n_obs], log_std:init_log_std})
                    
                    
                act = np.squeeze(act)

                # take a step in the environment
                obs2, rew, done, _ = env.step(np.array([act]))

                # add the new transition to the temporary buffer
                model_buffer.store(env.n_obs.copy(), act, rew, obs2.copy(), done)

                env.n_obs = obs2.copy()
                step_count += 1

                if done:
                    batch_rew.append(env.get_episode_reward())
                    batch_len.append(env.get_episode_length())

                    env.reset()
                    init_log_std = np.ones(act_dim) * np.log(np.random.rand()*1)

                    
        print('Ep:%d Rew:%.2f -- Step:%d' % (ep, np.mean(batch_rew), step_count))
        
        ############################################################
        ###################### MODEL LEARNING ######################
        ############################################################
        
        # Initialize randomly a training and validation set
        model_buffer.generate_random_dataset()

        # get both datasets
        train_obs, train_act, _, train_nxt_obs, _ = model_buffer.get_training_batch()
        valid_obs, valid_act, _, valid_nxt_obs, _ = model_buffer.get_valid_batch()
            
        print('Log Std policy:', sess.run(log_std))
        for i in range(num_ensemble_models):
            
            # train the dynamic model on the datasets just sampled
            train_model(train_obs, train_act, train_nxt_obs, valid_obs, valid_act, valid_nxt_obs, step_count, i)

        ############################################################
        ###################### POLICY LEARNING ######################
        ############################################################

        best_sim_test = np.zeros(num_ensemble_models)
        for it in range(80):
            print('\t Policy it', it, end='.. ')
            ##################### MODEL SIMLUATION #####################
            obs_batch, act_batch, adv_batch, rtg_batch = simulate_environment(sim_env, action_op_noise, simulated_steps)
            
            ################# TRPO UPDATE ################
            policy_update(obs_batch, act_batch, adv_batch, rtg_batch)
            
            # Testing the policy on a real environment
            mn_test = test_agent(env_test, action_op, num_games=10)[0]        
            print(' Test score: ', np.round(mn_test, 2))
            
            summary = tf.Summary()
            summary.value.add(tag='test/performance', simple_value=mn_test)
            file_writer.add_summary(summary, step_count)
            file_writer.flush()
            
            # Test the policy on simulated environment. 
            if (it+1) % 5 == 0:
                print('Simulated test:', end=' -- ')
                sim_rewards = []

                for i in range(num_ensemble_models):
                    sim_m_env = NetworkEnv(gym.make(env_name), model_op, pendulum_reward, pendulum_done, i+1)
                    mn_sim_rew, _ = test_agent(sim_m_env, action_op, num_games=5)
                    sim_rewards.append(mn_sim_rew)
                    print(mn_sim_rew, end=' -- ')

                print("")
                sim_rewards = np.array(sim_rewards)
                # stop training if the policy hasn't improved
                if (np.sum(best_sim_test >= sim_rewards) > int(num_ensemble_models*0.7)) \
                    or (len(sim_rewards[sim_rewards >= 990]) > int(num_ensemble_models*0.7)):
                    break
                else:
                  best_sim_test = sim_rewards


    # closing environments..
    for env in envs:
        env.close()
    file_writer.close()

if __name__ == '__main__':
    METRPO('RoboschoolInvertedPendulum-v1', hidden_sizes=[32,32], cr_lr=1e-3, gamma=0.99, lam=0.95, num_epochs=7, steps_per_env=300, 
        number_envs=1, critic_iter=10, delta=0.01, algorithm='TRPO', conj_iters=10, minibatch_size=5000,
        mb_lr=0.00001, model_batch_size=50, simulated_steps=50000, num_ensemble_models=10, model_iter=15)

================================================
FILE: Chapter10/DAgger.py
================================================
import numpy as np 
import tensorflow as tf
from datetime import datetime
import time
from ple.games.flappybird import FlappyBird
from ple import PLE


def mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
    return tf.layers.dense(x, units=output_layer, activation=last_activation)

def flappy_to_list(fd):
    '''
    Return the state dictionary as a list
    '''
    return fd['player_y'], fd['player_vel'], fd['next_pipe_dist_to_player'], fd['next_pipe_top_y'], \
            fd['next_pipe_bottom_y'], fd['next_next_pipe_dist_to_player'], fd['next_next_pipe_top_y'], \
            fd['next_next_pipe_bottom_y']

def flappy_game_state(bol):
    '''
    Normalize the game state
    '''
    stat = flappy_to_list(bol.getGameState())
    stat = (np.array(stat, dtype=np.float32) / 300.0) - 0.5
    return stat

def no_op(env, n_act=5):
    for _ in range(n_act):
        env.act(119 if np.random.randn() < 0.5 else None)


def expert():
    '''
    Load the computational graph and pretarined weights of the expert
    '''
    graph = tf.get_default_graph()

    sess_expert = tf.Session(graph=graph)

    saver = tf.train.import_meta_graph('expert/model.ckpt.meta')
    saver.restore(sess_expert,tf.train.latest_checkpoint('expert/'))
    
    p_argmax = graph.get_tensor_by_name('actor_nn/max_act:0') 
    obs_ph = graph.get_tensor_by_name('obs:0') 

    def expert_policy(state):
        act = sess_expert.run(p_argmax, feed_dict={obs_ph:[state]})
        return np.squeeze(act)

    return expert_policy

def test_agent(policy, file_writer=None, test_games=10, step=0):
    game = FlappyBird()
    env = PLE(game, fps=30, display_screen=False)
    env.init()

    test_rewards = []
    for _ in range(test_games):
        env.reset_game()
        no_op(env)

        game_rew = 0

        while not env.game_over():

            state = flappy_game_state(env)

            action = 119 if policy(state) == 1 else None

            for _ in range(2):
                game_rew += env.act(action)

        test_rewards.append(game_rew)

        if file_writer is not None:
            summary = tf.Summary()
            summary.value.add(tag='test_performance', simple_value=game_rew)
            file_writer.add_summary(summary, step)
            file_writer.flush()

    return test_rewards


def DAgger(hidden_sizes=[32,32], dagger_iterations=20, p_lr=1e-3, step_iterations=1000, batch_size=128, train_epochs=20, obs_dim=8, act_dim=2):

    tf.reset_default_graph()

    ############################## EXPERT ###############################
    # load the expert and return a function that predict the expert action given a state
    expert_policy = expert()     
    print('Expert performance: ', np.mean(test_agent(expert_policy)))


    #################### LEARNER COMPUTATIONAL GRAPH ####################
    obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32, name='obs')
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')

    # Multi-layer perceptron
    p_logits = mlp(obs_ph, hidden_sizes, act_dim, tf.nn.relu, last_activation=None)
        
    act_max = tf.math.argmax(p_logits, axis=1)
    act_onehot = tf.one_hot(act_ph, depth=act_dim)

    # softmax cross entropy loss
    p_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=act_onehot, logits=p_logits))
    # Adam optimizer
    p_opt = tf.train.AdamOptimizer(p_lr).minimize(p_loss)


    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, now.second)
    file_writer = tf.summary.FileWriter('log_dir/FlappyBird/DAgger_'+clock_time, tf.get_default_graph())

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    
    def learner_policy(state):
        action = sess.run(act_max, feed_dict={obs_ph:[state]})
        return np.squeeze(action)

    X = []
    y = []

    env = FlappyBird()

    env = PLE(env, fps=30, display_screen=False)
    env.init()    

    #################### DAgger iterations ####################
    
    for it in range(dagger_iterations):
        sess.run(tf.global_variables_initializer())
        env.reset_game()
        no_op(env)

        game_rew = 0
        rewards = []

        ###################### Populate the dataset #####################

        for _ in range(step_iterations):
            # get the current state from the environment
            state = flappy_game_state(env)

            # As the iterations continue use more and more actions sampled from the learner
            if np.random.rand() < (1 - it/5):
                action = expert_policy(state)
            else:
                action = learner_policy(state)

            action = 119 if action == 1 else None

            rew = env.act(action)
            rew += env.act(action)

            # Add the state and the expert action to the dataset
            X.append(state)
            y.append(expert_policy(state))

            game_rew += rew

            # Whenever the game stop, reset the environment and initailize the variables
            if env.game_over():
                env.reset_game()
                no_op(env)

                rewards.append(game_rew)
                game_rew = 0

        ##################### Training #####################

        # Calculate the number of minibatches
        n_batches = int(np.floor(len(X)/batch_size))

        # shuffle the dataset
        shuffle = np.arange(len(X))
        np.random.shuffle(shuffle)

        
        shuffled_X = np.array(X)[shuffle]
        shuffled_y = np.array(y)[shuffle]
        
        
        for _ in range(train_epochs):
            ep_loss = []
            # Train the model on each minibatch in the dataset
            for b in range(n_batches):
                p_start = b*batch_size

                # mini-batch training
                tr_loss, _ = sess.run([p_loss, p_opt], feed_dict={
                                obs_ph:shuffled_X[p_start:p_start+batch_size], 
                                act_ph:shuffled_y[p_start:p_start+batch_size]})

                ep_loss.append(tr_loss)
            
        agent_tests = test_agent(learner_policy, file_writer, step=len(X))

        print('Ep:', it, np.mean(ep_loss), 'Test:', np.mean(agent_tests))


if __name__ == "__main__":
    DAgger(hidden_sizes=[16,16], dagger_iterations=10, p_lr=1e-4, step_iterations=100, batch_size=50, train_epochs=2000)

================================================
FILE: Chapter10/expert/checkpoint
================================================
model_checkpoint_path: "model.ckpt"
all_model_checkpoint_paths: "model.ckpt"


================================================
FILE: Chapter11/ES.py
================================================
import numpy as np 
import tensorflow as tf
from datetime import datetime
import time
import gym

import multiprocessing as mp
import scipy.stats as ss
import contextlib
import numpy as np

@contextlib.contextmanager
def temp_seed(seed):
    state = np.random.get_state()
    np.random.seed(seed)
    try:
        yield
    finally:
        np.random.set_state(state)

def mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
        
    return tf.layers.dense(x, units=output_layer, activation=last_activation)


def test_agent(env_test, agent_op, num_games=1):
    '''
    Test an agent 'agent_op', 'num_games' times
    Return mean and std
    '''
    games_r = []
    steps = 0
    for _ in range(num_games):
        d = False
        game_r = 0
        o = env_test.reset()

        while not d:
            a_s = agent_op(o)
            o, r, d, _ = env_test.step(a_s)
            game_r += r
            steps += 1

        games_r.append(game_r)
    return games_r, steps


def worker(env_name, initial_seed, hidden_sizes, lr, std_noise, indiv_per_worker, worker_name, params_queue, output_queue):

    env = gym.make(env_name)
    obs_dim = env.observation_space.shape[0]
    act_dim = env.action_space.shape[0]

    import tensorflow as tf

    # set an initial seed common to all the workers
    tf.random.set_random_seed(initial_seed)
    np.random.seed(initial_seed)
    

    with tf.device("/cpu:" + worker_name):
        
        obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32, name='obs_ph')
        new_weights_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='new_weights_ph')
        
        def variables_in_scope(scope):
            # get all trainable variables in 'scope'
            return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)

        with tf.variable_scope('nn_' + worker_name):
            acts = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)

        agent_variables = variables_in_scope('nn_' + worker_name)
        agent_variables_flatten = flatten_list(agent_variables)

        # Update the agent parameters with new weights new_weights_ph
        it_v1 = tf.Variable(0, trainable=False)
        update_weights = []
        for a_v in agent_variables:
            upd_rsh = tf.reshape(new_weights_ph[it_v1 : it_v1+tf.reduce_prod(a_v.shape)], shape=a_v.shape)
            update_weights.append(a_v.assign(upd_rsh))
            it_v1 += tf.reduce_prod(a_v.shape)


        # Reshape the new_weights_ph following the neural network shape
        it_v2 = tf.Variable(0, trainable=False)
        vars_grads_list = []
        for a_v in agent_variables:
            vars_grads_list.append(tf.reshape(new_weights_ph[it_v2 : it_v2+tf.reduce_prod(a_v.shape)], shape=a_v.shape))
            it_v2 += tf.reduce_prod(a_v.shape)

        # Create the optimizer
        opt = tf.train.AdamOptimizer(lr)
        # Apply the "gradients" using Adam
        apply_g = opt.apply_gradients([(g, v) for g, v in zip(vars_grads_list, agent_variables)])
        
    def agent_op(o):
        a = np.squeeze(sess.run(acts, feed_dict={obs_ph:[o]}))
        return np.clip(a, env.action_space.low, env.action_space.high)


    def evaluation_on_noise(noise):
        '''
        Evaluate the agent with the noise
        ''' 
        # Get the original weights that will be restored after the evaluation
        original_weights = sess.run(agent_variables_flatten)

        # Update the weights of the agent/individual by adding the extra noise noise*STD_NOISE
        sess.run(update_weights, feed_dict={new_weights_ph:original_weights + noise*std_noise})

        # Test the agent with the new weights
        rewards, steps = test_agent(env, agent_op)

        # Restore the original weights
        sess.run(update_weights, feed_dict={new_weights_ph:original_weights})

        return np.mean(rewards), steps

    config_proto = tf.ConfigProto(device_count={'CPU': 4}, allow_soft_placement=True)
    sess = tf.Session(config=config_proto)
    sess.run(tf.global_variables_initializer())


    agent_flatten_shape = sess.run(agent_variables_flatten).shape

    while True:

        for _ in range(indiv_per_worker):
            seed = np.random.randint(1e7)

            with temp_seed(seed):
                # sample, for each weight of the agent, from a normal distribution
                sampled_noise = np.random.normal(size=agent_flatten_shape)
            
            # Mirrored sampling
            pos_rew, stp1 = evaluation_on_noise(sampled_noise)
            neg_rew, stp2 = evaluation_on_noise(-sampled_noise)

            # Put the returns and seeds on the queue
            # Note that here we are just sending the seed (a scalar value), not the complete perturbation sampled_noise
            output_queue.put([[pos_rew, neg_rew], seed, stp1+stp2])

        # Get all the returns and seed from each other worker
        batch_return, batch_seed = params_queue.get()

        batch_noise = []
        for seed in batch_seed:

            # reconstruct the perturbations from the seed
            with temp_seed(seed):
                sampled_noise = np.random.normal(size=agent_flatten_shape)

            batch_noise.append(sampled_noise)
            batch_noise.append(-sampled_noise)
            

        # Compute the sthocastic gradient estimate 
        vars_grads = np.zeros(agent_flatten_shape)
        for n, r in zip(batch_noise, batch_return):
            vars_grads += n * r
        vars_grads /= len(batch_noise) * std_noise

        # run Adam optimization on the estimate gradient just computed
        sess.run(apply_g, feed_dict={new_weights_ph:-vars_grads})


def normalized_rank(rewards):
    '''
    Rank the rewards and normalize them.
    '''
    ranked = ss.rankdata(rewards)
    norm = (ranked - 1) / (len(ranked) - 1)
    norm -= 0.5
    return norm


def flatten(tensor):
    '''
    Flatten a tensor
    '''
    return tf.reshape(tensor, shape=(-1,))

def flatten_list(tensor_list):
    '''
    Flatten a list of tensors
    '''
    return tf.concat([flatten(t) for t in tensor_list], axis=0)


def ES(env_name, hidden_sizes=[8,8], number_iter=1000, num_workers=4, lr=0.01, indiv_per_worker=10, std_noise=0.01):


    initial_seed = np.random.randint(1e7)

    # Create a queue for the output values (single returns and seeds values)
    output_queue = mp.Queue(maxsize=num_workers*indiv_per_worker)
    # Create a queue for the input paramaters (batch return and batch seeds)
    params_queue = mp.Queue(maxsize=num_workers)


    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, now.second)
    hyp_str = '-numworkers_'+str(num_workers)+'-lr_'+str(lr)
    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/'+clock_time+'_'+hyp_str, tf.get_default_graph())
    
    processes = []
    # Create a parallel process for each worker
    for widx in range(num_workers):
        p = mp.Process(target=worker, args=(env_name, initial_seed, hidden_sizes, lr, std_noise, indiv_per_worker, str(widx), params_queue, output_queue))
        p.start()
        processes.append(p)

    tot_steps = 0
    # Iterate over all the training iterations
    for n_iter in range(number_iter):

        batch_seed = []
        batch_return = []
        
        # Wait until enough candidate individuals are evaluated
        for _ in range(num_workers*indiv_per_worker):
            p_rews, p_seed, p_steps = output_queue.get()

            batch_seed.append(p_seed)
            batch_return.extend(p_rews)
            tot_steps += p_steps

        print('Iter: {} Reward: {:.2f}'.format(n_iter, np.mean(batch_return)))

        # Let's save the population's performance
        summary = tf.Summary()
        for r in batch_return:
            summary.value.add(tag='performance', simple_value=r)
        file_writer.add_summary(summary, tot_steps)
        file_writer.flush()

        # Rank and normalize the returns
        batch_return = normalized_rank(batch_return)

        # Put on the queue all the returns and seed so that each worker can optimize the neural network
        for _ in range(num_workers):
            params_queue.put([batch_return, batch_seed])
    
    # terminate all workers
    for p in processes:
        p.terminate()


if __name__ == '__main__':
    ES('LunarLanderContinuous-v2', hidden_sizes=[32,32], number_iter=200, num_workers=4, lr=0.02, indiv_per_worker=12, std_noise=0.05)


================================================
FILE: Chapter12/ESBAS.py
================================================
import numpy as np 
import tensorflow as tf
import gym
from datetime import datetime
from collections import deque
import time
import sys


gym.logger.set_level(40)

current_milli_time = lambda: int(round(time.time() * 1000))
    

def mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):
    '''
    Multi-layer perceptron
    '''
    for l in hidden_layers:
        x = tf.layers.dense(x, units=l, activation=activation)
        
    return tf.layers.dense(x, units=output_layer, activation=last_activation)

class ExperienceBuffer():
    '''
    Experience Replay Buffer
    '''
    def __init__(self, buffer_size):
        self.obs_buf = deque(maxlen=buffer_size)
        self.rew_buf = deque(maxlen=buffer_size)
        self.act_buf = deque(maxlen=buffer_size)
        self.obs2_buf = deque(maxlen=buffer_size)
        self.done_buf = deque(maxlen=buffer_size)


    def add(self, obs, rew, act, obs2, done):
        # Add a new transition to the buffers
        self.obs_buf.append(obs)
        self.rew_buf.append(rew)
        self.act_buf.append(act)
        self.obs2_buf.append(obs2)
        self.done_buf.append(done)
        

    def sample_minibatch(self, batch_size):
        # Sample a minibatch of size batch_size
        mb_indices = np.random.randint(len(self.obs_buf), size=batch_size)

        mb_obs = [self.obs_buf[i] for i in mb_indices]
        mb_rew = [self.rew_buf[i] for i in mb_indices]
        mb_act = [self.act_buf[i] for i in mb_indices]
        mb_obs2 = [self.obs2_buf[i] for i in mb_indices]
        mb_done = [self.done_buf[i] for i in mb_indices]

        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done

    def __len__(self):
        return len(self.obs_buf)


def q_target_values(mini_batch_rw, mini_batch_done, av, discounted_value):   
    '''
    Calculate the target value y for each transition
    '''
    max_av = np.max(av, axis=1)
    
    # if episode terminate, y take value r
    # otherwise, q-learning step
    ys = []
    for r, d, av in zip(mini_batch_rw, mini_batch_done, max_av):
        if d:
            ys.append(r)
        else:
            q_step = r + discounted_value * av
            ys.append(q_step)
    
    assert len(ys) == len(mini_batch_rw)
    return ys

def greedy(action_values):
    '''
    Greedy policy
    '''
    return np.argmax(action_values)

def eps_greedy(action_values, eps=0.1):
    '''
    Eps-greedy policy
    '''
    if np.random.uniform(0,1) < eps:
        # Choose a uniform random action
        return np.random.randint(len(action_values))
    else:
        # Choose the greedy action
        return np.argmax(action_values)

def test_agent(env_test, agent_op, num_games=20, summary=None):
    '''
    Test an agent
    '''
    games_r = []

    for _ in range(num_games):
        d = False
        game_r = 0
        o = env_test.reset()

        while not d:
            a = greedy(np.squeeze(agent_op(o)))
            o, r, d, _ = env_test.step(a)

            game_r += r

        if summary is not None:
            summary.value.add(tag='test_performance', simple_value=game_r)

        games_r.append(game_r)

    return games_r


class DQN_optimization:
    def __init__(self, obs_dim, act_dim, hidden_layers, lr, discount):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        self.hidden_layers = hidden_layers
        self.lr = lr
        self.discount = discount

        self.__build_graph()


    def __build_graph(self):
        
        self.g = tf.Graph()
        with self.g.as_default():
            # Create all the placeholders
            self.obs_ph = tf.placeholder(shape=(None, self.obs_dim[0]), dtype=tf.float32, name='obs')
            self.act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')
            self.y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')

            # Create the target network
            with tf.variable_scope('target_network'):
                self.target_qv = mlp(self.obs_ph, self.hidden_layers, self.act_dim, tf.nn.relu, last_activation=None)
            target_vars = tf.trainable_variables()

            # Create the online network (i.e. the behavior policy)
            with tf.variable_scope('online_network'):
                self.online_qv = mlp(self.obs_ph, self.hidden_layers, self.act_dim, tf.nn.relu, last_activation=None)
            train_vars = tf.trainable_variables()

            # Update the target network by assigning to it the variables of the online network
            # Note that the target network and the online network have the same exact architecture
            update_target = [train_vars[i].assign(train_vars[i+len(target_vars)]) for i in range(len(train_vars) - len(target_vars))]
            self.update_target_op = tf.group(*update_target)

            # One hot encoding of the action
            act_onehot = tf.one_hot(self.act_ph, depth=self.act_dim)
            # We are interested only in the Q-values of those actions
            q_values = tf.reduce_sum(act_onehot * self.online_qv, axis=1)
            
            # MSE loss function
            self.v_loss = tf.reduce_mean((self.y_ph - q_values)**2)
            # Adam optimize that minimize the loss v_loss
            self.v_opt = tf.train.AdamOptimizer(self.lr).minimize(self.v_loss)

            self.__create_session()

            # Copy the online network in the target network
            self.sess.run(self.update_target_op)

    def __create_session(self):
         # open a session
        self.sess = tf.Session(graph=self.g)
        # and initialize all the variables
        self.sess.run(tf.global_variables_initializer())      
    

    def act(self, o):
        '''
        Forward pass to obtain the Q-values from the online network of a single observation
        '''
        return self.sess.run(self.online_qv, feed_dict={self.obs_ph:[o]})

    def optimize(self, mb_obs, mb_rew, mb_act, mb_obs2, mb_done):
        mb_trg_qv = self.sess.run(self.target_qv, feed_dict={self.obs_ph:mb_obs2})
        y_r = q_target_values(mb_rew, mb_done, mb_trg_qv, self.discount)

        # training step
        # optimize, compute the loss and return the TB summary
        self.sess.run(self.v_opt, feed_dict={self.obs_ph:mb_obs, self.y_ph:y_r, self.act_ph: mb_act})

    def update_target_network(self):
        # run the session to update the target network and get the mean loss sumamry 
        self.sess.run(self.update_target_op)


class UCB1:
    def __init__(self, algos, epsilon):
        self.n = 0
        self.epsilon = epsilon
        self.algos = algos

        self.nk = np.zeros(len(algos))
        self.xk = np.zeros(len(algos))

    def choose_algorithm(self):
        # take the best algorithm following UCB1
        current_best = np.argmax([self.xk[i] + np.sqrt(self.epsilon * np.log(self.n) / self.nk[i]) for i in range(len(self.algos))])
        for i in range(len(self.algos)):
            if self.nk[i] < 5:
                return np.random.randint(len(self.algos))

        return current_best

    def update(self, idx_algo, traj_return):
        # Update the mean RL return 
        self.xk[idx_algo] = (self.nk[idx_algo] * self.xk[idx_algo] + traj_return) / (self.nk[idx_algo] + 1)
        # increase the number of trajectories run
        self.nk[idx_algo] += 1
        self.n += 1


def ESBAS(env_name, hidden_sizes=[32], lr=1e-2, num_epochs=2000, buffer_size=100000, discount=0.99, render_cycle=100, update_target_net=1000, 
        batch_size=64, update_freq=4, min_buffer_size=5000, test_frequency=20, start_explor=1, end_explor=0.1, explor_steps=100000,
        xi=1):

    # reset the default graph
    tf.reset_default_graph()

    # Create the environment both for train and test
    env = gym.make(env_name)
    # Add a monitor to the test env to store the videos
    env_test = gym.wrappers.Monitor(gym.make(env_name), "VIDEOS/TEST_VIDEOS"+env_name+str(current_milli_time()),force=True, video_callable=lambda x: x%20==0)

    dqns = []
    for l in hidden_sizes:
        dqns.append(DQN_optimization(env.observation_space.shape, env.action_space.n, l, lr, discount))

    # Time
    now = datetime.now()
    clock_time = "{}_{}.{}.{}".format(now.day, now.hour, now.minute, int(now.second))
    print('Time:', clock_time)

    LOG_DIR = 'log_dir/'+env_name
    hyp_str = "-lr_{}-upTN_{}-upF_{}-xi_{}" .format(lr, update_target_net, update_freq, xi)

    # initialize the File Writer for writing TensorBoard summaries
    file_writer = tf.summary.FileWriter(LOG_DIR+'/ESBAS_'+clock_time+'_'+hyp_str, tf.get_default_graph())

    def DQNs_update(step_counter):
        # If it's time to train the network:
        if len(buffer) > min_buffer_size and (step_counter % update_freq == 0):
        
            # sample a minibatch from the buffer
            mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)

            for dqn in dqns:
                dqn.optimize(mb_obs, mb_rew, mb_act, mb_obs2, mb_done)

        # Every update_target_net steps, update the target network
        if len(buffer) > min_buffer_size and (step_counter % update_target_net == 0):

            for dqn in dqns:
                dqn.update_target_network()
    

    step_count = 0
    episode = 0
    beta = 1

    # Initialize the experience buffer
    buffer = ExperienceBuffer(buffer_size)

    obs = env.reset()

    # policy exploration initialization
    eps = start_explor
    eps_decay = (start_explor - end_explor) / explor_steps


    for ep in range(num_epochs):

        # Policies' training
        for i in range(2**(beta-1), 2**beta):
            DQNs_update(i)

        ucb1 = UCB1(dqns, xi)
        list_bests = []
        ep_rew = []
        beta += 1

        while step_count < 2**beta:

            # Chose the best policy's algortihm that will run the next trajectory 
            best_dqn = ucb1.choose_algorithm()
            list_bests.append(best_dqn)

            summary = tf.Summary()
            summary.value.add(tag='algorithm_selected', simple_value=best_dqn)
            file_writer.add_summary(summary, step_count)
            file_writer.flush()

            g_rew = 0
            done = False
                
            while not done:
                # Epsilon decay
                if eps > end_explor:
                    eps -= eps_decay
                

                # Choose an eps-greedy action 
                act = eps_greedy(np.squeeze(dqns[best_dqn].act(obs)), eps=eps)

                # execute the action in the environment
                obs2, rew, done, _ = env.step(act)

                # Add the transition to the replay buffer
                buffer.add(obs, rew, act, obs2, done)

                obs = obs2
                g_rew += rew
                step_count += 1
            

            # Update the UCB parameters of the algortihm just used
            ucb1.update(best_dqn, g_rew)

            # The environment is ended.. reset it and initialize the variables
            obs = env.reset()
            ep_rew.append(g_rew)
            g_rew = 0
            episode += 1


            # Print some stats and test the best policy
            summary = tf.Summary()
            summary.value.add(tag='train_performance', simple_value=np.mean(ep_rew))

            if episode % 10 == 0:
                unique, counts = np.unique(list_bests, return_counts=True)
                print(dict(zip(unique, counts)))

                test_agent_results = test_agent(env_test, dqns[best_dqn].act, num_games=10, summary=summary)
                print('Epoch:%4d Episode:%4d Rew:%4.2f, Eps:%2.2f -- Step:%5d -- Test:%4.2f Best:%2d Last:%2d' % (ep,episode,np.mean(ep_rew), eps, step_count, np.mean(test_agent_results), best_dqn, g_rew))

            file_writer.add_summary(summary, step_count)
            file_writer.flush()


    file_writer.close()
    env.close()


if __name__ == '__main__':

    #ESBAS('Acrobot-v1', hidden_sizes=[[64, 64]], lr=4e-4, buffer_size=100000, update_target_net=100, batch_size=32, 
    #    update_freq=4, min_buffer_size=100, render_cycle=10000, explor_steps=50000, num_epochs=20000, end_explor=0.1)

    ESBAS('Acrobot-v1', hidden_sizes=[[64], [16, 16], [64, 64]], lr=4e-4, buffer_size=100000, update_target_net=100, batch_size=32, 
        update_freq=4, min_buffer_size=100, render_cycle=10000, explor_steps=50000, num_epochs=20000, end_explor=0.1,
        xi=1./4)

================================================
FILE: README.md
================================================


# Reinforcement Learning Algorithms with Python

<a href="https://www.packtpub.com/data/hands-on-reinforcement-learning-algorithms-with-python"><img src="https://www.packtpub.com/media/catalog/product/cache/ecd051e9670bd57df35c8f0b122d8aea/9/7/9781789131116-original.jpeg" alt="Reinforcement Learning Algorithms with Python" height="256px" align="right"></a>

This is the code repository for [Reinforcement Learning Algorithms with Python](https://www.packtpub.com/data/hands-on-reinforcement-learning-algorithms-with-python), published by Packt.

**Learn, understand, and develop smart algorithms for addressing AI challenges**

## What is this book about?
Reinforcement Learning (RL) is a popular and promising branch of AI that involves making smarter models and agents that can automatically determine ideal behavior based on changing requirements. This book will help you master RL algorithms and understand their implementation as you build self-learning agents.
Starting with an introduction to the tools, libraries, and setup needed to work in the RL environment, this book covers the building blocks of RL and delves into value-based methods, such as the application of Q-learning and SARSA algorithms. You'll learn how to use a combination of Q-learning and neural networks to solve complex problems. Furthermore, you'll study the policy gradient methods, TRPO, and PPO, to improve performance and stability, before moving on to the DDPG and TD3 deterministic algorithms. This book also covers how imitation learning techniques work and how Dagger can teach an agent to drive. You'll discover evolutionary strategies and black-box optimization techniques, and see how they can improve RL algorithms. Finally, you'll get to grips with exploration approaches, such as UCB and UCB1, and develop a meta-algorithm called ESBAS.
By the end of the book, you'll have worked with key RL algorithms to overcome challenges in real-world applications, and be part of the RL research community.


This book covers the following exciting features:
* Develop an agent to play CartPole using the OpenAI Gym interface
* Discover the model-based reinforcement learning paradigm
* Solve the Frozen Lake problem with dynamic programming
* Explore Q-learning and SARSA with a view to playing a taxi game
* Apply Deep Q-Networks (DQNs) to Atari games using Gym
* Study policy gradient algorithms, including Actor-Critic and REINFORCE
* Understand and apply PPO and TRPO in continuous locomotion environments
* Get to grips with evolution strategies for solving the lunar lander problem

If you feel this book is for you, get your [copy](https://www.amazon.com/Reinforcement-Learning-Algorithms-Python-understand/dp/1789131111/) today!

<a href="https://www.packtpub.com/?utm_source=github&utm_medium=banner&utm_campaign=GitHubBanner"><img src="https://raw.githubusercontent.com/PacktPublishing/GitHub/master/GitHub.png" 
alt="https://www.packtpub.com/" border="5" /></a>

## Instructions and Navigations
All of the code is organized into folders. For example, Chapter02.

The code will look like the following:
```
import gym

# create the environment 
env = gym.make("CartPole-v1")
# reset the environment before starting
env.reset()

# loop 10 times
for i in range(10):
    # take a random action
    env.step(env.action_space.sample())
    # render the game
   env.render()

# close the environment
env.close()
```

**Following is what you need for this book:**
If you are an AI researcher, deep learning user, or anyone who wants to learn reinforcement learning from scratch, this book is for you. You’ll also find this reinforcement learning book useful if you want to learn about the advancements in the field. Working knowledge of Python is necessary.	


With the following software and hardware list you can run all code files present in the book (Chapter 1-11).
### Software and Hardware List
| Chapter | Software required | OS required |
| -------- | ------------------------------------ | ----------------------------------- |
| All | Python 3.6 or higher | Windows, Mac OS X, and Linux (Any) |
| All | TensorFlow 1.14 or higher | Windows, Mac OS X, and Linux (Any) |

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. [Click here to download it](http://www.packtpub.com/sites/default/files/downloads/9781789131116_ColorImages.pdf).

### Related products
* Hands-On Reinforcement Learning with Python [[Packt]](https://www.packtpub.com/big-data-and-business-intelligence/hands-reinforcement-learning-python) [[Amazon]](https://www.amazon.com/Hands-Reinforcement-Learning-Python-reinforcement-ebook/dp/B079Q3WLM4/)

* Python Reinforcement Learning Projects [[Packt]](https://www.packtpub.com/big-data-and-business-intelligence/python-reinforcement-learning-projects) [[Amazon]](https://www.amazon.com/Python-Reinforcement-Learning-Projects-hands-ebook/dp/B07F2S82W3/)

## Get to Know the Author
**Andrea Lonza** is a deep learning engineer with a great passion for artificial intelligence and a desire to create machines that act intelligently. He has acquired expert knowledge in reinforcement learning, natural language processing, and computer vision through academic and industrial machine learning projects. He has also participated in several Kaggle competitions, achieving high results. He is always looking for compelling challenges and loves to prove himself.


### Suggestions and Feedback
[Click here](https://docs.google.com/forms/d/e/1FAIpQLSdy7dATC6QmEL81FIUuymZ0Wy9vH1jHkvpY57OiMeKGqib_Ow/viewform) if you have any feedback or suggestions.


### Download a free PDF

 <i>If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.<br>Simply click on the link to claim your free PDF.</i>
<p align="center"> <a href="https://packt.link/free-ebook/9781789131116">https://packt.link/free-ebook/9781789131116 </a> </p>