[
  {
    "path": "Chapter02/Code.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### TensorFlow installation\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"`pip3 install tensorflow`\\n\",\n    \"\\n\",\n    \"or\\n\",\n    \"\\n\",\n    \"`pip3 install tensorflow-gpu`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### OpenAI Gym installation\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"On OSX: \\n\",\n    \"\\n\",\n    \"`brew install cmake boost boost-python sdl2 swig wget`\\n\",\n    \" \\n\",\n    \"On Ubuntu 16.04:\\n\",\n    \"\\n\",\n    \"`apt-get install -y python-pyglet python3-opengl zlib1g-dev libjpeg-dev patchelf cmake swig libboost-all-dev libsdl2-dev libosmesa6-dev xvfb ffmpeg`\\n\",\n    \"\\n\",\n    \"On Ubuntu 18.04\\n\",\n    \"\\n\",\n    \"`sudo apt install -y python3-dev zlib1g-dev libjpeg-dev cmake swig python-pyglet python3-opengl libboost-all-dev libsdl2-dev libosmesa6-dev patchelf ffmpeg xvfb `\\n\",\n    \"\\n\",\n    \"Then:\\n\",\n    \"\\n\",\n    \"```\\n\",\n    \"git clone https://github.com/openai/gym.git \\n\",\n    \"\\n\",\n    \"cd gym\\n\",\n    \"\\n\",\n    \"pip install -e '.[all]'\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"PyBox2D:\\n\",\n    \"\\n\",\n    \"```\\n\",\n    \"git clone https://github.com/pybox2d/pybox2d\\n\",\n    \"cd pybox2d\\n\",\n    \"pip3 install -e .\\n\",\n    \"```\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Duckietown installation\\n\",\n    \"\\n\",\n    \"```\\n\",\n    \"git clone https://github.com/duckietown/gym-duckietown.git\\n\",\n    \"cd gym-duckietown\\n\",\n    \"pip3 install -e .\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Roboschool installation\\n\",\n    \"\\n\",\n    \"```\\n\",\n    \"git clone https://github.com/openai/roboschool\\n\",\n    \"cd roboschool\\n\",\n    \"ROBOSCHOOL_PATH=`pwd`\\n\",\n    \"git clone https://github.com/olegklimov/bullet3 -b roboschool_self_collision\\n\",\n    \"mkdir bullet3/build\\n\",\n    \"cd    bullet3/build\\n\",\n    \"cmake -DBUILD_SHARED_LIBS=ON -DUSE_DOUBLE_PRECISION=1 -DCMAKE_INSTALL_PREFIX:PATH=$ROBOSCHOOL_PATH/roboschool/cpp-household/bullet_local_install -DBUILD_CPU_DEMOS=OFF -DBUILD_BULLET2_DEMOS=OFF -DBUILD_EXTRAS=OFF  -DBUILD_UNIT_TESTS=OFF -DBUILD_CLSOCKET=OFF -DBUILD_ENET=OFF -DBUILD_OPENGL3_DEMOS=OFF ..\\n\",\n    \"\\n\",\n    \"make -j4\\n\",\n    \"make install\\n\",\n    \"cd ../..\\n\",\n    \"pip3 install -e $ROBOSCHOOL_PATH\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## RL cycle\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\u001b[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. 
Please provide explicit dtype.\\u001b[0m\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import gym\\n\",\n    \"\\n\",\n    \"# create the environment \\n\",\n    \"env = gym.make(\\\"CartPole-v1\\\")\\n\",\n    \"# reset the environment before starting\\n\",\n    \"env.reset()\\n\",\n    \"\\n\",\n    \"# loop 10 times\\n\",\n    \"for i in range(10):\\n\",\n    \"    # take a random action\\n\",\n    \"    env.step(env.action_space.sample())\\n\",\n    \"    # render the game\\n\",\n    \"    env.render()\\n\",\n    \"\\n\",\n    \"# close the environment\\n\",\n    \"env.close()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\u001b[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.\\u001b[0m\\n\",\n      \"Episode 0 finished, reward:15\\n\",\n      \"Episode 1 finished, reward:13\\n\",\n      \"Episode 2 finished, reward:20\\n\",\n      \"Episode 3 finished, reward:22\\n\",\n      \"Episode 4 finished, reward:13\\n\",\n      \"Episode 5 finished, reward:18\\n\",\n      \"Episode 6 finished, reward:15\\n\",\n      \"Episode 7 finished, reward:12\\n\",\n      \"Episode 8 finished, reward:58\\n\",\n      \"Episode 9 finished, reward:15\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import gym\\n\",\n    \"\\n\",\n    \"# create and initialize the environment\\n\",\n    \"env = gym.make(\\\"CartPole-v1\\\")\\n\",\n    \"env.reset()\\n\",\n    \"\\n\",\n    \"# play 10 games\\n\",\n    \"for i in range(10):\\n\",\n    \"    # initialize the variables\\n\",\n    \"    done = False\\n\",\n    \"    game_rew = 0\\n\",\n    \"\\n\",\n    \"    while not done:\\n\",\n    \"        # choose a random action\\n\",\n    \"        action = env.action_space.sample()\\n\",\n    \"        # take a step in the environment\\n\",\n    \"        new_obs, rew, done, info = env.step(action)\\n\",\n    \"        game_rew += rew\\n\",\n    \"    \\n\",\n    \"        # when is done, print the cumulative reward of the game and reset the environment\\n\",\n    \"        if done:\\n\",\n    \"            print('Episode %d finished, reward:%d' % (i, game_rew))\\n\",\n    \"            env.reset()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\u001b[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. 
Please provide explicit dtype.\\u001b[0m\\n\",\n      \"Box(4,)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import gym\\n\",\n    \"\\n\",\n    \"env = gym.make('CartPole-v1')\\n\",\n    \"print(env.observation_space)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Discrete(2)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.action_space)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"1\\n\",\n      \"0\\n\",\n      \"0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.action_space.sample())\\n\",\n    \"print(env.action_space.sample())\\n\",\n    \"print(env.action_space.sample())\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.observation_space.low)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.observation_space.high)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## TensorFlow\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"c:\\\\users\\\\andrea\\\\appdata\\\\local\\\\programs\\\\python\\\\python35\\\\lib\\\\site-packages\\\\h5py\\\\__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\\n\",\n      \"  from ._conv import register_converters as _register_converters\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Tensor(\\\"add:0\\\", shape=(), dtype=int32)\\n\",\n      \"7\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"\\n\",\n    \"# create two constants: a and b\\n\",\n    \"a = tf.constant(4)\\n\",\n    \"b = tf.constant(3)\\n\",\n    \"\\n\",\n    \"# perform a computation\\n\",\n    \"c = a + b\\n\",\n    \"print(c) # print the shape of c\\n\",\n    \"\\n\",\n    \"# create a session\\n\",\n    \"session = tf.Session()\\n\",\n    \"# run the session. 
It computes the sum\\n\",\n    \"res = session.run(c)\\n\",\n    \"print(res) # print the actual result\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# reset the graph\\n\",\n    \"tf.reset_default_graph()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Tensor\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"()\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"a = tf.constant(1)\\n\",\n    \"print(a.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"(5,)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# array of five elements\\n\",\n    \"b = tf.constant([1,2,3,4,5])\\n\",\n    \"print(b.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[1 2 3]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# NB: a can be a tensor of any type\\n\",\n    \"a = tf.constant([1,2,3,4,5])\\n\",\n    \"first_three_elem = a[:3]\\n\",\n    \"fourth_elem = a[3]\\n\",\n    \"\\n\",\n    \"sess = tf.Session()\\n\",\n    \"print(sess.run(first_three_elem))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"4\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(sess.run(fourth_elem))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Constant\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Tensor(\\\"a_const:0\\\", shape=(4,), dtype=float32)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"a = tf.constant([1.0, 1.1, 2.1, 3.1], dtype=tf.float32, name='a_const')\\n\",\n    \"print(a)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Placeholder\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[[10.1 10.2 10.3]]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"a = tf.placeholder(shape=(1,3), dtype=tf.float32)\\n\",\n    \"b = tf.constant([[10,10,10]], dtype=tf.float32)\\n\",\n    \"\\n\",\n    \"c = a + b\\n\",\n    \"\\n\",\n    \"sess = tf.Session()\\n\",\n    \"res = sess.run(c, feed_dict={a:[[0.1,0.2,0.3]]})\\n\",\n    \"print(res)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"tf.reset_default_graph()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Tensor(\\\"Placeholder:0\\\", shape=(?, 3), 
dtype=float32)\\n\",\n      \"[[10.1 10.2 10.3]]\\n\",\n      \"[[7. 7. 7.]\\n\",\n      \" [7. 7. 7.]]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"# NB: the first dimension is 'None', meaning that it can be of any length\\n\",\n    \"a = tf.placeholder(shape=(None,3), dtype=tf.float32)\\n\",\n    \"b = tf.placeholder(shape=(None,3), dtype=tf.float32)\\n\",\n    \"\\n\",\n    \"c = a + b\\n\",\n    \"\\n\",\n    \"print(a)\\n\",\n    \"\\n\",\n    \"sess = tf.Session()\\n\",\n    \"print(sess.run(c, feed_dict={a:[[0.1,0.2,0.3]], b:[[10,10,10]]}))\\n\",\n    \"\\n\",\n    \"v_a = np.array([[1,2,3],[4,5,6]])\\n\",\n    \"v_b = np.array([[6,5,4],[3,2,1]])\\n\",\n    \"print(sess.run(c, feed_dict={a:v_a, b:v_b}))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[[10.1 10.2 10.3]]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"sess = tf.Session()\\n\",\n    \"print(sess.run(c, feed_dict={a:[[0.1,0.2,0.3]], b:[[10,10,10]]}))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Variable\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[[0.4478302  0.7014905  0.36300516]]\\n\",\n      \"[[4 5]]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"tf.reset_default_graph()\\n\",\n    \"\\n\",\n    \"# variable initialized using the glorot uniform initializer\\n\",\n    \"var = tf.get_variable(\\\"first_variable\\\", shape=[1,3], dtype=tf.float32, initializer=tf.glorot_uniform_initializer)\\n\",\n    \"\\n\",\n    \"# variable initialized with constant values\\n\",\n    \"init_val = np.array([4,5])\\n\",\n    \"var2 = tf.get_variable(\\\"second_variable\\\", shape=[1,2], dtype=tf.int32, initializer=tf.constant_initializer(init_val))\\n\",\n    \"\\n\",\n    \"# create the session\\n\",\n    \"sess = tf.Session()\\n\",\n    \"# initialize all the variables\\n\",\n    \"sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"print(sess.run(var))\\n\",\n    \"\\n\",\n    \"print(sess.run(var2))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# non-trainable variable\\n\",\n    \"var2 = tf.get_variable(\\\"variable\\\", shape=[1,2], trainable=False, dtype=tf.int32)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[<tf.Variable 'first_variable:0' shape=(1, 3) dtype=float32_ref>, <tf.Variable 'second_variable:0' shape=(1, 2) dtype=int32_ref>, <tf.Variable 'variable:0' shape=(1, 2) dtype=int32_ref>]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(tf.global_variables())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Graph\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"-0.015899599\"\n      ]\n     },\n     \"execution_count\": 22,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   
],\n   \"source\": [\n    \"tf.reset_default_graph()\\n\",\n    \"\\n\",\n    \"const1 = tf.constant(3.0, name='constant1')\\n\",\n    \"\\n\",\n    \"var = tf.get_variable(\\\"variable1\\\", shape=[1,2], dtype=tf.float32)\\n\",\n    \"var2 = tf.get_variable(\\\"variable2\\\", shape=[1,2], trainable=False, dtype=tf.float32)\\n\",\n    \"\\n\",\n    \"op1 = const1 * var\\n\",\n    \"op2 = op1 + var2\\n\",\n    \"op3 = tf.reduce_mean(op2)\\n\",\n    \"\\n\",\n    \"sess = tf.Session()\\n\",\n    \"sess.run(tf.global_variables_initializer())\\n\",\n    \"sess.run(op3)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Simple Linear Regression Example\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Epoch:   0, MSE: 4617.4390, W: 1.295, b: -0.407\\n\",\n      \"Epoch:  40, MSE: 5.3334, W: 0.496, b: -0.727\\n\",\n      \"Epoch:  80, MSE: 4.5894, W: 0.529, b: -0.012\\n\",\n      \"Epoch: 120, MSE: 4.1029, W: 0.512, b: 0.608\\n\",\n      \"Epoch: 160, MSE: 3.8552, W: 0.506, b: 1.092\\n\",\n      \"Epoch: 200, MSE: 3.7597, W: 0.501, b: 1.418\\n\",\n      \"Final weight: 0.500, bias: 1.473\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"tf.reset_default_graph()\\n\",\n    \"\\n\",\n    \"np.random.seed(10)\\n\",\n    \"tf.set_random_seed(10)\\n\",\n    \"\\n\",\n    \"W, b = 0.5, 1.4\\n\",\n    \"# create a dataset of 100 examples\\n\",\n    \"X = np.linspace(0,100, num=100)\\n\",\n    \"# add random noise to the y labels\\n\",\n    \"y = np.random.normal(loc=W * X + b, scale=2.0, size=len(X))\\n\",\n    \"\\n\",\n    \"# create the placeholders\\n\",\n    \"x_ph = tf.placeholder(shape=[None,], dtype=tf.float32)\\n\",\n    \"y_ph = tf.placeholder(shape=[None,], dtype=tf.float32)\\n\",\n    \"\\n\",\n    \"# create the variables.\\n\",\n    \"v_weight = tf.get_variable(\\\"weight\\\", shape=[1], dtype=tf.float32)\\n\",\n    \"v_bias = tf.get_variable(\\\"bias\\\", shape=[1], dtype=tf.float32)\\n\",\n    \"\\n\",\n    \"# linear computation\\n\",\n    \"out = v_weight * x_ph + v_bias\\n\",\n    \"\\n\",\n    \"# compute the Mean Squared Error\\n\",\n    \"loss = tf.reduce_mean((out - y_ph)**2)\\n\",\n    \"\\n\",\n    \"# optimizer\\n\",\n    \"opt = tf.train.AdamOptimizer(0.4).minimize(loss)\\n\",\n    \"\\n\",\n    \"# create the session\\n\",\n    \"session = tf.Session()\\n\",\n    \"session.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"# loop to train the parameters\\n\",\n    \"for ep in range(210):\\n\",\n    \"    # run the optimizer and get the loss\\n\",\n    \"    train_loss, _ = session.run([loss, opt], feed_dict={x_ph:X, y_ph:y})\\n\",\n    \" \\n\",\n    \"    # print epoch number and loss\\n\",\n    \"    if ep % 40 == 0:\\n\",\n    \"        print('Epoch: %3d, MSE: %.4f, W: %.3f, b: %.3f' % (ep, train_loss, session.run(v_weight), session.run(v_bias)))\\n\",\n    \"        \\n\",\n    \"print('Final weight: %.3f, bias: %.3f' % (session.run(v_weight), session.run(v_bias)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### .. 
with TensorBoard\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 24,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Epoch:   0, MSE: 4617.4390, W: 1.295, b: -0.407\\n\",\n      \"Epoch:  40, MSE: 5.3334, W: 0.496, b: -0.727\\n\",\n      \"Epoch:  80, MSE: 4.5894, W: 0.529, b: -0.012\\n\",\n      \"Epoch: 120, MSE: 4.1029, W: 0.512, b: 0.608\\n\",\n      \"Epoch: 160, MSE: 3.8552, W: 0.506, b: 1.092\\n\",\n      \"Epoch: 200, MSE: 3.7597, W: 0.501, b: 1.418\\n\",\n      \"Final weight: 0.500, bias: 1.473\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from datetime import datetime\\n\",\n    \"\\n\",\n    \"tf.reset_default_graph()\\n\",\n    \"\\n\",\n    \"np.random.seed(10)\\n\",\n    \"tf.set_random_seed(10)\\n\",\n    \"\\n\",\n    \"W, b = 0.5, 1.4\\n\",\n    \"# create a dataset of 100 examples\\n\",\n    \"X = np.linspace(0,100, num=100)\\n\",\n    \"# add random noise to the y labels\\n\",\n    \"y = np.random.normal(loc=W * X + b, scale=2.0, size=len(X))\\n\",\n    \"\\n\",\n    \"# create the placeholders\\n\",\n    \"x_ph = tf.placeholder(shape=[None,], dtype=tf.float32)\\n\",\n    \"y_ph = tf.placeholder(shape=[None,], dtype=tf.float32)\\n\",\n    \"\\n\",\n    \"# create the variables.\\n\",\n    \"v_weight = tf.get_variable(\\\"weight\\\", shape=[1], dtype=tf.float32)\\n\",\n    \"v_bias = tf.get_variable(\\\"bias\\\", shape=[1], dtype=tf.float32)\\n\",\n    \"\\n\",\n    \"# linear computation\\n\",\n    \"out = v_weight * x_ph + v_bias\\n\",\n    \"\\n\",\n    \"# compute the Mean Squared Error\\n\",\n    \"loss = tf.reduce_mean((out - y_ph)**2)\\n\",\n    \"\\n\",\n    \"# optimizer\\n\",\n    \"opt = tf.train.AdamOptimizer(0.4).minimize(loss)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"tf.summary.scalar('MSEloss', loss)\\n\",\n    \"tf.summary.histogram('model_weight', v_weight)\\n\",\n    \"tf.summary.histogram('model_bias', v_bias)\\n\",\n    \"all_summary = tf.summary.merge_all()\\n\",\n    \"\\n\",\n    \"now = datetime.now()\\n\",\n    \"clock_time = \\\"{}_{}.{}.{}\\\".format(now.day, now.hour, now.minute, now.second)\\n\",\n    \"file_writer = tf.summary.FileWriter('log_dir/'+clock_time, tf.get_default_graph())\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"# create the session\\n\",\n    \"session = tf.Session()\\n\",\n    \"session.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"# loop to train the parameters\\n\",\n    \"for ep in range(210):\\n\",\n    \"    # run the optimizer and get the loss\\n\",\n    \"    train_loss, _, train_summary = session.run([loss, opt, all_summary], feed_dict={x_ph:X, y_ph:y})\\n\",\n    \"    file_writer.add_summary(train_summary, ep)\\n\",\n    \" \\n\",\n    \"    # print epoch number and loss\\n\",\n    \"    if ep % 40 == 0:\\n\",\n    \"        print('Epoch: %3d, MSE: %.4f, W: %.3f, b: %.3f' % (ep, train_loss, session.run(v_weight), session.run(v_bias)))\\n\",\n    \"        \\n\",\n    \"print('Final weight: %.3f, bias: %.3f' % (session.run(v_weight), session.run(v_bias)))\\n\",\n    \"file_writer.close()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   
\"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
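  {
    "path": "Chapter02/linear_regression_check.py",
    "content": "# Hypothetical companion sketch, not part of the original notebook: it refits\n# the dataset of the 'Simple Linear Regression Example' cell (W=0.5, b=1.4,\n# Gaussian noise with std 2.0, seed 10) with NumPy's closed-form least squares,\n# as a sanity check on the TensorFlow estimates (weight ~0.500, bias ~1.473).\nimport numpy as np\n\nnp.random.seed(10)\n\nW, b = 0.5, 1.4\nX = np.linspace(0, 100, num=100)\ny = np.random.normal(loc=W * X + b, scale=2.0, size=len(X))\n\n# np.polyfit with deg=1 returns the least-squares [slope, intercept]\nw_hat, b_hat = np.polyfit(X, y, deg=1)\nprint('Closed-form weight: %.3f, bias: %.3f' % (w_hat, b_hat))"
  },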
  {
    "path": "Chapter03/frozenlake8x8_policyiteration.py",
    "content": "import numpy as np\nimport gym\n\ndef eval_state_action(V, s, a, gamma=0.99):\n    return np.sum([p * (rew + gamma*V[next_s]) for p, next_s, rew, _ in env.P[s][a]])\n\ndef policy_evaluation(V, policy, eps=0.0001):\n    '''\n    Policy evaluation. Update the value function until it reach a steady state\n    '''\n    while True:\n        delta = 0\n        # loop over all states\n        for s in range(nS):\n            old_v = V[s]\n            # update V[s] using the Bellman equation\n            V[s] = eval_state_action(V, s, policy[s])\n            delta = max(delta, np.abs(old_v - V[s]))\n\n        if delta < eps:\n            break\n\ndef policy_improvement(V, policy):\n    '''\n    Policy improvement. Update the policy based on the value function\n    '''\n    policy_stable = True\n    for s in range(nS):\n        old_a = policy[s]\n        # update the policy with the action that bring to the highest state value\n        policy[s] = np.argmax([eval_state_action(V, s, a) for a in range(nA)])\n        if old_a != policy[s]: \n            policy_stable = False\n\n    return policy_stable\n\n\ndef run_episodes(env, policy, num_games=100):\n    '''\n    Run some games to test a policy\n    '''\n    tot_rew = 0\n    state = env.reset()\n\n    for _ in range(num_games):\n        done = False\n        while not done:\n            # select the action accordingly to the policy\n            next_state, reward, done, _ = env.step(policy[state])\n                \n            state = next_state\n            tot_rew += reward \n            if done:\n                state = env.reset()\n\n    print('Won %i of %i games!'%(tot_rew, num_games))\n\n            \nif __name__ == '__main__':\n    # create the environment\n    env = gym.make('FrozenLake-v0')\n    # enwrap it to have additional information from it\n    env = env.unwrapped\n\n    # spaces dimension\n    nA = env.action_space.n\n    nS = env.observation_space.n\n    \n    # initializing value function and policy\n    V = np.zeros(nS)\n    policy = np.zeros(nS)\n\n    # some useful variable\n    policy_stable = False\n    it = 0\n\n    while not policy_stable:\n        policy_evaluation(V, policy)\n        policy_stable = policy_improvement(V, policy)\n        it += 1\n\n    print('Converged after %i policy iterations'%(it))\n    run_episodes(env, policy)\n    print(V.reshape((4,4)))\n    print(policy.reshape((4,4)))"
  },
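  {
    "path": "Chapter03/toy_mdp_backup.py",
    "content": "# Hypothetical sketch, not part of the original chapter: the eval_state_action\n# backup used by both FrozenLake scripts, applied to a hand-made two-state MDP\n# stored in the same gym-style format, i.e. P[s][a] is a list of\n# (probability, next_state, reward, done) tuples.\nimport numpy as np\n\nP = {\n    # in s0, action 1 reaches state s1 with probability 0.8 and pays reward 1\n    0: {0: [(1.0, 0, 0.0, False)],\n        1: [(0.8, 1, 1.0, False), (0.2, 0, 0.0, False)]},\n    # s1 is terminal: every action stays there with no reward\n    1: {0: [(1.0, 1, 0.0, True)],\n        1: [(1.0, 1, 0.0, True)]},\n}\n\ndef eval_state_action(V, s, a, gamma=0.99):\n    return np.sum([p * (rew + gamma*V[next_s]) for p, next_s, rew, _ in P[s][a]])\n\nV = np.zeros(2)\n# one greedy Bellman backup per state\nfor s in P:\n    V[s] = np.max([eval_state_action(V, s, a) for a in P[s]])\nprint(V)  # V[0] = 0.8 * (1.0 + 0.99 * 0.0) = 0.8, V[1] = 0.0"
  },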
  {
    "path": "Chapter03/frozenlake8x8_valueiteration.py",
    "content": "import numpy as np\nimport gym\n\ndef eval_state_action(V, s, a, gamma=0.99):\n    return np.sum([p * (rew + gamma*V[next_s]) for p, next_s, rew, _ in env.P[s][a]])\n\ndef value_iteration(eps=0.0001):\n    '''\n    Value iteration algorithm\n    '''\n    V = np.zeros(nS)\n    it = 0\n\n    while True:\n        delta = 0\n        # update the value of each state using as \"policy\" the max operator\n        for s in range(nS):\n            old_v = V[s]\n            V[s] = np.max([eval_state_action(V, s, a) for a in range(nA)])\n            delta = max(delta, np.abs(old_v - V[s]))\n\n        if delta < eps:\n            break\n        else:\n            print('Iter:', it, ' delta:', np.round(delta, 5))\n        it += 1\n\n    return V\n\ndef run_episodes(env, V, num_games=100):\n    '''\n    Run some test games\n    '''\n    tot_rew = 0\n    state = env.reset()\n\n    for _ in range(num_games):\n        done = False\n        while not done:\n            action = np.argmax([eval_state_action(V, state, a) for a in range(nA)])\n            next_state, reward, done, _ = env.step(action)\n\n            state = next_state\n            tot_rew += reward \n            if done:\n                state = env.reset()\n\n    print('Won %i of %i games!'%(tot_rew, num_games))\n\n            \nif __name__ == '__main__':\n    # create the environment\n    env = gym.make('FrozenLake-v0')\n    # enwrap it to have additional information from it\n    env = env.unwrapped\n\n    # spaces dimension\n    nA = env.action_space.n\n    nS = env.observation_space.n\n\n    # Value iteration\n    V = value_iteration(eps=0.0001)\n    # test the value function on 100 games\n    run_episodes(env, V, 100)\n    # print the state values\n    print(V.reshape((4,4)))\n\n"
  },
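  {
    "path": "Chapter03/extract_policy.py",
    "content": "# Hypothetical companion sketch, not part of the original chapter: value\n# iteration returns only V, and run_episodes recomputes the greedy action at\n# every step. This factors that extraction into a standalone function over the\n# gym-style transition model env.P (the env must be unwrapped, as above).\nimport numpy as np\n\ndef extract_policy(env, V, gamma=0.99):\n    '''\n    Return the greedy policy with respect to the value function V\n    '''\n    nA = env.action_space.n\n    nS = env.observation_space.n\n\n    def eval_state_action(s, a):\n        return np.sum([p * (rew + gamma*V[next_s]) for p, next_s, rew, _ in env.P[s][a]])\n\n    # one argmax sweep over all the states\n    return np.array([np.argmax([eval_state_action(s, a) for a in range(nA)]) for s in range(nS)])"
  },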
  {
    "path": "Chapter04/SARSA Q_learning Taxi-v2.py",
    "content": "import numpy as np \nimport gym\n\n\ndef eps_greedy(Q, s, eps=0.1):\n    '''\n    Epsilon greedy policy\n    '''\n    if np.random.uniform(0,1) < eps:\n        # Choose a random action\n        return np.random.randint(Q.shape[1])\n    else:\n        # Choose the action of a greedy policy\n        return greedy(Q, s)\n\n\ndef greedy(Q, s):\n    '''\n    Greedy policy\n\n    return the index corresponding to the maximum action-state value\n    '''\n    return np.argmax(Q[s])\n\n\ndef run_episodes(env, Q, num_episodes=100, to_print=False):\n    '''\n    Run some episodes to test the policy\n    '''\n    tot_rew = []\n    state = env.reset()\n\n    for _ in range(num_episodes):\n        done = False\n        game_rew = 0\n\n        while not done:\n            # select a greedy action\n            next_state, rew, done, _ = env.step(greedy(Q, state))\n\n            state = next_state\n            game_rew += rew \n            if done:\n                state = env.reset()\n                tot_rew.append(game_rew)\n\n    if to_print:\n        print('Mean score: %.3f of %i games!'%(np.mean(tot_rew), num_episodes))\n\n    return np.mean(tot_rew)\n\ndef Q_learning(env, lr=0.01, num_episodes=10000, eps=0.3, gamma=0.95, eps_decay=0.00005):\n    nA = env.action_space.n\n    nS = env.observation_space.n\n\n    # Initialize the Q matrix\n    # Q: matrix nS*nA where each row represent a state and each colums represent a different action\n    Q = np.zeros((nS, nA))\n    games_reward = []\n    test_rewards = []\n\n    for ep in range(num_episodes):\n        state = env.reset()\n        done = False\n        tot_rew = 0\n        \n        # decay the epsilon value until it reaches the threshold of 0.01\n        if eps > 0.01:\n            eps -= eps_decay\n\n        # loop the main body until the environment stops\n        while not done:\n            # select an action following the eps-greedy policy\n            action = eps_greedy(Q, state, eps)\n\n            next_state, rew, done, _ = env.step(action) # Take one step in the environment\n\n            # Q-learning update the state-action value (get the max Q value for the next state)\n            Q[state][action] = Q[state][action] + lr*(rew + gamma*np.max(Q[next_state]) - Q[state][action])\n\n            state = next_state\n            tot_rew += rew\n            if done:\n                games_reward.append(tot_rew)\n\n        # Test the policy every 300 episodes and print the results\n        if (ep % 300) == 0:\n            test_rew = run_episodes(env, Q, 1000)\n            print(\"Episode:{:5d}  Eps:{:2.4f}  Rew:{:2.4f}\".format(ep, eps, test_rew))\n            test_rewards.append(test_rew)\n            \n    return Q\n\n\ndef SARSA(env, lr=0.01, num_episodes=10000, eps=0.3, gamma=0.95, eps_decay=0.00005):\n    nA = env.action_space.n\n    nS = env.observation_space.n\n\n    # Initialize the Q matrix\n    # Q: matrix nS*nA where each row represent a state and each colums represent a different action\n    Q = np.zeros((nS, nA))\n    games_reward = []\n    test_rewards = []\n\n    for ep in range(num_episodes):\n        state = env.reset()\n        done = False\n        tot_rew = 0\n\n        # decay the epsilon value until it reaches the threshold of 0.01\n        if eps > 0.01:\n            eps -= eps_decay\n\n\n        action = eps_greedy(Q, state, eps) \n\n        # loop the main body until the environment stops\n        while not done:\n            next_state, rew, done, _ = env.step(action) # Take one step in the 
environment\n\n            # choose the next action (needed for the SARSA update)\n            next_action = eps_greedy(Q, next_state, eps) \n            # SARSA update\n            Q[state][action] = Q[state][action] + lr*(rew + gamma*Q[next_state][next_action] - Q[state][action])\n\n            state = next_state\n            action = next_action\n            tot_rew += rew\n            if done:\n                games_reward.append(tot_rew)\n\n        # Test the policy every 300 episodes and print the results\n        if (ep % 300) == 0:\n            test_rew = run_episodes(env, Q, 1000)\n            print(\"Episode:{:5d}  Eps:{:2.4f}  Rew:{:2.4f}\".format(ep, eps, test_rew))\n            test_rewards.append(test_rew)\n\n    return Q\n\n\nif __name__ == '__main__':\n    env = gym.make('Taxi-v2')\n    \n    Q_qlearning = Q_learning(env, lr=.1, num_episodes=5000, eps=0.4, gamma=0.95, eps_decay=0.001)\n\n    Q_sarsa = SARSA(env, lr=.1, num_episodes=5000, eps=0.4, gamma=0.95, eps_decay=0.001)"
  },
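  {
    "path": "Chapter04/eps_greedy_check.py",
    "content": "# Hypothetical sanity check, not part of the original chapter: under the\n# eps-greedy policy defined above, the greedy action is chosen with probability\n# 1 - eps + eps/nA, since the random branch can also draw it. With eps=0.4 and\n# nA=4 actions that is 0.6 + 0.4/4 = 0.7.\nimport numpy as np\n\ndef eps_greedy(Q, s, eps=0.1):\n    if np.random.uniform(0,1) < eps:\n        return np.random.randint(Q.shape[1])\n    return np.argmax(Q[s])\n\nQ = np.array([[0.0, 1.0, 0.0, 0.0]])  # action 1 is the greedy one in state 0\npicks = np.array([eps_greedy(Q, 0, eps=0.4) for _ in range(100000)])\nprint('Empirical frequency of the greedy action: %.3f' % np.mean(picks == 1))"
  },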
  {
    "path": "Chapter05/DQN_Atari.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nfrom collections import deque\nimport time\nimport sys\n\nfrom atari_wrappers import make_env\n\n\ngym.logger.set_level(40)\n\ncurrent_milli_time = lambda: int(round(time.time() * 1000))\n\ndef cnn(x):\n    '''\n    Convolutional neural network\n    '''\n    x = tf.layers.conv2d(x, filters=16, kernel_size=8, strides=4, padding='valid', activation='relu') \n    x = tf.layers.conv2d(x, filters=32, kernel_size=4, strides=2, padding='valid', activation='relu') \n    return tf.layers.conv2d(x, filters=32, kernel_size=3, strides=1, padding='valid', activation='relu') \n    \n\ndef fnn(x, hidden_layers, output_layer, activation=tf.nn.relu, last_activation=None):\n    '''\n    Feed-forward neural network\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_layer, activation=last_activation)\n\ndef qnet(x, hidden_layers, output_size, fnn_activation=tf.nn.relu, last_activation=None):\n    '''\n    Deep Q network: CNN followed by FNN\n    '''\n    x = cnn(x)\n    x = tf.layers.flatten(x)\n\n    return fnn(x, hidden_layers, output_size, fnn_activation, last_activation)\n\n\nclass ExperienceBuffer():\n    '''\n    Experience Replay Buffer\n    '''\n    def __init__(self, buffer_size):\n        self.obs_buf = deque(maxlen=buffer_size)\n        self.rew_buf = deque(maxlen=buffer_size)\n        self.act_buf = deque(maxlen=buffer_size)\n        self.obs2_buf = deque(maxlen=buffer_size)\n        self.done_buf = deque(maxlen=buffer_size)\n\n\n    def add(self, obs, rew, act, obs2, done):\n        # Add a new transition to the buffers\n        self.obs_buf.append(obs)\n        self.rew_buf.append(rew)\n        self.act_buf.append(act)\n        self.obs2_buf.append(obs2)\n        self.done_buf.append(done)\n        \n\n    def sample_minibatch(self, batch_size):\n        # Sample a minibatch of size batch_size\n        mb_indices = np.random.randint(len(self.obs_buf), size=batch_size)\n\n        mb_obs = scale_frames([self.obs_buf[i] for i in mb_indices])\n        mb_rew = [self.rew_buf[i] for i in mb_indices]\n        mb_act = [self.act_buf[i] for i in mb_indices]\n        mb_obs2 = scale_frames([self.obs2_buf[i] for i in mb_indices])\n        mb_done = [self.done_buf[i] for i in mb_indices]\n\n        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done\n\n    def __len__(self):\n        return len(self.obs_buf)\n\n\ndef q_target_values(mini_batch_rw, mini_batch_done, av, discounted_value):   \n    '''\n    Calculate the target value y for each transition\n    '''\n    max_av = np.max(av, axis=1)\n    \n    # if episode terminate, y take value r\n    # otherwise, q-learning step\n    \n    ys = []\n    for r, d, av in zip(mini_batch_rw, mini_batch_done, max_av):\n        if d:\n            ys.append(r)\n        else:\n            q_step = r + discounted_value * av\n            ys.append(q_step)\n    \n    assert len(ys) == len(mini_batch_rw)\n    return ys\n\ndef greedy(action_values):\n    '''\n    Greedy policy\n    '''\n    return np.argmax(action_values)\n\ndef eps_greedy(action_values, eps=0.1):\n    '''\n    Eps-greedy policy\n    '''\n    if np.random.uniform(0,1) < eps:\n        # Choose a uniform random action\n        return np.random.randint(len(action_values))\n    else:\n        # Choose the greedy action\n        return np.argmax(action_values)\n\ndef test_agent(env_test, agent_op, num_games=20):\n    
'''\n    Test an agent\n    '''\n    games_r = []\n\n    for _ in range(num_games):\n        d = False\n        game_r = 0\n        o = env_test.reset()\n\n        while not d:\n            # Use an eps-greedy policy with eps=0.05 (to add stochasticity to the policy)\n            # Needed because Atari envs are deterministic\n            # If you would use a greedy policy, the results will be always the same\n            a = eps_greedy(np.squeeze(agent_op(o)), eps=0.05)\n            o, r, d, _ = env_test.step(a)\n\n            game_r += r\n\n        games_r.append(game_r)\n\n    return games_r\n\ndef scale_frames(frames):\n    '''\n    Scale the frame with number between 0 and 1\n    '''\n    return np.array(frames, dtype=np.float32) / 255.0\n\ndef DQN(env_name, hidden_sizes=[32], lr=1e-2, num_epochs=2000, buffer_size=100000, discount=0.99, render_cycle=100, update_target_net=1000, \n        batch_size=64, update_freq=4, frames_num=2, min_buffer_size=5000, test_frequency=20, start_explor=1, end_explor=0.1, explor_steps=100000):\n\n    # Create the environment both for train and test\n    env = make_env(env_name, frames_num=frames_num, skip_frames=True, noop_num=20)\n    env_test = make_env(env_name, frames_num=frames_num, skip_frames=True, noop_num=20)\n    # Add a monitor to the test env to store the videos\n    env_test = gym.wrappers.Monitor(env_test, \"VIDEOS/TEST_VIDEOS\"+env_name+str(current_milli_time()),force=True, video_callable=lambda x: x%20==0)\n\n    tf.reset_default_graph()\n\n    obs_dim = env.observation_space.shape\n    act_dim = env.action_space.n \n\n    # Create all the placeholders\n    obs_ph = tf.placeholder(shape=(None, obs_dim[0], obs_dim[1], obs_dim[2]), dtype=tf.float32, name='obs')\n    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')\n    y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')\n\n    # Create the target network\n    with tf.variable_scope('target_network'):\n        target_qv = qnet(obs_ph, hidden_sizes, act_dim)\n    target_vars = tf.trainable_variables()\n\n    # Create the online network (i.e. 
the behavior policy)\n    with tf.variable_scope('online_network'):\n        online_qv = qnet(obs_ph, hidden_sizes, act_dim)\n    train_vars = tf.trainable_variables()\n\n    # Update the target network by assigning to it the variables of the online network\n    # Note that the target network and the online network have exactly the same architecture\n    update_target = [train_vars[i].assign(train_vars[i+len(target_vars)]) for i in range(len(train_vars) - len(target_vars))]\n    update_target_op = tf.group(*update_target)\n\n    # One hot encoding of the action\n    act_onehot = tf.one_hot(act_ph, depth=act_dim)\n    # We are interested only in the Q-values of those actions\n    q_values = tf.reduce_sum(act_onehot * online_qv, axis=1)\n    \n    # MSE loss function\n    v_loss = tf.reduce_mean((y_ph - q_values)**2)\n    # Adam optimizer that minimizes the loss v_loss\n    v_opt = tf.train.AdamOptimizer(lr).minimize(v_loss)\n\n    def agent_op(o):\n        '''\n        Forward pass to obtain the Q-values of a single observation from the online network\n        '''\n        # Scale the frames\n        o = scale_frames(o)\n        return sess.run(online_qv, feed_dict={obs_ph:[o]})\n\n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, int(now.second))\n    print('Time:', clock_time)\n\n    mr_v = tf.Variable(0.0)\n    ml_v = tf.Variable(0.0)\n\n\n    # TensorBoard summaries\n    tf.summary.scalar('v_loss', v_loss)\n    tf.summary.scalar('Q-value', tf.reduce_mean(q_values))\n    tf.summary.histogram('Q-values', q_values)\n\n    scalar_summary = tf.summary.merge_all()\n    reward_summary = tf.summary.scalar('test_rew', mr_v) \n    mean_loss_summary = tf.summary.scalar('mean_loss', ml_v)\n\n    LOG_DIR = 'log_dir/'+env_name\n    hyp_str = \"-lr_{}-upTN_{}-upF_{}-frms_{}\" .format(lr, update_target_net, update_freq, frames_num)\n\n    # initialize the File Writer for writing TensorBoard summaries\n    file_writer = tf.summary.FileWriter(LOG_DIR+'/DQN_'+clock_time+'_'+hyp_str, tf.get_default_graph())\n\n    # open a session\n    sess = tf.Session()\n    # and initialize all the variables\n    sess.run(tf.global_variables_initializer())\n    \n    render_the_game = False\n    step_count = 0\n    last_update_loss = []\n    ep_time = current_milli_time()\n    batch_rew = []\n    old_step_count = 0\n\n    obs = env.reset()\n\n    # Initialize the experience buffer\n    buffer = ExperienceBuffer(buffer_size)\n    \n    # Copy the online network into the target network\n    sess.run(update_target_op)\n\n    ########## EXPLORATION INITIALIZATION ######\n    eps = start_explor\n    eps_decay = (start_explor - end_explor) / explor_steps\n\n    for ep in range(num_epochs):\n        g_rew = 0\n        done = False\n\n        # Loop until the episode ends..\n        while not done:\n                \n            # Epsilon decay\n            if eps > end_explor:\n                eps -= eps_decay\n\n            # Choose an eps-greedy action \n            act = eps_greedy(np.squeeze(agent_op(obs)), eps=eps)\n\n            # execute the action in the environment\n            obs2, rew, done, _ = env.step(act)\n\n            # Render the game if you want to\n            if render_the_game:\n                env.render()\n\n            # Add the transition to the replay buffer\n            buffer.add(obs, rew, act, obs2, done)\n\n            obs = obs2\n            g_rew += rew\n            step_count += 1\n\n            ################ TRAINING ###############\n            # If it's time to train the network:\n            if len(buffer) > min_buffer_size and (step_count % update_freq == 0):\n                \n                # sample a minibatch from the buffer\n                mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)\n\n \n                mb_trg_qv = sess.run(target_qv, feed_dict={obs_ph:mb_obs2})\n                y_r = q_target_values(mb_rew, mb_done, mb_trg_qv, discount)\n\n                # TRAINING STEP\n                # optimize, compute the loss and return the TB summary\n                train_summary, train_loss, _ = sess.run([scalar_summary, v_loss, v_opt], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph: mb_act})\n\n                # Add the train summary to the file_writer\n                file_writer.add_summary(train_summary, step_count)\n                last_update_loss.append(train_loss)\n\n            # Every update_target_net steps, update the target network\n            if (len(buffer) > min_buffer_size) and (step_count % update_target_net == 0):\n\n                # run the session to update the target network and get the mean loss summary \n                _, train_summary = sess.run([update_target_op, mean_loss_summary], feed_dict={ml_v:np.mean(last_update_loss)})\n                file_writer.add_summary(train_summary, step_count)\n                last_update_loss = []\n\n\n            # If the episode has ended, reset the environment and reinitialize the variables\n            if done:\n                obs = env.reset()\n                batch_rew.append(g_rew)\n                g_rew, render_the_game = 0, False\n\n        # every test_frequency episodes, test the agent and write some stats in TensorBoard\n        if ep % test_frequency == 0:\n            # Test the agent on 10 games\n            test_rw = test_agent(env_test, agent_op, num_games=10)\n\n            # Run the test stats and add them to the file_writer\n            test_summary = sess.run(reward_summary, feed_dict={mr_v: np.mean(test_rw)})\n            file_writer.add_summary(test_summary, step_count)\n\n            # Print some useful stats\n            ep_sec_time = int((current_milli_time()-ep_time) / 1000)\n            print('Ep:%4d Rew:%4.2f, Eps:%2.2f -- Step:%5d -- Test:%4.2f %4.2f -- Time:%d -- Ep_Steps:%d' %\n                        (ep,np.mean(batch_rew), eps, step_count, np.mean(test_rw), np.std(test_rw), ep_sec_time, (step_count-old_step_count)/test_frequency))\n\n            ep_time = current_milli_time()\n            batch_rew = []\n            old_step_count = step_count\n                            \n        if ep % render_cycle == 0:\n            render_the_game = True\n\n    file_writer.close()\n    env.close()\n\n\nif __name__ == '__main__':\n\n    DQN('PongNoFrameskip-v4', hidden_sizes=[128], lr=2e-4, buffer_size=100000, update_target_net=1000, batch_size=32, \n        update_freq=2, frames_num=2, min_buffer_size=10000, render_cycle=10000)"
  },
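  {
    "path": "Chapter05/q_target_check.py",
    "content": "# Hypothetical numeric check, not part of the original chapter, of the target\n# computation used in DQN_Atari.py: terminal transitions keep y = r, the others\n# bootstrap with y = r + gamma * max_a Q_target(s', a).\nimport numpy as np\n\ndef q_target_values(mini_batch_rw, mini_batch_done, av, discounted_value):\n    max_av = np.max(av, axis=1)\n    ys = []\n    for r, d, m_av in zip(mini_batch_rw, mini_batch_done, max_av):\n        ys.append(r if d else r + discounted_value * m_av)\n    return ys\n\n# two transitions with reward 1: the first terminal, the second not\ntarget_qv = np.array([[0.5, 2.0], [1.0, 3.0]])\nprint(q_target_values([1.0, 1.0], [True, False], target_qv, 0.99))\n# -> [1.0, 3.97], i.e. 1.0 and 1.0 + 0.99 * 3.0"
  },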
  {
    "path": "Chapter05/DQN_variations_Atari.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nfrom collections import deque\nimport time\nimport sys\n\nfrom atari_wrappers import make_env\n\n\ngym.logger.set_level(40)\n\ncurrent_milli_time = lambda: int(round(time.time() * 1000))\n\n\ndef cnn(x):\n    '''\n    Convolutional neural network\n    '''\n    x = tf.layers.conv2d(x, filters=16, kernel_size=8, strides=4, padding='valid', activation='relu') \n    x = tf.layers.conv2d(x, filters=32, kernel_size=4, strides=2, padding='valid', activation='relu') \n    return tf.layers.conv2d(x, filters=32, kernel_size=3, strides=1, padding='valid', activation='relu') \n    \n\ndef fnn(x, hidden_layers, output_layer, activation=tf.nn.relu, last_activation=None):\n    '''\n    Feed-forward neural network\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_layer, activation=last_activation)\n\ndef qnet(x, hidden_layers, output_size, fnn_activation=tf.nn.relu, last_activation=None):\n    '''\n    Deep Q network: CNN followed by FNN\n    '''\n    x = cnn(x)\n    x = tf.layers.flatten(x)\n\n    return fnn(x, hidden_layers, output_size, fnn_activation, last_activation)\n\ndef greedy(action_values):\n    '''\n    Greedy policy\n    '''\n    return np.argmax(action_values)\n\ndef eps_greedy(action_values, eps=0.1):\n    '''\n    Eps-greedy policy\n    '''\n    if np.random.uniform(0,1) < eps:\n        # Choose a uniform random action\n        return np.random.randint(len(action_values))\n    else:\n        # Choose the greedy action\n        return np.argmax(action_values)\n\ndef q_target_values(mini_batch_rw, mini_batch_done, av, discounted_value):   \n    '''\n    Calculate the target value y for each transition\n    '''\n    max_av = np.max(av, axis=1)\n    \n    # if episode terminate, y take value r\n    # otherwise, q-learning step\n    \n    ys = []\n    for r, d, av in zip(mini_batch_rw, mini_batch_done, max_av):\n        if d:\n            ys.append(r)\n        else:\n            q_step = r + discounted_value * av\n            ys.append(q_step)\n    \n    assert len(ys) == len(mini_batch_rw)\n    return ys\n\ndef test_agent(env_test, agent_op, num_games=20):\n    '''\n    Test an agent\n    '''\n    games_r = []\n\n    for _ in range(num_games):\n        d = False\n        game_r = 0\n        o = env_test.reset()\n\n        while not d:\n            # Use an eps-greedy policy with eps=0.05 (to add stochasticity to the policy)\n            # Needed because Atari envs are deterministic\n            # If you would use a greedy policy, the results will be always the same\n            a = eps_greedy(np.squeeze(agent_op(o)), eps=0.05)\n            o, r, d, _ = env_test.step(a)\n\n            game_r += r\n\n        games_r.append(game_r)\n\n    return games_r\n\ndef scale_frames(frames):\n    '''\n    Scale the frame with number between 0 and 1\n    '''\n    return np.array(frames, dtype=np.float32) / 255.0\n\n\ndef dueling_qnet(x, hidden_layers, output_size, fnn_activation=tf.nn.relu, last_activation=None):\n    '''\n    Dueling neural network\n    '''\n    x = cnn(x)\n    x = tf.layers.flatten(x)\n\n    qf = fnn(x, hidden_layers, 1, fnn_activation, last_activation)\n    aaqf = fnn(x, hidden_layers, output_size, fnn_activation, last_activation)\n\n    return qf + aaqf - tf.reduce_mean(aaqf)\n\ndef double_q_target_values(mini_batch_rw, mini_batch_done, target_qv, online_qv, discounted_value):   ## IS THE 
NAME CORRECT???\n    '''\n    Calculate the target value y following the double Q-learning update\n    '''\n    argmax_online_qv = np.argmax(online_qv, axis=1)\n    \n    # if episode terminate, y take value r\n    # otherwise, q-learning step\n    \n    ys = []\n    assert len(mini_batch_rw) == len(mini_batch_done) == len(target_qv) == len(argmax_online_qv)\n    for r, d, t_av, arg_a in zip(mini_batch_rw, mini_batch_done, target_qv, argmax_online_qv):\n        if d:\n            ys.append(r)\n        else:\n            q_value = r + discounted_value * t_av[arg_a]\n            ys.append(q_value)\n    \n    assert len(ys) == len(mini_batch_rw)\n\n    return ys\n\nclass MultiStepExperienceBuffer():\n    '''\n    Experience Replay Buffer for multi-step learning\n    '''\n    def __init__(self, buffer_size, n_step, gamma):\n        self.obs_buf = deque(maxlen=buffer_size)\n        self.act_buf = deque(maxlen=buffer_size)\n\n        self.n_obs_buf = deque(maxlen=buffer_size)\n        self.n_done_buf = deque(maxlen=buffer_size)\n        self.n_rew_buf = deque(maxlen=buffer_size)\n\n        self.n_step = n_step\n        self.last_rews = deque(maxlen=self.n_step+1)\n        self.gamma = gamma\n\n\n    def add(self, obs, rew, act, obs2, done):\n        self.obs_buf.append(obs)\n        self.act_buf.append(act)\n        # the following buffers will be updated in the next n_step steps\n        # their values are not known, yet\n        self.n_obs_buf.append(None)\n        self.n_rew_buf.append(None)\n        self.n_done_buf.append(None)\n\n        self.last_rews.append(rew)\n\n        ln = len(self.obs_buf)\n        len_rews = len(self.last_rews)\n\n        # Update the indices of the buffer that are n_steps old\n        if done:\n            # In case it's the last step, update up to the n_steps indices fo the buffer\n            # it cannot update more than len(last_rews), otherwise will update the previous traj\n            for i in range(len_rews):\n                self.n_obs_buf[ln-(len_rews-i-1)-1] = obs2\n                self.n_done_buf[ln-(len_rews-i-1)-1] = done\n                rgt = np.sum([(self.gamma**k)*r for k,r in enumerate(np.array(self.last_rews)[i:len_rews])])\n                self.n_rew_buf[ln-(len_rews-i-1)-1] = rgt\n\n            # reset the reward deque\n            self.last_rews = deque(maxlen=self.n_step+1)\n        else:\n            # Update the elements of the buffer that has been added n_step steps ago\n            # Add only if the multi-step values are updated\n            if len(self.last_rews) >= (self.n_step+1):\n                self.n_obs_buf[ln-self.n_step-1] = obs2\n                self.n_done_buf[ln-self.n_step-1] = done\n                rgt = np.sum([(self.gamma**k)*r for k,r in enumerate(np.array(self.last_rews)[:len_rews])])\n                self.n_rew_buf[ln-self.n_step-1] = rgt\n        \n\n    def sample_minibatch(self, batch_size):\n        # Sample a minibatch of size batch_size\n        # Note: the samples should be at least of n_step steps ago\n        mb_indices = np.random.randint(len(self.obs_buf)-self.n_step, size=batch_size)\n\n        mb_obs = scale_frames([self.obs_buf[i] for i in mb_indices])\n        mb_rew = [self.n_rew_buf[i] for i in mb_indices]\n        mb_act = [self.act_buf[i] for i in mb_indices]\n        mb_obs2 = scale_frames([self.n_obs_buf[i] for i in mb_indices])\n        mb_done = [self.n_done_buf[i] for i in mb_indices]\n\n        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done\n\n    def __len__(self):\n        return 
len(self.obs_buf)\n\ndef DQN_with_variations(env_name, extensions_hyp, hidden_sizes=[32], lr=1e-2, num_epochs=2000, buffer_size=100000, discount=0.99, render_cycle=100, update_target_net=1000, \n        batch_size=64, update_freq=4, frames_num=2, min_buffer_size=5000, test_frequency=20, start_explor=1, end_explor=0.1, explor_steps=100000):\n\n    # Create the environment both for train and test\n    env = make_env(env_name, frames_num=frames_num, skip_frames=True, noop_num=20)\n    env_test = make_env(env_name, frames_num=frames_num, skip_frames=True, noop_num=20)\n    # Add a monitor to the test env to store the videos\n    env_test = gym.wrappers.Monitor(env_test, \"VIDEOS/TEST_VIDEOS\"+env_name+str(current_milli_time()),force=True, video_callable=lambda x: x%20==0)\n\n    tf.reset_default_graph()\n\n    obs_dim = env.observation_space.shape\n    act_dim = env.action_space.n \n\n    # Create all the placeholders\n    obs_ph = tf.placeholder(shape=(None, obs_dim[0], obs_dim[1], obs_dim[2]), dtype=tf.float32, name='obs')\n    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')\n    y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')\n\n    # Create the target network\n    with tf.variable_scope('target_network'):\n        if extensions_hyp['dueling']:\n            target_qv = dueling_qnet(obs_ph, hidden_sizes, act_dim)\n        else:\n            target_qv = qnet(obs_ph, hidden_sizes, act_dim)\n    target_vars = tf.trainable_variables()\n\n    # Create the online network (i.e. the behavior policy)\n    with tf.variable_scope('online_network'):\n        if extensions_hyp['dueling']:\n            online_qv = dueling_qnet(obs_ph, hidden_sizes, act_dim)\n        else:\n            online_qv = qnet(obs_ph, hidden_sizes, act_dim)\n    train_vars = tf.trainable_variables()\n\n    # Update the target network by assigning to it the variables of the online network\n    # Note that the target network and the online network have the same exact architecture\n    update_target = [train_vars[i].assign(train_vars[i+len(target_vars)]) for i in range(len(train_vars) - len(target_vars))]\n    update_target_op = tf.group(*update_target)\n\n    # One hot encoding of the action\n    act_onehot = tf.one_hot(act_ph, depth=act_dim)\n    # We are interested only in the Q-values of those actions\n    q_values = tf.reduce_sum(act_onehot * online_qv, axis=1)\n    \n    # MSE loss function\n    v_loss = tf.reduce_mean((y_ph - q_values)**2)\n    # Adam optimize that minimize the loss v_loss\n    v_opt = tf.train.AdamOptimizer(lr).minimize(v_loss)\n\n    def agent_op(o):\n        '''\n        Forward pass to obtain the Q-values from the online network of a single observation\n        '''\n        # Scale the frames\n        o = scale_frames(o)\n        return sess.run(online_qv, feed_dict={obs_ph:[o]})\n\n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, int(now.second))\n    print('Time:', clock_time)\n\n    mr_v = tf.Variable(0.0)\n    ml_v = tf.Variable(0.0)\n\n\n    # TensorBoard summaries\n    tf.summary.scalar('v_loss', v_loss)\n    tf.summary.scalar('Q-value', tf.reduce_mean(q_values))\n    tf.summary.histogram('Q-values', q_values)\n\n    scalar_summary = tf.summary.merge_all()\n    reward_summary = tf.summary.scalar('test_rew', mr_v)\n    mean_loss_summary = tf.summary.scalar('mean_loss', ml_v)\n\n    LOG_DIR = 'log_dir/'+env_name\n    hyp_str = \"-lr_{}-upTN_{}-upF_{}-frms_{}-ddqn_{}-duel_{}-nstep_{}\" \\\n                
.format(lr, update_target_net, update_freq, frames_num, extensions_hyp['DDQN'], extensions_hyp['dueling'], extensions_hyp['multi_step'])\n\n    # initialize the File Writer for writing TensorBoard summaries\n    file_writer = tf.summary.FileWriter(LOG_DIR+'/DQN_'+clock_time+'_'+hyp_str, tf.get_default_graph())\n\n    # open a session\n    sess = tf.Session()\n    # and initialize all the variables\n    sess.run(tf.global_variables_initializer())\n    \n    render_the_game = False\n    step_count = 0\n    last_update_loss = []\n    ep_time = current_milli_time()\n    batch_rew = []\n    old_step_count = 0\n\n    obs = env.reset()\n\n    # Initialize the experience buffer\n    #buffer = ExperienceBuffer(buffer_size)\n    buffer = MultiStepExperienceBuffer(buffer_size, extensions_hyp['multi_step'], discount)\n    \n    # Copy the online network in the target network\n    sess.run(update_target_op)\n\n    ########## EXPLORATION INITIALIZATION ######\n    eps = start_explor\n    eps_decay = (start_explor - end_explor) / explor_steps\n\n    for ep in range(num_epochs):\n        g_rew = 0\n        done = False\n\n        # Until the environment does not end..\n        while not done:\n                \n            # Epsilon decay\n            if eps > end_explor:\n                eps -= eps_decay\n\n            # Choose an eps-greedy action \n            act = eps_greedy(np.squeeze(agent_op(obs)), eps=eps)\n\n            # execute the action in the environment\n            obs2, rew, done, _ = env.step(act)\n\n            # Render the game if you want to\n            if render_the_game:\n                env.render()\n\n            # Add the transition to the replay buffer\n            buffer.add(obs, rew, act, obs2, done)\n\n            obs = obs2\n            g_rew += rew\n            step_count += 1\n\n            ################ TRAINING ###############\n            # If it's time to train the network:\n            if len(buffer) > min_buffer_size and (step_count % update_freq == 0):\n                \n                # sample a minibatch from the buffer\n                mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)\n\n                if extensions_hyp['DDQN']:\n                    mb_onl_qv, mb_trg_qv = sess.run([online_qv,target_qv], feed_dict={obs_ph:mb_obs2})\n                    y_r = double_q_target_values(mb_rew, mb_done, mb_trg_qv, mb_onl_qv, discount)\n                else:\n                    mb_trg_qv = sess.run(target_qv, feed_dict={obs_ph:mb_obs2})\n                    y_r = q_target_values(mb_rew, mb_done, mb_trg_qv, discount)\n\n                # optimize, compute the loss and return the TB summary\n                train_summary, train_loss, _ = sess.run([scalar_summary, v_loss, v_opt], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph: mb_act})\n\n                # Add the train summary to the file_writer\n                file_writer.add_summary(train_summary, step_count)\n                last_update_loss.append(train_loss)\n\n            # Every update_target_net steps, update the target network\n            if (len(buffer) > min_buffer_size) and (step_count % update_target_net == 0):\n\n                # run the session to update the target network and get the mean loss sumamry \n                _, train_summary = sess.run([update_target_op, mean_loss_summary], feed_dict={ml_v:np.mean(last_update_loss)})\n                file_writer.add_summary(train_summary, step_count)\n                last_update_loss = []\n\n\n            # If the environment 
has ended, reset it and initialize the variables\n            if done:\n                obs = env.reset()\n                batch_rew.append(g_rew)\n                g_rew, render_the_game = 0, False\n\n        # every test_frequency episodes, test the agent and write some stats in TensorBoard\n        if ep % test_frequency == 0:\n            # Test the agent on 10 games\n            test_rw = test_agent(env_test, agent_op, num_games=10)\n\n            # Run the test stats and add them to the file_writer\n            test_summary = sess.run(reward_summary, feed_dict={mr_v: np.mean(test_rw)})\n            file_writer.add_summary(test_summary, step_count)\n\n            # Print some useful stats\n            ep_sec_time = int((current_milli_time()-ep_time) / 1000)\n            print('Ep:%4d Rew:%4.2f, Eps:%2.2f -- Step:%5d -- Test:%4.2f %4.2f -- Time:%d -- Ep_Steps:%d' %\n                        (ep,np.mean(batch_rew), eps, step_count, np.mean(test_rw), np.std(test_rw), ep_sec_time, (step_count-old_step_count)/test_frequency))\n\n            ep_time = current_milli_time()\n            batch_rew = []\n            old_step_count = step_count\n                            \n        if ep % render_cycle == 0:\n            render_the_game = True\n\n    file_writer.close()\n    env.close()\n\n\nif __name__ == '__main__':\n\n    extensions_hyp={\n        'DDQN':False,\n        'dueling':False,\n        'multi_step':1\n    }\n    DQN_with_variations('PongNoFrameskip-v4', extensions_hyp, hidden_sizes=[128], lr=2e-4, buffer_size=100000, update_target_net=1000, batch_size=32, \n        update_freq=2, frames_num=2, min_buffer_size=10000, render_cycle=10000)"
  },
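  {
    "path": "Chapter05/DoubleQ_target_sketch.py",
    "content": "'''\nIllustrative sketch, NOT part of the original repository: a NumPy version of\nthe double Q-learning target used when extensions_hyp['DDQN'] is enabled.\nThe online network chooses the next action and the target network evaluates\nit: y = r + gamma * Q_target(s', argmax_a Q_online(s', a)), with y = r on\nterminal transitions. The chapter's double_q_target_values helper (defined\nelsewhere) is assumed to compute the same quantity; the function below is\nonly a sketch for clarity.\n'''\nimport numpy as np\n\ndef double_q_targets_sketch(rews, dones, target_qv, online_qv, discount):\n    # greedy next actions according to the online network...\n    best_acts = np.argmax(online_qv, axis=1)\n    # ...evaluated with the target network\n    q_next = target_qv[np.arange(len(rews)), best_acts]\n    not_done = 1.0 - np.array(dones, dtype=np.float32)\n    return np.array(rews) + discount * q_next * not_done\n\nif __name__ == '__main__':\n    rews = [1.0, 0.0]\n    dones = [False, True]\n    target_qv = np.array([[0.2, 0.9], [0.5, 0.1]])\n    online_qv = np.array([[0.8, 0.1], [0.3, 0.4]])\n    # first transition: 1 + 0.99 * Q_target(s', a=0) = 1 + 0.99 * 0.2 = 1.198\n    # second transition is terminal, so the target is just the reward: 0.0\n    print(double_q_targets_sketch(rews, dones, target_qv, online_qv, 0.99))\n"
  },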
  {
    "path": "Chapter05/atari_wrappers.py",
    "content": "import numpy as np\nimport os\nfrom collections import deque\nimport gym\nfrom gym import spaces\nimport cv2\n\n''' \nAtari Wrapper copied from https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py\n'''\n\nclass NoopResetEnv(gym.Wrapper):\n    def __init__(self, env, noop_max=30):\n        \"\"\"Sample initial states by taking random number of no-ops on reset.\n        No-op is assumed to be action 0.\n        \"\"\"\n        gym.Wrapper.__init__(self, env)\n        self.noop_max = noop_max\n        self.override_num_noops = None\n        self.noop_action = 0\n        assert env.unwrapped.get_action_meanings()[0] == 'NOOP'\n\n    def reset(self, **kwargs):\n        \"\"\" Do no-op action for a number of steps in [1, noop_max].\"\"\"\n        self.env.reset(**kwargs)\n        if self.override_num_noops is not None:\n            noops = self.override_num_noops\n        else:\n            noops = self.unwrapped.np_random.randint(1, self.noop_max + 1) #pylint: disable=E1101\n        assert noops > 0\n        obs = None\n        for _ in range(noops):\n            obs, _, done, _ = self.env.step(self.noop_action)\n            if done:\n                obs = self.env.reset(**kwargs)\n        return obs\n\n    def step(self, ac):\n        return self.env.step(ac)\n\nclass LazyFrames(object):\n    def __init__(self, frames):\n        \"\"\"This object ensures that common frames between the observations are only stored once.\n        It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay\n        buffers.\n        This object should only be converted to numpy array before being passed to the model.\n        You'd not believe how complex the previous solution was.\"\"\"\n        self._frames = frames\n        self._out = None\n\n    def _force(self):\n        if self._out is None:\n            self._out = np.concatenate(self._frames, axis=2)\n            self._frames = None\n        return self._out\n\n    def __array__(self, dtype=None):\n        out = self._force()\n        if dtype is not None:\n            out = out.astype(dtype)\n        return out\n\n    def __len__(self):\n        return len(self._force())\n\n    def __getitem__(self, i):\n        return self._force()[i]\n\nclass FireResetEnv(gym.Wrapper):\n    def __init__(self, env):\n        \"\"\"Take action on reset for environments that are fixed until firing.\"\"\"\n        gym.Wrapper.__init__(self, env)\n        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'\n        assert len(env.unwrapped.get_action_meanings()) >= 3\n\n    def reset(self, **kwargs):\n        self.env.reset(**kwargs)\n        obs, _, done, _ = self.env.step(1)\n        if done:\n            self.env.reset(**kwargs)\n        obs, _, done, _ = self.env.step(2)\n        if done:\n            self.env.reset(**kwargs)\n        return obs\n\n    def step(self, ac):\n        return self.env.step(ac)\n\n\nclass MaxAndSkipEnv(gym.Wrapper):\n    def __init__(self, env, skip=4):\n        \"\"\"Return only every `skip`-th frame\"\"\"\n        gym.Wrapper.__init__(self, env)\n        # most recent raw observations (for max pooling across time steps)\n        self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)\n        self._skip       = skip\n\n    def step(self, action):\n        \"\"\"Repeat action, sum reward, and max over last observations.\"\"\"\n        total_reward = 0.0\n        done = None\n        for i in range(self._skip):\n            obs, reward, done, 
info = self.env.step(action)\n            if i == self._skip - 2: self._obs_buffer[0] = obs\n            if i == self._skip - 1: self._obs_buffer[1] = obs\n            total_reward += reward\n            if done:\n                break\n        # Note that the observation on the done=True frame\n        # doesn't matter\n        max_frame = self._obs_buffer.max(axis=0)\n\n        return max_frame, total_reward, done, info\n\n    def reset(self, **kwargs):\n        return self.env.reset(**kwargs)\n\n\n\nclass WarpFrame(gym.ObservationWrapper):\n    def __init__(self, env):\n        \"\"\"Warp frames to 84x84 as done in the Nature paper and later work.\"\"\"\n        gym.ObservationWrapper.__init__(self, env)\n        self.width = 84\n        self.height = 84\n        self.observation_space = spaces.Box(low=0, high=255,\n            shape=(self.height, self.width, 1), dtype=np.uint8)\n\n    def observation(self, frame):\n        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)\n        frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)\n        return frame[:, :, None]\n\n\n\nclass FrameStack(gym.Wrapper):\n    def __init__(self, env, k):\n        \"\"\"Stack k last frames.\n        Returns lazy array, which is much more memory efficient.\n        See Also\n        baselines.common.atari_wrappers.LazyFrames\n        \"\"\"\n        gym.Wrapper.__init__(self, env)\n        self.k = k\n        self.frames = deque([], maxlen=k)\n        shp = env.observation_space.shape\n        self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=env.observation_space.dtype)\n\n    def reset(self):\n        ob = self.env.reset()\n        for _ in range(self.k):\n            self.frames.append(ob)\n        return self._get_ob()\n\n    def step(self, action):\n        ob, reward, done, info = self.env.step(action)\n        self.frames.append(ob)\n        return self._get_ob(), reward, done, info\n\n    def _get_ob(self):\n        assert len(self.frames) == self.k\n        return LazyFrames(list(self.frames))\n\n\nclass ScaledFloatFrame(gym.ObservationWrapper):\n    def __init__(self, env):\n        gym.ObservationWrapper.__init__(self, env)\n        self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)\n\n    def observation(self, observation):\n        # careful! This undoes the memory optimization, use\n        # with smaller replay buffers only.\n        return np.array(observation).astype(np.float32) / 255.0\n\n\ndef make_env(env_name, fire=True, frames_num=2, noop_num=30, skip_frames=True):\n    env = gym.make(env_name)\n    \n    if skip_frames:\n        env = MaxAndSkipEnv(env) ## Return only every `skip`-th frame\n    if fire:\n       env = FireResetEnv(env) ## Fire at the beginning\n    env = NoopResetEnv(env, noop_max=noop_num)\n    env = WarpFrame(env) ## Reshape image\n    env = FrameStack(env, frames_num) ## Stack the last frames_num frames\n    #env = ScaledFloatFrame(env) ## Scale frames\n    return env"
  },
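  {
    "path": "Chapter05/atari_wrappers_example.py",
    "content": "'''\nUsage sketch, NOT part of the original repository: builds the wrapped Atari\nenvironment with atari_wrappers.make_env and inspects the preprocessed\nobservations. Assumes gym with the Atari ROMs and opencv-python are installed\nand that the script is run from the Chapter05 directory.\n'''\nimport numpy as np\nfrom atari_wrappers import make_env\n\nif __name__ == '__main__':\n    env = make_env('PongNoFrameskip-v4', frames_num=2, skip_frames=True, noop_num=20)\n\n    obs = env.reset()\n    # WarpFrame resizes to 84x84 grayscale and FrameStack concatenates\n    # frames_num frames on the channel axis, hence the shape (84, 84, 2)\n    print('observation shape:', np.array(obs).shape)\n\n    obs, rew, done, _ = env.step(env.action_space.sample())\n    print('reward:', rew, 'done:', done)\n    env.close()\n"
  },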
  {
    "path": "Chapter06/AC.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nimport time\n\n\ndef mlp(x, hidden_layers, output_size, activation=tf.nn.relu, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_size, activation=last_activation)\n\ndef softmax_entropy(logits):\n    '''\n    Softmax Entropy\n    '''\n    return tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)\n\ndef discounted_rewards(rews, last_sv, gamma):\n    '''\n    Discounted reward to go \n\n    Parameters:\n    ----------\n    rews: list of rewards\n    last_sv: value of the last state\n    gamma: discount value \n    '''\n    rtg = np.zeros_like(rews, dtype=np.float32)\n    rtg[-1] = rews[-1] + gamma*last_sv\n    for i in reversed(range(len(rews)-1)):\n        rtg[i] = rews[i] + gamma*rtg[i+1]\n    return rtg\n\nclass Buffer():\n    '''\n    Buffer class to store the experience from a unique policy\n    '''\n    def __init__(self, gamma=0.99):\n        self.gamma = gamma\n        self.obs = []\n        self.act = []\n        self.ret = []\n        self.rtg = []\n\n    def store(self, temp_traj, last_sv):\n        '''\n        Add temp_traj values to the buffers and compute the advantage and reward to go\n\n        Parameters:\n        -----------\n        temp_traj: list where each element is a list that contains: observation, reward, action, state-value\n        last_sv: value of the last state (Used to Bootstrap)\n        '''\n        # store only if the temp_traj list is not empty\n        if len(temp_traj) > 0:\n            self.obs.extend(temp_traj[:,0])\n            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)\n            self.ret.extend(rtg - temp_traj[:,3])\n            self.rtg.extend(rtg)\n            self.act.extend(temp_traj[:,2])\n\n    def get_batch(self):\n        return self.obs, self.act, self.ret, self.rtg\n\n    def __len__(self):\n        assert(len(self.obs) == len(self.act) == len(self.ret) == len(self.rtg))\n        return len(self.obs)\n    \ndef AC(env_name, hidden_sizes=[32], ac_lr=5e-3, cr_lr=8e-3, num_epochs=50, gamma=0.99, steps_per_epoch=100, steps_to_print=100):\n    '''\n    Actor-Critic Algorithm\ns\n    Parameters:\n    -----------\n    env_name: Name of the environment\n    hidden_size: list of the number of hidden units for each layer\n    ac_lr: actor learning rate\n    cr_lr: critic learning rate\n    num_epochs: number of training epochs\n    gamma: discount factor\n    steps_per_epoch: number of steps per epoch\n    '''\n    tf.reset_default_graph()\n\n    env = gym.make(env_name)    \n\n    \n    obs_dim = env.observation_space.shape\n    act_dim = env.action_space.n \n\n    # Placeholders\n    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')\n    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')\n    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')\n    rtg_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='rtg')\n\n    #####################################################\n    ########### COMPUTE THE PG LOSS FUNCTIONS ###########\n    #####################################################\n\n    # policy\n    p_logits = mlp(obs_ph, hidden_sizes, act_dim, activation=tf.tanh)\n\n    act_multn = tf.squeeze(tf.random.multinomial(p_logits, 1))\n    actions_mask = 
tf.one_hot(act_ph, depth=act_dim)\n    p_log = tf.reduce_sum(actions_mask * tf.nn.log_softmax(p_logits), axis=1)\n    # entropy, useful for monitoring the algorithm\n    entropy = -tf.reduce_mean(softmax_entropy(p_logits))\n    p_loss = -tf.reduce_mean(p_log*ret_ph)\n\n    # policy optimization\n    p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss)\n\n    #######################################\n    ###########  VALUE FUNCTION ###########\n    #######################################\n    \n    # value function\n    s_values = tf.squeeze(mlp(obs_ph, hidden_sizes, 1, activation=tf.tanh))\n    # MSE loss function\n    v_loss = tf.reduce_mean((rtg_ph - s_values)**2)\n    # value function optimization\n    v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)\n\n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, now.second)\n    print('Time:', clock_time)\n\n\n    # Set scalars and histograms for TensorBoard\n    tf.summary.scalar('p_loss', p_loss, collections=['train'])\n    tf.summary.scalar('v_loss', v_loss, collections=['train'])\n    tf.summary.scalar('entropy', entropy, collections=['train'])\n    tf.summary.scalar('s_values', tf.reduce_mean(s_values), collections=['train'])\n    tf.summary.histogram('p_soft', tf.nn.softmax(p_logits), collections=['train'])\n    tf.summary.histogram('p_log', p_log, collections=['train'])\n    tf.summary.histogram('act_multn', act_multn, collections=['train'])\n    tf.summary.histogram('p_logits', p_logits, collections=['train'])\n    tf.summary.histogram('ret_ph', ret_ph, collections=['train'])\n    tf.summary.histogram('rtg_ph', rtg_ph, collections=['train'])\n    tf.summary.histogram('s_values', s_values, collections=['train'])\n    train_summary = tf.summary.merge_all('train')\n\n    tf.summary.scalar('old_v_loss', v_loss, collections=['pre_train'])\n    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])\n    pre_scalar_summary = tf.summary.merge_all('pre_train')\n\n    hyp_str = '-steps_{}-aclr_{}-crlr_{}'.format(steps_per_epoch, ac_lr, cr_lr)\n    file_writer = tf.summary.FileWriter('log_dir/{}/AC_{}_{}'.format(env_name, clock_time, hyp_str), tf.get_default_graph())\n    \n    # create a session\n    sess = tf.Session()\n    # initialize the variables\n    sess.run(tf.global_variables_initializer())\n\n    # a few variables\n    step_count = 0\n    train_rewards = []\n    train_ep_len = []\n    timer = time.time()\n    last_print_step = 0\n\n    # Reset the environment at the beginning of the cycle\n    obs = env.reset()\n    ep_rews = []\n\n    # main cycle\n    for ep in range(num_epochs):\n\n        # initialize the buffer and other variables for the new epoch\n        buffer = Buffer(gamma)\n        env_buf = []\n        \n        # always iterate over a fixed number of steps\n        for _ in range(steps_per_epoch):\n\n            # run the policy\n            act, val = sess.run([act_multn, s_values], feed_dict={obs_ph:[obs]})\n            # take a step in the environment\n            obs2, rew, done, _ = env.step(np.squeeze(act))\n\n            # add the new transition\n            env_buf.append([obs.copy(), rew, act, np.squeeze(val)])\n\n            obs = obs2.copy()\n\n            step_count += 1\n            last_print_step += 1\n            ep_rews.append(rew)\n\n            if done:\n                # store the trajectory just completed\n                # Changed from REINFORCE! The second parameter is the estimated value of the next state. 
Because the episode is done,\n                # we pass a value of 0\n                buffer.store(np.array(env_buf), 0)\n                env_buf = []\n                # store additional information about the episode\n                train_rewards.append(np.sum(ep_rews))\n                train_ep_len.append(len(ep_rews))\n                # reset the environment\n                obs = env.reset()\n                ep_rews = []\n\n        # Bootstrap with the estimated state value of the next state!\n        if len(env_buf) > 0:\n            last_sv = sess.run(s_values, feed_dict={obs_ph:[obs]})\n            buffer.store(np.array(env_buf), last_sv)\n\n        # collect the episodes' information\n        obs_batch, act_batch, ret_batch, rtg_batch = buffer.get_batch()\n        \n        # run pre_scalar_summary before the optimization phase\n        old_p_loss, old_v_loss, epochs_summary = sess.run([p_loss, v_loss, pre_scalar_summary], feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})\n        file_writer.add_summary(epochs_summary, step_count)\n\n        # Optimize the actor and the critic\n        sess.run([p_opt, v_opt], feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})\n\n        # run train_summary to save the summary after the optimization\n        new_p_loss, new_v_loss, train_summary_run = sess.run([p_loss, v_loss, train_summary], feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})\n        file_writer.add_summary(train_summary_run, step_count)\n        summary = tf.Summary()\n        summary.value.add(tag='diff/p_loss', simple_value=(old_p_loss - new_p_loss))\n        summary.value.add(tag='diff/v_loss', simple_value=(old_v_loss - new_v_loss))\n        file_writer.add_summary(summary, step_count)\n        file_writer.flush()\n\n        # it's time to print some useful information\n        if last_print_step > steps_to_print:\n            print('Ep:%d MnRew:%.2f MxRew:%.1f EpLen:%.1f Buffer:%d -- Step:%d -- Time:%d' % (ep, np.mean(train_rewards), np.max(train_rewards), np.mean(train_ep_len), len(buffer), step_count,time.time()-timer))\n\n            summary = tf.Summary()\n            summary.value.add(tag='supplementary/len', simple_value=np.mean(train_ep_len))\n            summary.value.add(tag='supplementary/train_rew', simple_value=np.mean(train_rewards))\n            file_writer.add_summary(summary, step_count)\n            file_writer.flush()\n\n            timer = time.time()\n            train_rewards = []\n            train_ep_len = []\n            last_print_step = 0\n\n    env.close()\n    file_writer.close()\n\n\nif __name__ == '__main__':\n    AC('LunarLander-v2', hidden_sizes=[64], ac_lr=4e-3, cr_lr=1.5e-2, gamma=0.99, steps_per_epoch=100, steps_to_print=5000, num_epochs=8000)\n"
  },
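  {
    "path": "Chapter06/AC_example.py",
    "content": "'''\nNumeric sketch, NOT part of the original repository, of the bootstrapped\nreward to go computed by discounted_rewards in AC.py. With rewards [1, 1, 1],\na last state value of 10 and gamma=0.5, working backwards:\nrtg[2] = 1 + 0.5*10 = 6, rtg[1] = 1 + 0.5*6 = 4, rtg[0] = 1 + 0.5*4 = 3.\nImporting AC requires TensorFlow 1.x (AC.py imports it at module level) and\nrunning from the Chapter06 directory.\n'''\nfrom AC import discounted_rewards\n\nif __name__ == '__main__':\n    # expected output: [3. 4. 6.]\n    print(discounted_rewards([1.0, 1.0, 1.0], 10.0, 0.5))\n"
  },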
  {
    "path": "Chapter06/REINFORCE.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nimport time\n\n\ndef mlp(x, hidden_layers, output_size, activation=tf.nn.relu, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_size, activation=last_activation)\n\ndef softmax_entropy(logits):\n    '''\n    Softmax Entropy\n    '''\n    return tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)\n\n\ndef discounted_rewards(rews, gamma):\n    '''\n    Discounted reward to go \n\n    Parameters:\n    ----------\n    rews: list of rewards\n    gamma: discount value \n    '''\n    rtg = np.zeros_like(rews, dtype=np.float32)\n    rtg[-1] = rews[-1]\n    for i in reversed(range(len(rews)-1)):\n        rtg[i] = rews[i] + gamma*rtg[i+1]\n    return rtg\n\nclass Buffer():\n    '''\n    Buffer class to store the experience from a unique policy\n    '''\n    def __init__(self, gamma=0.99):\n        self.gamma = gamma\n        self.obs = []\n        self.act = []\n        self.ret = []\n\n    def store(self, temp_traj):\n        '''\n        Add temp_traj values to the buffers and compute the advantage and reward to go\n\n        Parameters:\n        -----------\n        temp_traj: list where each element is a list that contains: observation, reward, action, state-value\n        '''\n        # store only if the temp_traj list is not empty\n        if len(temp_traj) > 0:\n            self.obs.extend(temp_traj[:,0])\n            rtg = discounted_rewards(temp_traj[:,1], self.gamma)\n            self.ret.extend(rtg)\n            self.act.extend(temp_traj[:,2])\n\n    def get_batch(self):\n        b_ret = self.ret\n        return self.obs, self.act, b_ret\n\n    def __len__(self):\n        assert(len(self.obs) == len(self.act) == len(self.ret))\n        return len(self.obs)\n    \n\ndef REINFORCE(env_name, hidden_sizes=[32], lr=5e-3, num_epochs=50, gamma=0.99, steps_per_epoch=100):\n    '''\n    REINFORCE Algorithm\n\n    Parameters:\n    -----------\n    env_name: Name of the environment\n    hidden_size: list of the number of hidden units for each layer\n    lr: policy learning rate\n    gamma: discount factor\n    steps_per_epoch: number of steps per epoch\n    num_epochs: number train epochs (Note: they aren't properly epochs)\n    '''\n    tf.reset_default_graph()\n\n    env = gym.make(env_name)    \n\n    \n    obs_dim = env.observation_space.shape\n    act_dim = env.action_space.n \n\n    # Placeholders\n    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')\n    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')\n    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')\n\n    ##################################################\n    ########### COMPUTE THE LOSS FUNCTIONS ###########\n    ##################################################\n\n\n    # policy\n    p_logits = mlp(obs_ph, hidden_sizes, act_dim, activation=tf.tanh)\n\n\n    act_multn = tf.squeeze(tf.random.multinomial(p_logits, 1))\n    actions_mask = tf.one_hot(act_ph, depth=act_dim)\n\n    p_log = tf.reduce_sum(actions_mask * tf.nn.log_softmax(p_logits), axis=1)\n\n    # entropy useful to study the algorithms\n    entropy = -tf.reduce_mean(softmax_entropy(p_logits))\n    p_loss = -tf.reduce_mean(p_log*ret_ph)\n\n    # policy optimization\n    p_opt = 
tf.train.AdamOptimizer(lr).minimize(p_loss)\n\n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, now.second)\n    print('Time:', clock_time)\n\n\n    # Set scalars and histograms for TensorBoard\n    tf.summary.scalar('p_loss', p_loss, collections=['train'])\n    tf.summary.scalar('entropy', entropy, collections=['train'])\n    tf.summary.histogram('p_soft', tf.nn.softmax(p_logits), collections=['train'])\n    tf.summary.histogram('p_log', p_log, collections=['train'])\n    tf.summary.histogram('act_multn', act_multn, collections=['train'])\n    tf.summary.histogram('p_logits', p_logits, collections=['train'])\n    tf.summary.histogram('ret_ph', ret_ph, collections=['train'])\n    train_summary = tf.summary.merge_all('train')\n\n    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])\n    pre_scalar_summary = tf.summary.merge_all('pre_train')\n\n    hyp_str = '-steps_{}-aclr_{}'.format(steps_per_epoch, lr)\n    file_writer = tf.summary.FileWriter('log_dir/{}/REINFORCE_{}_{}'.format(env_name, clock_time, hyp_str), tf.get_default_graph())\n    \n    # create a session\n    sess = tf.Session()\n    # initialize the variables\n    sess.run(tf.global_variables_initializer())\n\n    # a few variables\n    step_count = 0\n    train_rewards = []\n    train_ep_len = []\n    timer = time.time()\n\n    # main cycle\n    for ep in range(num_epochs):\n\n        # initialize the environment for the new epoch\n        obs = env.reset()\n\n        # initialize the buffer and other variables for the new epoch\n        buffer = Buffer(gamma)\n        env_buf = []\n        ep_rews = []\n        \n        while len(buffer) < steps_per_epoch:\n\n            # run the policy\n            act = sess.run(act_multn, feed_dict={obs_ph:[obs]})\n            # take a step in the environment\n            obs2, rew, done, _ = env.step(np.squeeze(act))\n\n            # add the new transition\n            env_buf.append([obs.copy(), rew, act])\n\n            obs = obs2.copy()\n\n            step_count += 1\n            ep_rews.append(rew)\n\n            if done:\n                # store the trajectory just completed\n                buffer.store(np.array(env_buf))\n                env_buf = []\n                # store additional information about the episode\n                train_rewards.append(np.sum(ep_rews))\n                train_ep_len.append(len(ep_rews))\n                # reset the environment\n                obs = env.reset()\n                ep_rews = []\n\n        # collect the episodes' information\n        obs_batch, act_batch, ret_batch = buffer.get_batch()\n        \n        # run pre_scalar_summary before the optimization phase\n        epochs_summary = sess.run(pre_scalar_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch})\n        file_writer.add_summary(epochs_summary, step_count)\n\n        # Optimize the policy\n        sess.run(p_opt, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch})\n\n        # run train_summary to save the summary after the optimization\n        train_summary_run = sess.run(train_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch})\n        file_writer.add_summary(train_summary_run, step_count)\n\n        # it's time to print some useful information\n        if ep % 10 == 0:\n            print('Ep:%d MnRew:%.2f MxRew:%.1f EpLen:%.1f Buffer:%d -- Step:%d -- Time:%d' % (ep, np.mean(train_rewards), np.max(train_rewards), np.mean(train_ep_len), len(buffer), 
step_count,time.time()-timer))\n\n            summary = tf.Summary()\n            summary.value.add(tag='supplementary/len', simple_value=np.mean(train_ep_len))\n            summary.value.add(tag='supplementary/train_rew', simple_value=np.mean(train_rewards))\n            file_writer.add_summary(summary, step_count)\n            file_writer.flush()\n\n            timer = time.time()\n            train_rewards = []\n            train_ep_len = []\n\n\n    env.close()\n    file_writer.close()\n\n\nif __name__ == '__main__':\n    REINFORCE('LunarLander-v2', hidden_sizes=[64], lr=8e-3, gamma=0.99, num_epochs=1000, steps_per_epoch=1000)"
  },
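  {
    "path": "Chapter06/REINFORCE_CartPole_example.py",
    "content": "'''\nUsage sketch, NOT part of the original repository: runs REINFORCE on a\nsmaller discrete-action environment. CartPole-v1 fits the algorithm's\nassumptions (a 1D observation vector and a discrete action space); the\nhyperparameters below are illustrative, not tuned. Requires TensorFlow 1.x\nand running from the Chapter06 directory.\n'''\nfrom REINFORCE import REINFORCE\n\nif __name__ == '__main__':\n    REINFORCE('CartPole-v1', hidden_sizes=[32], lr=5e-3, gamma=0.99, num_epochs=100, steps_per_epoch=500)\n"
  },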
  {
    "path": "Chapter06/REINFORCE_baseline.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nimport time\n\n\ndef mlp(x, hidden_layers, output_size, activation=tf.nn.relu, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_size, activation=last_activation)\n\ndef softmax_entropy(logits):\n    '''\n    Softmax Entropy\n    '''\n    return tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)\n\n\ndef discounted_rewards(rews, gamma):\n    '''\n    Discounted reward to go \n\n    Parameters:\n    ----------\n    rews: list of rewards\n    gamma: discount value \n    '''\n    rtg = np.zeros_like(rews, dtype=np.float32)\n    rtg[-1] = rews[-1]\n    for i in reversed(range(len(rews)-1)):\n        rtg[i] = rews[i] + gamma*rtg[i+1]\n    return rtg\n\nclass Buffer():\n    '''\n    Buffer class to store the experience from a unique policy\n    '''\n    def __init__(self, gamma=0.99):\n        self.gamma = gamma\n        self.obs = []\n        self.act = []\n        self.ret = []\n        self.rtg = []\n\n    def store(self, temp_traj):\n        '''\n        Add temp_traj values to the buffers and compute the advantage and reward to go\n\n        Parameters:\n        -----------\n        temp_traj: list where each element is a list that contains: observation, reward, action, state-value\n        '''\n        # store only if the temp_traj list is not empty\n        if len(temp_traj) > 0:\n            self.obs.extend(temp_traj[:,0])\n            rtg = discounted_rewards(temp_traj[:,1], self.gamma)\n            # NEW\n            self.ret.extend(rtg - temp_traj[:,3])\n            self.rtg.extend(rtg)\n            self.act.extend(temp_traj[:,2])\n\n    def get_batch(self):\n        # MODIFIED\n        return self.obs, self.act, self.ret, self.rtg\n\n    def __len__(self):\n        assert(len(self.obs) == len(self.act) == len(self.ret) == len(self.rtg))\n        return len(self.obs)\n\n\ndef REINFORCE_baseline(env_name, hidden_sizes=[32], p_lr=5e-3, vf_lr=8e-3, gamma=0.99, steps_per_epoch=100, num_epochs=1000):\n    '''\n    REINFORCE with baseline Algorithm\n\n    Parameters:\n    -----------\n    env_name: Name of the environment\n    hidden_size: list of the number of hidden units for each layer\n    p_lr: policy learning rate\n    vf_lr: value function learning rate\n    gamma: discount factor\n    steps_per_epoch: number of steps per epoch\n    num_epochs: number train epochs (Note: they aren't properly epochs)\n    '''\n    tf.reset_default_graph()\n\n    env = gym.make(env_name)    \n    \n    obs_dim = env.observation_space.shape\n    act_dim = env.action_space.n \n\n    # Placeholders\n    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')\n    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')\n    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')\n    rtg_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='rtg')\n\n    #####################################################\n    ########### COMPUTE THE PG LOSS FUNCTIONS ###########\n    #####################################################\n\n    # policy\n    p_logits = mlp(obs_ph, hidden_sizes, act_dim, activation=tf.tanh)\n\n    act_multn = tf.squeeze(tf.random.multinomial(p_logits, 1))\n    actions_mask = tf.one_hot(act_ph, depth=act_dim)\n    p_log = 
tf.reduce_sum(actions_mask * tf.nn.log_softmax(p_logits), axis=1)\n    # entropy, useful for monitoring the algorithm\n    entropy = -tf.reduce_mean(softmax_entropy(p_logits))\n    p_loss = -tf.reduce_mean(p_log*ret_ph)\n\n    # policy optimization\n    p_opt = tf.train.AdamOptimizer(p_lr).minimize(p_loss)\n\n    #######################################\n    ###########  VALUE FUNCTION ###########\n    #######################################\n    \n    ########### NEW ###########\n    # value function\n    s_values = tf.squeeze(mlp(obs_ph, hidden_sizes, 1, activation=tf.tanh))\n\n    # MSE loss function\n    v_loss = tf.reduce_mean((rtg_ph - s_values)**2)\n\n    # value function optimization\n    v_opt = tf.train.AdamOptimizer(vf_lr).minimize(v_loss)\n\n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, now.second)\n    print('Time:', clock_time)\n\n\n    # Set scalars and histograms for TensorBoard\n    tf.summary.scalar('p_loss', p_loss, collections=['train'])\n    tf.summary.scalar('v_loss', v_loss, collections=['train'])\n    tf.summary.scalar('entropy', entropy, collections=['train'])\n    tf.summary.scalar('s_values', tf.reduce_mean(s_values), collections=['train'])\n    tf.summary.histogram('p_soft', tf.nn.softmax(p_logits), collections=['train'])\n    tf.summary.histogram('p_log', p_log, collections=['train'])\n    tf.summary.histogram('act_multn', act_multn, collections=['train'])\n    tf.summary.histogram('p_logits', p_logits, collections=['train'])\n    tf.summary.histogram('ret_ph', ret_ph, collections=['train'])\n    tf.summary.histogram('rtg_ph', rtg_ph, collections=['train'])\n    tf.summary.histogram('s_values', s_values, collections=['train'])\n    train_summary = tf.summary.merge_all('train')\n\n    tf.summary.scalar('old_v_loss', v_loss, collections=['pre_train'])\n    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])\n    pre_scalar_summary = tf.summary.merge_all('pre_train')\n\n    hyp_str = '-steps_{}-plr_{}-vflr_{}'.format(steps_per_epoch, p_lr, vf_lr)\n    file_writer = tf.summary.FileWriter('log_dir/{}/REINFORCE_basel_{}_{}'.format(env_name, clock_time, hyp_str), tf.get_default_graph())\n    \n    # create a session\n    sess = tf.Session()\n    # initialize the variables\n    sess.run(tf.global_variables_initializer())\n\n    # a few variables\n    step_count = 0\n    train_rewards = []\n    train_ep_len = []\n    timer = time.time()\n\n    # main cycle\n    for ep in range(num_epochs):\n\n        # initialize the environment for the new epoch\n        obs = env.reset()\n\n        # initialize the buffer and other variables for the new epoch\n        buffer = Buffer(gamma)\n        env_buf = []\n        ep_rews = []\n        \n        while len(buffer) < steps_per_epoch:\n\n            # run the policy\n            act, val = sess.run([act_multn, s_values], feed_dict={obs_ph:[obs]})\n            # take a step in the environment\n            obs2, rew, done, _ = env.step(np.squeeze(act))\n\n            # add the new transition\n            env_buf.append([obs.copy(), rew, act, np.squeeze(val)])\n\n            obs = obs2.copy()\n\n            step_count += 1\n            ep_rews.append(rew)\n\n            if done:\n                # store the trajectory just completed\n                buffer.store(np.array(env_buf))\n                env_buf = []\n                # store additional information about the episode\n                train_rewards.append(np.sum(ep_rews))\n                
train_ep_len.append(len(ep_rews))\n                # reset the environment\n                obs = env.reset()\n                ep_rews = []\n\n        # collect the episodes' information\n        obs_batch, act_batch, ret_batch, rtg_batch = buffer.get_batch()\n        \n        # run pre_scalar_summary before the optimization phase\n        epochs_summary = sess.run(pre_scalar_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})\n        file_writer.add_summary(epochs_summary, step_count)\n\n        # Optimize the NN policy and the NN value function\n        sess.run([p_opt, v_opt], feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})\n\n        # run train_summary to save the summary after the optimization\n        train_summary_run = sess.run(train_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})\n        file_writer.add_summary(train_summary_run, step_count)\n\n        # it's time to print some useful information\n        if ep % 10 == 0:\n            print('Ep:%d MnRew:%.2f MxRew:%.1f EpLen:%.1f Buffer:%d -- Step:%d -- Time:%d' % (ep, np.mean(train_rewards), np.max(train_rewards), np.mean(train_ep_len), len(buffer), step_count,time.time()-timer))\n\n            summary = tf.Summary()\n            summary.value.add(tag='supplementary/len', simple_value=np.mean(train_ep_len))\n            summary.value.add(tag='supplementary/train_rew', simple_value=np.mean(train_rewards))\n            file_writer.add_summary(summary, step_count)\n            file_writer.flush()\n\n            timer = time.time()\n            train_rewards = []\n            train_ep_len = []\n\n\n    env.close()\n    file_writer.close()\n\n\nif __name__ == '__main__':\n    REINFORCE_baseline('LunarLander-v2', hidden_sizes=[64], p_lr=8e-3, vf_lr=7e-3, gamma=0.99, steps_per_epoch=1000, num_epochs=1000)"
  },
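  {
    "path": "Chapter06/REINFORCE_baseline_example.py",
    "content": "'''\nNumeric sketch, NOT part of the original repository, of the baseline\ncorrection performed by Buffer.store in REINFORCE_baseline.py: the policy is\ntrained on the advantage rtg - V(s), while the value network is regressed on\nthe raw reward to go rtg. Pure NumPy (the helper below mirrors the chapter's\ndiscounted_rewards), so no TensorFlow is needed.\n'''\nimport numpy as np\n\ndef discounted_rewards(rews, gamma):\n    # same Monte Carlo reward to go as in the chapter code\n    rtg = np.zeros_like(rews, dtype=np.float32)\n    rtg[-1] = rews[-1]\n    for i in reversed(range(len(rews)-1)):\n        rtg[i] = rews[i] + gamma*rtg[i+1]\n    return rtg\n\nif __name__ == '__main__':\n    rews = np.array([1.0, 1.0, 1.0])\n    values = np.array([2.0, 1.5, 0.5])    # V(s) as predicted by the critic\n    rtg = discounted_rewards(rews, 0.99)\n    print('reward to go:', rtg)           # regression target for the critic\n    print('advantage   :', rtg - values)  # weight for the policy gradient\n"
  },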
  {
    "path": "Chapter07/PPO.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nimport time\nimport roboschool\n\ndef mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_layer, activation=last_activation)\n\ndef softmax_entropy(logits):\n    '''\n    Softmax Entropy\n    '''\n    return -tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)\n\ndef clipped_surrogate_obj(new_p, old_p, adv, eps):\n    '''\n    Clipped surrogate objective function\n    '''\n    rt = tf.exp(new_p - old_p) # i.e. pi / old_pi\n    return -tf.reduce_mean(tf.minimum(rt*adv, tf.clip_by_value(rt, 1-eps, 1+eps)*adv))\n\ndef GAE(rews, v, v_last, gamma=0.99, lam=0.95):\n    '''\n    Generalized Advantage Estimation\n    '''\n    assert len(rews) == len(v)\n    vs = np.append(v, v_last)\n    delta = np.array(rews) + gamma*vs[1:] - vs[:-1]\n    gae_advantage = discounted_rewards(delta, 0, gamma*lam)\n    return gae_advantage\n\ndef discounted_rewards(rews, last_sv, gamma):\n    '''\n    Discounted reward to go \n\n    Parameters:\n    ----------\n    rews: list of rewards\n    last_sv: value of the last state\n    gamma: discount value \n    '''\n    rtg = np.zeros_like(rews, dtype=np.float32)\n    rtg[-1] = rews[-1] + gamma*last_sv\n    for i in reversed(range(len(rews)-1)):\n        rtg[i] = rews[i] + gamma*rtg[i+1]\n    return rtg\n\n\nclass StructEnv(gym.Wrapper):\n    '''\n    Gym Wrapper to store information like number of steps and total reward of the last espisode.\n    '''\n    def __init__(self, env):\n        gym.Wrapper.__init__(self, env)\n        self.n_obs = self.env.reset()\n        self.rew_episode = 0\n        self.len_episode = 0\n\n    def reset(self, **kwargs):\n        self.n_obs = self.env.reset(**kwargs)\n        self.rew_episode = 0\n        self.len_episode = 0\n        return self.n_obs.copy()\n        \n    def step(self, action):\n        ob, reward, done, info = self.env.step(action)\n        self.rew_episode += reward\n        self.len_episode += 1\n        return ob, reward, done, info\n\n    def get_episode_reward(self):\n        return self.rew_episode\n\n    def get_episode_length(self):\n        return self.len_episode\n\nclass Buffer():\n    '''\n    Class to store the experience from a unique policy\n    '''\n    def __init__(self, gamma=0.99, lam=0.95):\n        self.gamma = gamma\n        self.lam = lam\n        self.adv = []\n        self.ob = []\n        self.ac = []\n        self.rtg = []\n\n    def store(self, temp_traj, last_sv):\n        '''\n        Add temp_traj values to the buffers and compute the advantage and reward to go\n\n        Parameters:\n        -----------\n        temp_traj: list where each element is a list that contains: observation, reward, action, state-value\n        last_sv: value of the last state (Used to Bootstrap)\n        '''\n        # store only if there are temporary trajectories\n        if len(temp_traj) > 0:\n            self.ob.extend(temp_traj[:,0])\n            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)\n            self.adv.extend(GAE(temp_traj[:,1], temp_traj[:,3], last_sv, self.gamma, self.lam))\n            self.rtg.extend(rtg)\n            self.ac.extend(temp_traj[:,2])\n\n    def get_batch(self):\n        # standardize the advantage values\n        
norm_adv = (self.adv - np.mean(self.adv)) / (np.std(self.adv) + 1e-10)\n        return np.array(self.ob), np.array(self.ac), np.array(norm_adv), np.array(self.rtg)\n\n    def __len__(self):\n        assert(len(self.adv) == len(self.ob) == len(self.ac) == len(self.rtg))\n        return len(self.ob)\n    \ndef gaussian_log_likelihood(x, mean, log_std):\n    '''\n    Gaussian Log Likelihood \n    '''\n    log_p = -0.5 *((x-mean)**2 / (tf.exp(log_std)**2+1e-9) + 2*log_std + np.log(2*np.pi))\n    return tf.reduce_sum(log_p, axis=-1)\n\ndef PPO(env_name, hidden_sizes=[32], cr_lr=5e-3, ac_lr=5e-3, num_epochs=50, minibatch_size=5000, gamma=0.99, lam=0.95, number_envs=1, eps=0.1, \n        actor_iter=5, critic_iter=10, steps_per_env=100, action_type='Discrete'):\n    '''\n    Proximal Policy Optimization\n\n    Parameters:\n    -----------\n    env_name: Name of the environment\n    hidden_sizes: list of the number of hidden units for each layer\n    ac_lr: actor learning rate\n    cr_lr: critic learning rate\n    num_epochs: number of training epochs\n    minibatch_size: Batch size used to train the critic and actor\n    gamma: discount factor\n    lam: lambda parameter for computing the GAE\n    number_envs: number of parallel synchronous environments\n        # NB: it isn't distributed across multiple CPUs\n    eps: Clip threshold. Max deviation from previous policy.\n    actor_iter: Number of SGD iterations on the actor per epoch\n    critic_iter: Number of SGD iterations on the critic per epoch\n    steps_per_env: number of steps per environment\n            # NB: the total number of steps per epoch will be: steps_per_env*number_envs\n    action_type: class name of the action space: either 'Discrete' or 'Box'\n    '''\n\n    tf.reset_default_graph()\n\n    # Create some environments to collect the trajectories\n    envs = [StructEnv(gym.make(env_name)) for _ in range(number_envs)]\n    \n    obs_dim = envs[0].observation_space.shape\n\n    # Placeholders\n    if action_type == 'Discrete':\n        act_dim = envs[0].action_space.n \n        act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')\n\n    elif action_type == 'Box':\n        low_action_space = envs[0].action_space.low\n        high_action_space = envs[0].action_space.high\n        act_dim = envs[0].action_space.shape[0]\n        act_ph = tf.placeholder(shape=(None,act_dim), dtype=tf.float32, name='act')\n\n    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')\n    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')\n    adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='adv')\n    old_p_log_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_p_log')\n\n    # Computational graph for the policy in case of a discrete action space\n    if action_type == 'Discrete':\n        with tf.variable_scope('actor_nn'):\n            p_logits = mlp(obs_ph, hidden_sizes, act_dim, tf.nn.relu, last_activation=tf.tanh)\n\n        act_smp = tf.squeeze(tf.random.multinomial(p_logits, 1))\n        act_onehot = tf.one_hot(act_ph, depth=act_dim)\n        p_log = tf.reduce_sum(act_onehot * tf.nn.log_softmax(p_logits), axis=-1)\n        \n    # Computational graph for the policy in case of a continuous action space\n    else:\n        with tf.variable_scope('actor_nn'):\n            p_logits = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)\n            log_std = tf.get_variable(name='log_std', initializer=np.zeros(act_dim, dtype=np.float32)-0.5)\n       
 \n        # Add noise to the mean values predicted\n        # The noise is proportional to the standard deviation\n        p_noisy = p_logits + tf.random_normal(tf.shape(p_logits), 0, 1) * tf.exp(log_std)\n        # Clip the noisy actions\n        act_smp = tf.clip_by_value(p_noisy, low_action_space, high_action_space)\n        # Compute the Gaussian log likelihood\n        p_log = gaussian_log_likelihood(act_ph, p_logits, log_std)\n\n    # Neural network value function approximator\n    with tf.variable_scope('critic_nn'):\n        s_values = mlp(obs_ph, hidden_sizes, 1, tf.tanh, last_activation=None)\n        s_values = tf.squeeze(s_values)\n            \n    # PPO loss function\n    p_loss = clipped_surrogate_obj(p_log, old_p_log_ph, adv_ph, eps)\n    # MSE loss function\n    v_loss = tf.reduce_mean((ret_ph - s_values)**2)\n\n    # policy optimizer\n    p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss)\n    # value function optimizer\n    v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)\n\n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, now.second)\n    print('Time:', clock_time)\n\n    # Set scalars and histograms for TensorBoard\n    tf.summary.scalar('p_loss', p_loss, collections=['train'])\n    tf.summary.scalar('v_loss', v_loss, collections=['train'])\n    tf.summary.scalar('s_values_m', tf.reduce_mean(s_values), collections=['train'])\n\n    if action_type == 'Box':\n        tf.summary.scalar('p_std', tf.reduce_mean(tf.exp(log_std)), collections=['train'])\n        tf.summary.histogram('log_std',log_std, collections=['train'])\n    tf.summary.histogram('p_log', p_log, collections=['train'])\n    tf.summary.histogram('p_logits', p_logits, collections=['train'])\n    tf.summary.histogram('s_values', s_values, collections=['train'])\n    tf.summary.histogram('adv_ph',adv_ph, collections=['train'])\n    scalar_summary = tf.summary.merge_all('train')\n\n    # .. summary to run before the optimization steps\n    tf.summary.scalar('old_v_loss', v_loss, collections=['pre_train'])\n    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])\n    pre_scalar_summary = tf.summary.merge_all('pre_train')\n\n    hyp_str = '-bs_'+str(minibatch_size)+'-envs_'+str(number_envs)+'-ac_lr_'+str(ac_lr)+'-cr_lr'+str(cr_lr)+'-act_it_'+str(actor_iter)+'-crit_it_'+str(critic_iter)\n    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/PPO_'+clock_time+'_'+hyp_str, tf.get_default_graph())\n    \n    # create a session\n    sess = tf.Session()\n    # initialize the variables\n    sess.run(tf.global_variables_initializer())\n    \n    # variable to store the total number of steps\n    step_count = 0\n    \n    print('Env batch size:',steps_per_env, ' Batch size:',steps_per_env*number_envs)\n\n    for ep in range(num_epochs):\n        # Create the buffer that will contain the trajectories (full or partial) \n        # run with the last policy\n        buffer = Buffer(gamma, lam)\n        # lists to store rewards and length of the trajectories completed\n        batch_rew = []\n        batch_len = []\n\n        # Execute the environments serially, temporarily storing the trajectories. 
\n        for env in envs:\n            temp_buf = []\n\n            #iterate over a fixed number of steps\n            for _ in range(steps_per_env):\n\n                # run the policy\n                act, val = sess.run([act_smp, s_values], feed_dict={obs_ph:[env.n_obs]})\n                act = np.squeeze(act)\n\n                # take a step in the environment\n                obs2, rew, done, _ = env.step(act)\n                \n                # add the new transition to the temporary buffer\n                temp_buf.append([env.n_obs.copy(), rew, act, np.squeeze(val)])\n\n                env.n_obs = obs2.copy()\n                step_count += 1\n\n                if done:\n                    # Store the full trajectory in the buffer \n                    # (the value of the last state is 0 as the trajectory is completed)\n                    buffer.store(np.array(temp_buf), 0)\n\n                    # Empty temporary buffer\n                    temp_buf = []\n                    \n                    batch_rew.append(env.get_episode_reward())\n                    batch_len.append(env.get_episode_length())\n                    \n                    # reset the environment\n                    env.reset()                 \n\n            # Bootstrap with the estimated state value of the next state!\n            last_v = sess.run(s_values, feed_dict={obs_ph:[env.n_obs]})\n            buffer.store(np.array(temp_buf), np.squeeze(last_v))\n\n\n        # Gather the entire batch from the buffer\n        # NB: all the batch is used and deleted after the optimization. That is because PPO is on-policy\n        obs_batch, act_batch, adv_batch, rtg_batch = buffer.get_batch()\n\n        old_p_log = sess.run(p_log, feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})\n        old_p_batch = np.array(old_p_log)\n\n        summary = sess.run(pre_scalar_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_batch})\n        file_writer.add_summary(summary, step_count)\n\n        lb = len(buffer)\n        shuffled_batch = np.arange(lb)    \n        \n        # Policy optimization steps\n        for _ in range(actor_iter):\n            # shuffle the batch on every iteration\n            np.random.shuffle(shuffled_batch)\n            for idx in range(0,lb, minibatch_size):\n                minib = shuffled_batch[idx:min(idx+minibatch_size,lb)]\n                sess.run(p_opt, feed_dict={obs_ph:obs_batch[minib], act_ph:act_batch[minib], adv_ph:adv_batch[minib], old_p_log_ph:old_p_batch[minib]})\n\n        # Value function optimization steps\n        for _ in range(critic_iter):\n            # shuffle the batch on every iteration\n            np.random.shuffle(shuffled_batch)\n            for idx in range(0,lb, minibatch_size):\n                minib = shuffled_batch[idx:min(idx+minibatch_size,lb)]\n                sess.run(v_opt, feed_dict={obs_ph:obs_batch[minib], ret_ph:rtg_batch[minib]})\n                \n\n        # print some statistics and run the summary for visualizing it on TB\n        if len(batch_rew) > 0:           \n            train_summary = sess.run(scalar_summary, feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, \n                                                                old_p_log_ph:old_p_batch, ret_ph:rtg_batch})\n            file_writer.add_summary(train_summary, step_count)\n\n            summary = tf.Summary()\n            summary.value.add(tag='supplementary/performance', 
simple_value=np.mean(batch_rew))\n            summary.value.add(tag='supplementary/len', simple_value=np.mean(batch_len))\n            file_writer.add_summary(summary, step_count)\n            file_writer.flush()\n        \n            print('Ep:%d Rew:%.2f -- Step:%d' % (ep, np.mean(batch_rew), step_count))\n\n    # closing environments..\n    for env in envs:\n        env.close()\n\n    # Close the writer\n    file_writer.close()\n\n\nif __name__ == '__main__':\n    PPO('RoboschoolWalker2d-v1', hidden_sizes=[64,64], cr_lr=5e-4, ac_lr=2e-4, gamma=0.99, lam=0.95, steps_per_env=5000, \n        number_envs=1, eps=0.15, actor_iter=6, critic_iter=10, action_type='Box', num_epochs=5000, minibatch_size=256)\n      "
  },
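  {
    "path": "Chapter07/PPO_GAE_example.py",
    "content": "'''\nNumeric sketch, NOT part of the original repository, of Generalized Advantage\nEstimation as implemented by GAE in PPO.py: the TD errors\ndelta_t = r_t + gamma*V(s_{t+1}) - V(s_t) are themselves discounted with\ngamma*lam (with lam=1 this reduces to the bootstrapped reward to go minus\nV(s_t)). Importing PPO requires TensorFlow 1.x and roboschool, since PPO.py\nimports both at module level.\n'''\nimport numpy as np\nfrom PPO import GAE\n\nif __name__ == '__main__':\n    rews = [1.0, 1.0, 1.0]\n    v = np.array([0.5, 0.4, 0.3])   # state values V(s_t) along the trajectory\n    v_last = 0.2                    # bootstrap value of the state after the last step\n    print(GAE(rews, v, v_last, gamma=0.99, lam=0.95))\n"
  },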
  {
    "path": "Chapter07/TRPO.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nimport roboschool\n\ndef mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_layer, activation=last_activation)\n\ndef softmax_entropy(logits):\n    '''\n    Softmax Entropy\n    '''\n    return -tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)\n\n\ndef gaussian_log_likelihood(ac, mean, log_std):\n    '''\n    Gaussian Log Likelihood \n    '''\n    log_p = ((ac-mean)**2 / (tf.exp(log_std)**2+1e-9) + 2*log_std) + np.log(2*np.pi)\n    return -0.5 * tf.reduce_sum(log_p, axis=-1)\n\n\ndef conjugate_gradient(A, b, x=None, iters=10):\n    '''\n    Conjugate gradient method: approximate the solution of Ax=b\n    It solve Ax=b without forming the full matrix, just compute the matrix-vector product (The Fisher-vector product)\n    \n    NB: A is not the full matrix but is a useful matrix-vector product between the averaged Fisher information matrix and arbitrary vectors \n    Descibed in Appendix C.1 of the TRPO paper\n    '''\n    if x is None:\n        x = np.zeros_like(b)\n        \n    r = A(x) - b\n    p = -r\n    for _ in range(iters):\n        a = np.dot(r, r) / (np.dot(p, A(p))+1e-8)\n        x += a*p\n        r_n = r + a*A(p)\n        b = np.dot(r_n, r_n) / (np.dot(r, r)+1e-8)\n        p = -r_n + b*p\n        r = r_n\n    return x\n\ndef gaussian_DKL(mu_q, log_std_q, mu_p, log_std_p):\n    '''\n    Gaussian KL divergence in case of a diagonal covariance matrix\n    '''\n    return tf.reduce_mean(tf.reduce_sum(0.5 * (log_std_p - log_std_q + tf.exp(log_std_q - log_std_p) + (mu_q - mu_p)**2 / tf.exp(log_std_p) - 1), axis=1))\n\n\ndef backtracking_line_search(Dkl, delta, old_loss, p=0.8):\n    '''\n    Backtracking line searc. It look for a coefficient s.t. 
the constraint on the DKL is satisfied\n    It has to both\n     - improve the non-linear objective\n     - satisfy the constraint\n\n    '''\n    ## Explained in Appendix C of the TRPO paper\n    a = 1\n    it = 0\n \n    new_dkl, new_loss = Dkl(a) \n    while (new_dkl > delta) or (new_loss > old_loss):\n        a *= p\n        it += 1\n        new_dkl, new_loss = Dkl(a)\n\n    return a\n\n\n\ndef GAE(rews, v, v_last, gamma=0.99, lam=0.95):\n    '''\n    Generalized Advantage Estimation\n    '''\n    assert len(rews) == len(v)\n    vs = np.append(v, v_last)\n    d = np.array(rews) + gamma*vs[1:] - vs[:-1]\n    gae_advantage = discounted_rewards(d, 0, gamma*lam)\n    return gae_advantage\n\ndef discounted_rewards(rews, last_sv, gamma):\n    '''\n    Discounted reward to go \n\n    Parameters:\n    ----------\n    rews: list of rewards\n    last_sv: value of the last state\n    gamma: discount value \n    '''\n    rtg = np.zeros_like(rews, dtype=np.float32)\n    rtg[-1] = rews[-1] + gamma*last_sv\n    for i in reversed(range(len(rews)-1)):\n        rtg[i] = rews[i] + gamma*rtg[i+1]\n    return rtg\n\nclass Buffer():\n    '''\n    Class to store the experience from a single policy\n    '''\n    def __init__(self, gamma=0.99, lam=0.95):\n        self.gamma = gamma\n        self.lam = lam\n        self.adv = []\n        self.ob = []\n        self.ac = []\n        self.rtg = []\n\n    def store(self, temp_traj, last_sv):\n        '''\n        Add temp_traj values to the buffers and compute the advantage and reward to go\n\n        Parameters:\n        -----------\n        temp_traj: list where each element is a list that contains: observation, reward, action, state-value\n        last_sv: value of the last state (Used to Bootstrap)\n        '''\n        # store only if there are temporary trajectories\n        if len(temp_traj) > 0:\n            self.ob.extend(temp_traj[:,0])\n            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)\n            self.adv.extend(GAE(temp_traj[:,1], temp_traj[:,3], last_sv, self.gamma, self.lam))\n            self.rtg.extend(rtg)\n            self.ac.extend(temp_traj[:,2])\n\n    def get_batch(self):\n        # standardize the advantage values\n        norm_adv = (self.adv - np.mean(self.adv)) / (np.std(self.adv) + 1e-10)\n        return np.array(self.ob), np.array(self.ac), np.array(norm_adv), np.array(self.rtg)\n\n    def __len__(self):\n        assert(len(self.adv) == len(self.ob) == len(self.ac) == len(self.rtg))\n        return len(self.ob)\n\ndef flatten_list(tensor_list):\n    '''\n    Flatten a list of tensors\n    '''\n    return tf.concat([flatten(t) for t in tensor_list], axis=0)\n\ndef flatten(tensor):\n    '''\n    Flatten a tensor\n    '''\n    return tf.reshape(tensor, shape=(-1,))\n\n\nclass StructEnv(gym.Wrapper):\n    '''\n    Gym Wrapper to store information like number of steps and total reward of the last episode.\n    '''\n    def __init__(self, env):\n        gym.Wrapper.__init__(self, env)\n        self.n_obs = self.env.reset()\n        self.total_rew = 0\n        self.len_episode = 0\n\n    def reset(self, **kwargs):\n        self.n_obs = self.env.reset(**kwargs)\n        self.total_rew = 0\n        self.len_episode = 0\n        return self.n_obs.copy()\n        \n    def step(self, action):\n        ob, reward, done, info = self.env.step(action)\n        self.total_rew += reward\n        self.len_episode += 1\n        return ob, reward, done, info\n\n    def get_episode_reward(self):\n        return self.total_rew\n\n   
 def get_episode_length(self):\n        return self.len_episode\n\n\ndef TRPO(env_name, hidden_sizes=[32], cr_lr=5e-3, num_epochs=50, gamma=0.99, lam=0.95, number_envs=1, \n        critic_iter=10, steps_per_env=100, delta=0.002, algorithm='TRPO', conj_iters=10, minibatch_size=1000):\n    '''\n    Trust Region Policy Optimization\n\n    Parameters:\n    -----------\n    env_name: Name of the environment\n    hidden_sizes: list of the number of hidden units for each layer\n    cr_lr: critic learning rate\n    num_epochs: number of training epochs\n    gamma: discount factor\n    lam: lambda parameter for computing the GAE\n    number_envs: number of \"parallel\" synchronous environments\n        # NB: it isn't distributed across multiple CPUs\n    critic_iter: Number of SGD iterations on the critic per epoch\n    steps_per_env: number of steps per environment\n            # NB: the total number of steps per epoch will be: steps_per_env*number_envs\n    delta: Maximum KL divergence between two policies. Scalar value\n    algorithm: type of algorithm. Either 'TRPO' or 'NPO'\n    conj_iters: number of conjugate gradient iterations\n    minibatch_size: Batch size used to train the critic\n    '''\n\n    tf.reset_default_graph()\n\n    # Create a few environments to collect the trajectories\n    envs = [StructEnv(gym.make(env_name)) for _ in range(number_envs)]\n\n    low_action_space = envs[0].action_space.low\n    high_action_space = envs[0].action_space.high\n\n    obs_dim = envs[0].observation_space.shape\n    act_dim = envs[0].action_space.shape[0]\n\n    # Placeholders\n    act_ph = tf.placeholder(shape=(None,act_dim), dtype=tf.float32, name='act')\n    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')\n    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')\n    adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='adv')\n    old_p_log_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_p_log')\n    old_mu_ph = tf.placeholder(shape=(None, act_dim), dtype=tf.float32, name='old_mu')\n    old_log_std_ph = tf.placeholder(shape=(act_dim), dtype=tf.float32, name='old_log_std')\n    p_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='p_ph')\n    # result of the conjugate gradient algorithm\n    cg_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='cg')\n        \n    # Neural network that represents the policy\n    with tf.variable_scope('actor_nn'):\n        p_means = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)\n        log_std = tf.get_variable(name='log_std', initializer=np.zeros(act_dim, dtype=np.float32) - 0.5)\n\n    # Neural network that represents the value function\n    with tf.variable_scope('critic_nn'):\n        s_values = mlp(obs_ph, hidden_sizes, 1, tf.tanh, last_activation=None)\n        s_values = tf.squeeze(s_values)    \n\n    # Add \"noise\" to the predicted mean following the Gaussian distribution with standard deviation e^(log_std)\n    p_noisy = p_means + tf.random_normal(tf.shape(p_means), 0, 1) * tf.exp(log_std)\n    # Clip the noisy actions\n    a_sampl = tf.clip_by_value(p_noisy, low_action_space, high_action_space)\n    # Compute the Gaussian log likelihood\n    p_log = gaussian_log_likelihood(act_ph, p_means, log_std)\n\n    # Measure the divergence\n    diverg = tf.reduce_mean(tf.exp(old_p_log_ph - p_log))\n    \n    # ratio\n    ratio_new_old = tf.exp(p_log - old_p_log_ph)\n    # TRPO surrogate loss function\n    p_loss = - tf.reduce_mean(ratio_new_old * 
adv_ph)\n\n    # MSE loss function\n    v_loss = tf.reduce_mean((ret_ph - s_values)**2)\n    # Critic optimization\n    v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)\n\n    def variables_in_scope(scope):\n        # get all trainable variables in 'scope'\n        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)\n    \n    # Gather and flatten the actor parameters\n    p_variables = variables_in_scope('actor_nn')\n    p_var_flatten = flatten_list(p_variables)\n\n    # Gradient of the policy loss with respect to the actor parameters\n    p_grads = tf.gradients(p_loss, p_variables)\n    p_grads_flatten = flatten_list(p_grads)\n\n    ########### RESTORE ACTOR PARAMETERS ###########\n    p_old_variables = tf.placeholder(shape=(None,), dtype=tf.float32, name='p_old_variables')\n    # variable used as index for restoring the actor's parameters\n    it_v1 = tf.Variable(0, trainable=False)\n    restore_params = []\n\n    for p_v in p_variables:\n        upd_rsh = tf.reshape(p_old_variables[it_v1 : it_v1+tf.reduce_prod(p_v.shape)], shape=p_v.shape)\n        restore_params.append(p_v.assign(upd_rsh)) \n        it_v1 += tf.reduce_prod(p_v.shape)\n\n    restore_params = tf.group(*restore_params)\n\n    # gaussian KL divergence of the two policies \n    dkl_diverg = gaussian_DKL(old_mu_ph, old_log_std_ph, p_means, log_std) \n\n    # Jacobian of the KL divergence (Needed for the Fisher matrix-vector product)\n    dkl_diverg_grad = tf.gradients(dkl_diverg, p_variables) \n\n    dkl_matrix_product = tf.reduce_sum(flatten_list(dkl_diverg_grad) * p_ph)\n    print('dkl_matrix_product', dkl_matrix_product.shape)\n    # Fisher vector product\n    # The Fisher-vector product computes the product of the matrix A with a vector, without ever forming the full A\n    Fx = flatten_list(tf.gradients(dkl_matrix_product, p_variables))\n\n    ## Step length\n    beta_ph = tf.placeholder(shape=(), dtype=tf.float32, name='beta')\n    # NPG update\n    npg_update = beta_ph * cg_ph\n    \n    ## alpha is found through line search\n    alpha = tf.Variable(1., trainable=False)\n    # TRPO update\n    trpo_update = alpha * npg_update\n\n    ####################   POLICY UPDATE  ###################\n    # variable used as an index\n    it_v = tf.Variable(0, trainable=False)\n    p_opt = []\n    # Apply the updates to the policy\n    for p_v in p_variables:\n        upd_rsh = tf.reshape(trpo_update[it_v : it_v+tf.reduce_prod(p_v.shape)], shape=p_v.shape)\n        p_opt.append(p_v.assign_sub(upd_rsh))\n        it_v += tf.reduce_prod(p_v.shape)\n\n    p_opt = tf.group(*p_opt)\n        \n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, now.second)\n    print('Time:', clock_time)\n\n\n    # Set scalars and histograms for TensorBoard\n    tf.summary.scalar('p_loss', p_loss, collections=['train'])\n    tf.summary.scalar('v_loss', v_loss, collections=['train'])\n    tf.summary.scalar('p_divergence', diverg, collections=['train'])\n    tf.summary.scalar('ratio_new_old',tf.reduce_mean(ratio_new_old), collections=['train'])\n    tf.summary.scalar('dkl_diverg', dkl_diverg, collections=['train'])\n    tf.summary.scalar('alpha', alpha, collections=['train'])\n    tf.summary.scalar('beta', beta_ph, collections=['train'])\n    tf.summary.scalar('p_std_mn', tf.reduce_mean(tf.exp(log_std)), collections=['train'])\n    tf.summary.scalar('s_values_mn', tf.reduce_mean(s_values), collections=['train'])\n    tf.summary.histogram('p_log', p_log, collections=['train'])\n    
tf.summary.histogram('p_means', p_means, collections=['train'])\n    tf.summary.histogram('s_values', s_values, collections=['train'])\n    tf.summary.histogram('adv_ph',adv_ph, collections=['train'])\n    tf.summary.histogram('log_std',log_std, collections=['train'])\n    scalar_summary = tf.summary.merge_all('train')\n\n    tf.summary.scalar('old_v_loss', v_loss, collections=['pre_train'])\n    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])\n    pre_scalar_summary = tf.summary.merge_all('pre_train')\n\n    hyp_str = '-spe_'+str(steps_per_env)+'-envs_'+str(number_envs)+'-cr_lr'+str(cr_lr)+'-crit_it_'+str(critic_iter)+'-delta_'+str(delta)+'-conj_iters_'+str(conj_iters)\n    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/'+algorithm+'_'+clock_time+'_'+hyp_str, tf.get_default_graph())\n    \n    # create a session\n    sess = tf.Session()\n    # initialize the variables\n    sess.run(tf.global_variables_initializer())\n    \n    # variable to store the total number of steps\n    step_count = 0\n    \n    print('Env batch size:',steps_per_env, ' Batch size:',steps_per_env*number_envs)\n\n    for ep in range(num_epochs):\n        # Create the buffer that will contain the trajectories (full or partial) \n        # run with the last policy\n        buffer = Buffer(gamma, lam)\n        # lists to store rewards and length of the trajectories completed\n        batch_rew = []\n        batch_len = []\n\n        # Run the environments serially, temporarily storing the trajectories.\n        for env in envs:\n            temp_buf = []\n\n            # iterate over a fixed number of steps\n            for _ in range(steps_per_env):\n                # run the policy\n                act, val = sess.run([a_sampl, s_values], feed_dict={obs_ph:[env.n_obs]})\n                act = np.squeeze(act)\n\n                # take a step in the environment\n                obs2, rew, done, _ = env.step(act)\n\n                # add the new transition to the temporary buffer\n                temp_buf.append([env.n_obs.copy(), rew, act, np.squeeze(val)])\n\n                env.n_obs = obs2.copy()\n                step_count += 1\n\n                if done:\n                    # Store the full trajectory in the buffer \n                    # (the value of the last state is 0 as the trajectory is completed)\n                    buffer.store(np.array(temp_buf), 0)\n                    # Empty temporary buffer\n                    temp_buf = []\n\n                    batch_rew.append(env.get_episode_reward())\n                    batch_len.append(env.get_episode_length())\n\n                    env.reset()\n                    \n            # Bootstrap with the estimated state value of the next state!\n            lsv = sess.run(s_values, feed_dict={obs_ph:[env.n_obs]})\n            buffer.store(np.array(temp_buf), np.squeeze(lsv))\n\n\n        # Get the entire batch from the buffer\n        # NB: the whole batch is used and then deleted after the optimization. 
This is because TRPO is on-policy\n        obs_batch, act_batch, adv_batch, rtg_batch = buffer.get_batch()\n\n        # log probabilities, logits and log std of the \"old\" policy\n        # the \"old\" policy refers to the policy being optimized, which was used to sample from the environment\n        old_p_log, old_p_means, old_log_std = sess.run([p_log, p_means, log_std], feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})\n        # also get the \"old\" parameters\n        old_actor_params = sess.run(p_var_flatten)\n\n        # old_p_loss is later used in the line search\n        # run pre_scalar_summary for a summary before the optimization\n        old_p_loss, summary = sess.run([p_loss,pre_scalar_summary], feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})\n        file_writer.add_summary(summary, step_count)\n\n        def H_f(p):\n            '''\n            Run the Fisher-Vector product on 'p' to approximate the Hessian of the DKL\n            '''\n            return sess.run(Fx, feed_dict={old_mu_ph:old_p_means, old_log_std_ph:old_log_std, p_ph:p, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})\n\n        g_f = sess.run(p_grads_flatten, feed_dict={old_mu_ph:old_p_means,obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})\n        ## Compute the Conjugate Gradient so as to obtain an approximation of H^(-1)*g\n        # Where H in reality isn't the true Hessian of the KL divergence but an approximation of it computed via Fisher-Vector Product (F)\n        conj_grad = conjugate_gradient(H_f, g_f, iters=conj_iters)\n\n        # Compute the step length\n        beta_np = np.sqrt(2*delta / np.sum(conj_grad * H_f(conj_grad)))\n        \n        def DKL(alpha_v):\n            '''\n            Compute the KL divergence.\n            It applies the candidate update, computes the DKL and the new loss, and then restores the old parameters.\n            '''\n            sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:alpha_v, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})\n            a_res = sess.run([dkl_diverg, p_loss], feed_dict={old_mu_ph:old_p_means, old_log_std_ph:old_log_std, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})\n            sess.run(restore_params, feed_dict={p_old_variables: old_actor_params})\n            return a_res\n\n        # Actor optimization step\n        # Different for TRPO or NPG\n        if algorithm=='TRPO':\n            # Backtracking line search to find the maximum alpha coefficient s.t. 
the constraint is valid\n            best_alpha = backtracking_line_search(DKL, delta, old_p_loss, p=0.8)\n            sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:best_alpha, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})\n        elif algorithm=='NPG':\n            # In case of NPG, no line search\n            sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:1, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})\n\n\n        lb = len(buffer)\n        shuffled_batch = np.arange(lb)\n        np.random.shuffle(shuffled_batch)\n\n        # Value function optimization steps\n        for _ in range(critic_iter):\n            # shuffle the batch on every iteration\n            np.random.shuffle(shuffled_batch)\n            for idx in range(0,lb, minibatch_size):\n                minib = shuffled_batch[idx:min(idx+minibatch_size,lb)]\n                sess.run(v_opt, feed_dict={obs_ph:obs_batch[minib], ret_ph:rtg_batch[minib]})\n\n        # print some statistics and run the summary for visualizing it on TB\n        if len(batch_rew) > 0:\n            train_summary = sess.run(scalar_summary, feed_dict={beta_ph:beta_np, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, cg_ph:conj_grad,\n                                                                old_p_log_ph:old_p_log, ret_ph:rtg_batch, old_mu_ph:old_p_means, old_log_std_ph:old_log_std})\n            file_writer.add_summary(train_summary, step_count)\n            \n            summary = tf.Summary()\n            summary.value.add(tag='supplementary/performance', simple_value=np.mean(batch_rew))\n            summary.value.add(tag='supplementary/len', simple_value=np.mean(batch_len))\n            file_writer.add_summary(summary, step_count)\n            file_writer.flush()\n\n            print('Ep:%d Rew:%.2f -- Step:%d' % (ep, np.mean(batch_rew), step_count))\n\n    # closing environments..\n    for env in envs:\n        env.close()\n\n    file_writer.close()\n\nif __name__ == '__main__':\n    TRPO('RoboschoolWalker2d-v1', hidden_sizes=[64,64], cr_lr=2e-3, gamma=0.99, lam=0.95, num_epochs=1000, steps_per_env=6000, \n         number_envs=1, critic_iter=10, delta=0.01, algorithm='TRPO', conj_iters=10, minibatch_size=1000)\n"
  },
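  {
    "path": "examples/gae_rtg_sketch.py",
    "content": "# Illustrative companion sketch, not part of the original chapter code: the file\n# name and the toy numbers below are made up. It re-implements the chapter's\n# discounted_rewards and GAE helpers with NumPy only, so the reward-to-go and\n# advantage computations can be checked by hand on a tiny trajectory.\nimport numpy as np\n\ndef discounted_rewards(rews, last_sv, gamma):\n    # discounted reward to go, bootstrapped with the value of the last state\n    rtg = np.zeros_like(rews, dtype=np.float32)\n    rtg[-1] = rews[-1] + gamma*last_sv\n    for i in reversed(range(len(rews)-1)):\n        rtg[i] = rews[i] + gamma*rtg[i+1]\n    return rtg\n\ndef GAE(rews, v, v_last, gamma=0.99, lam=0.95):\n    # TD residuals d_t = r_t + gamma*V(s_t+1) - V(s_t), then discount them by gamma*lam\n    vs = np.append(v, v_last)\n    d = np.array(rews) + gamma*vs[1:] - vs[:-1]\n    return discounted_rewards(d, 0, gamma*lam)\n\nif __name__ == '__main__':\n    rews = [1., 1., 1.]    # toy 3-step trajectory with constant reward\n    v = [0.5, 0.4, 0.3]    # critic estimates of the three states\n    print('rtg:', discounted_rewards(rews, 0, 0.99))  # ~[2.97 1.99 1.]\n    print('gae:', GAE(rews, v, 0.0))\n"
  },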
  {
    "path": "Chapter08/DDPG.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nfrom collections import deque\nimport time\n\ncurrent_milli_time = lambda: int(round(time.time() * 1000))\n\ndef mlp(x, hidden_layers, output_layer, activation=tf.nn.relu, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_layer, activation=last_activation)\n\ndef deterministic_actor_critic(x, a, hidden_sizes, act_dim, max_act):\n    '''\n    Deterministic Actor-Critic\n    '''\n    # Actor\n    with tf.variable_scope('p_mlp'):\n        p_means = max_act * mlp(x, hidden_sizes, act_dim, last_activation=tf.tanh) \n    \n    # Critic with as input the deterministic action of the actor\n    with tf.variable_scope('q_mlp'):\n        q_d = mlp(tf.concat([x,p_means], axis=-1), hidden_sizes, 1, last_activation=None) \n    \n    # Critic with as input an arbirtary action\n    with tf.variable_scope('q_mlp', reuse=True): # Use the weights of the mlp just defined\n        q_a = mlp(tf.concat([x,a], axis=-1), hidden_sizes, 1, last_activation=None)\n\n    return p_means, tf.squeeze(q_d), tf.squeeze(q_a)\n\nclass ExperiencedBuffer():\n    '''\n    Experienced buffer\n    '''\n    def __init__(self, buffer_size):\n        # Contains up to 'buffer_size' experience\n        self.obs_buf = deque(maxlen=buffer_size)\n        self.rew_buf = deque(maxlen=buffer_size)\n        self.act_buf = deque(maxlen=buffer_size)\n        self.obs2_buf = deque(maxlen=buffer_size)\n        self.done_buf = deque(maxlen=buffer_size)\n\n\n    def add(self, obs, rew, act, obs2, done):\n        '''\n        Add a new transition to the buffers\n        '''\n        self.obs_buf.append(obs)\n        self.rew_buf.append(rew)\n        self.act_buf.append(act)\n        self.obs2_buf.append(obs2)\n        self.done_buf.append(done)\n        \n\n    def sample_minibatch(self, batch_size):\n        '''\n        Sample a mini-batch of size 'batch_size'\n        '''\n        mb_indices = np.random.randint(len(self.obs_buf), size=batch_size)\n\n        mb_obs = [self.obs_buf[i] for i in mb_indices]\n        mb_rew = [self.rew_buf[i] for i in mb_indices]\n        mb_act = [self.act_buf[i] for i in mb_indices]\n        mb_obs2 = [self.obs2_buf[i] for i in mb_indices]\n        mb_done = [self.done_buf[i] for i in mb_indices]\n\n        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done\n\n    def __len__(self):\n        return len(self.obs_buf)\n\ndef test_agent(env_test, agent_op, num_games=10):\n    '''\n    Test an agent 'agent_op', 'num_games' times\n    Return mean and std\n    '''\n    games_r = []\n    for _ in range(num_games):\n        d = False\n        game_r = 0\n        o = env_test.reset()\n\n        while not d:\n            a_s = agent_op(o)\n            o, r, d, _ = env_test.step(a_s)\n            game_r += r\n\n        games_r.append(game_r)\n    return np.mean(games_r), np.std(games_r)\n\n\n\ndef DDPG(env_name, hidden_sizes=[32], ac_lr=1e-2, cr_lr=1e-2, num_epochs=2000, buffer_size=5000, discount=0.99, render_cycle=100, mean_summaries_steps=1000, \n        batch_size=128, min_buffer_size=5000, tau=0.005):\n\n    # Create an environment for training\n    env = gym.make(env_name)\n    # Create an environment for testing the actor\n    env_test = gym.make(env_name)\n\n    tf.reset_default_graph()\n\n    obs_dim = env.observation_space.shape\n    act_dim = 
env.action_space.shape\n    print('-- Observation space:', obs_dim, ' Action space:', act_dim, '--')\n\n    # Create some placeholders\n    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')\n    act_ph = tf.placeholder(shape=(None, act_dim[0]), dtype=tf.float32, name='act')\n    y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')\n\n    # Create an online deterministic actor-critic \n    with tf.variable_scope('online'):\n        p_onl, qd_onl, qa_onl = deterministic_actor_critic(obs_ph, act_ph, hidden_sizes, act_dim[0], np.max(env.action_space.high))\n    # and a target one\n    with tf.variable_scope('target'):\n        _, qd_tar, _ = deterministic_actor_critic(obs_ph, act_ph, hidden_sizes, act_dim[0], np.max(env.action_space.high))\n\n    def variables_in_scope(scope):\n        '''\n        Retrieve all the variables in the scope 'scope'\n        '''\n        return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope)\n\n    # Copy all the online variables to the target networks i.e. target = online\n    # Needed only at the beginning\n    init_target = [target_var.assign(online_var) for target_var, online_var in zip(variables_in_scope('target'), variables_in_scope('online'))]\n    init_target_op = tf.group(*init_target)\n\n    # Soft update\n    update_target = [target_var.assign(tau*online_var + (1-tau)*target_var) for target_var, online_var in zip(variables_in_scope('target'), variables_in_scope('online'))]\n    update_target_op = tf.group(*update_target)\n\n    # Critic loss (MSE)\n    q_loss = tf.reduce_mean((qa_onl - y_ph)**2) \n    # Actor loss\n    p_loss = -tf.reduce_mean(qd_onl)\n\n    # Optimize the critic\n    q_opt = tf.train.AdamOptimizer(cr_lr).minimize(q_loss)\n    # Optimize the actor\n    p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss, var_list=variables_in_scope('online/p_mlp'))\n\n\n    def agent_op(o):\n        a = np.squeeze(sess.run(p_onl, feed_dict={obs_ph:[o]}))\n        return np.clip(a, env.action_space.low, env.action_space.high)\n\n    def agent_noisy_op(o, scale):\n        action = agent_op(o)\n        noisy_action = action + np.random.normal(loc=0.0, scale=scale, size=action.shape)\n        return np.clip(noisy_action, env.action_space.low, env.action_space.high)\n\n\n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, int(now.second))\n    print('Time:', clock_time)\n\n\n    # Set TensorBoard\n    tf.summary.scalar('loss/q', q_loss)\n    tf.summary.scalar('loss/p', p_loss)\n    scalar_summary = tf.summary.merge_all()\n\n    hyp_str = '-aclr_'+str(ac_lr)+'-crlr_'+str(cr_lr)+'-tau_'+str(tau)\n\n    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/DDPG_'+clock_time+'_'+hyp_str, tf.get_default_graph())\n\n    # Create a session and initialize the variables\n    sess = tf.Session()\n    sess.run(tf.global_variables_initializer())\n    sess.run(init_target_op)\n    \n    # Some useful variables..\n    render_the_game = False\n    step_count = 0\n    last_q_update_loss = []\n    last_p_update_loss = []\n    ep_time = current_milli_time()\n    batch_rew = []\n\n    # Reset the environment\n    obs = env.reset()\n    # Initialize the buffer\n    buffer = ExperiencedBuffer(buffer_size)\n\n\n    for ep in range(num_epochs):\n        g_rew = 0\n        done = False\n\n        while not done:\n            # If not gathered enough experience yet, act randomly\n            if len(buffer) < min_buffer_size:\n                act = 
env.action_space.sample()\n            else:\n                act = agent_noisy_op(obs, 0.1)\n\n            # Take a step in the environment\n            obs2, rew, done, _ = env.step(act)\n\n            if render_the_game:\n                env.render()\n\n            # Add the transition in the buffer\n            buffer.add(obs.copy(), rew, act, obs2.copy(), done)\n\n            obs = obs2\n            g_rew += rew\n            step_count += 1\n\n            if len(buffer) > min_buffer_size:\n                # sample a mini batch from the buffer\n                mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)\n\n                # Compute the target values\n                q_target_mb = sess.run(qd_tar, feed_dict={obs_ph:mb_obs2})\n                y_r = np.array(mb_rew) + discount*(1-np.array(mb_done))*q_target_mb\n\n                # optimize the critic\n                train_summary, _, q_train_loss = sess.run([scalar_summary, q_opt, q_loss], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph: mb_act})\n                \n                # optimize the actor\n                _, p_train_loss = sess.run([p_opt, p_loss], feed_dict={obs_ph:mb_obs})\n\n                # summaries..\n                file_writer.add_summary(train_summary, step_count)\n                last_q_update_loss.append(q_train_loss)\n                last_p_update_loss.append(p_train_loss)\n\n                # Soft update of the target networks\n                sess.run(update_target_op)\n\n                # some 'mean' summaries to plot more smooth functions\n                if step_count % mean_summaries_steps == 0:\n                    summary = tf.Summary()\n                    summary.value.add(tag='loss/mean_q', simple_value=np.mean(last_q_update_loss))\n                    summary.value.add(tag='loss/mean_p', simple_value=np.mean(last_p_update_loss))\n                    file_writer.add_summary(summary, step_count)\n                    file_writer.flush()\n\n                    last_q_update_loss = []\n                    last_p_update_loss = []\n\n\n            if done:\n                obs = env.reset()\n                batch_rew.append(g_rew)\n                g_rew, render_the_game = 0, False\n\n        # Test the actor every 10 epochs\n        if ep % 10 == 0:\n            test_mn_rw, test_std_rw = test_agent(env_test, agent_op)\n\n            summary = tf.Summary()\n            summary.value.add(tag='test/reward', simple_value=test_mn_rw)\n            file_writer.add_summary(summary, step_count)\n            file_writer.flush()\n\n            ep_sec_time = int((current_milli_time()-ep_time) / 1000)\n            print('Ep:%4d Rew:%4.2f -- Step:%5d -- Test:%4.2f %4.2f -- Time:%d' %  (ep,np.mean(batch_rew), step_count, test_mn_rw, test_std_rw, ep_sec_time))\n\n            ep_time = current_milli_time()\n            batch_rew = []\n                \n        if ep % render_cycle == 0:\n            render_the_game = True\n\n    # close everything\n    file_writer.close()\n    env.close()\n    env_test.close()\n\n\nif __name__ == '__main__':\n    DDPG('BipedalWalker-v2', hidden_sizes=[64,64], ac_lr=3e-4, cr_lr=4e-4, buffer_size=200000, mean_summaries_steps=100, batch_size=64, \n        min_buffer_size=10000, tau=0.003)\n    \n"
  },
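  {
    "path": "examples/ddpg_target_sketch.py",
    "content": "# Illustrative companion sketch, not part of the original chapter code (the file\n# name is made up). It shows, with NumPy only, the two update rules that DDPG.py\n# builds as TensorFlow ops: the soft (Polyak) target update\n# target = tau*online + (1-tau)*target, and the TD target\n# y = r + discount*(1-done)*Q_target(s2, mu_target(s2)).\nimport numpy as np\n\ndef soft_update(target_params, online_params, tau=0.005):\n    # one array per network variable, same rule as update_target_op\n    return [tau*o + (1-tau)*t for t, o in zip(target_params, online_params)]\n\ndef td_targets(rews, dones, q_target_next, discount=0.99):\n    # the (1-done) mask suppresses bootstrapping on terminal transitions\n    return np.array(rews) + discount*(1-np.array(dones))*np.array(q_target_next)\n\nif __name__ == '__main__':\n    target = [np.zeros(3)]\n    online = [np.ones(3)]\n    print(soft_update(target, online, tau=0.1)[0])    # [0.1 0.1 0.1]\n    print(td_targets([1., 1.], [0, 1], [5., 5.]))     # [5.95 1.  ]\n"
  },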
  {
    "path": "Chapter08/TD3.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nfrom collections import deque\nimport time\n\ncurrent_milli_time = lambda: int(round(time.time() * 1000))\n\ndef mlp(x, hidden_layers, output_layer, activation=tf.nn.relu, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_layer, activation=last_activation)\n\n# CHANGED FROM DDPG!\ndef deterministic_actor_double_critic(x, a, hidden_sizes, act_dim, max_act=1):\n    '''\n    Deterministic Actor-Critic\n    '''\n    # Actor\n    with tf.variable_scope('p_mlp'):\n        p_means = max_act * mlp(x, hidden_sizes, act_dim, last_activation=tf.tanh) \n    \n    # First critic\n    with tf.variable_scope('q1_mlp'):\n        q1_d = mlp(tf.concat([x,p_means], axis=-1), hidden_sizes, 1, last_activation=None) \n    \n    with tf.variable_scope('q1_mlp', reuse=True): # Use the weights of the mlp just defined\n        q1_a = mlp(tf.concat([x,a], axis=-1), hidden_sizes, 1, last_activation=None)\n\n    # Second critic\n    with tf.variable_scope('q2_mlp'):\n        q2_d = mlp(tf.concat([x,p_means], axis=-1), hidden_sizes, 1, last_activation=None)\n    with tf.variable_scope('q2_mlp', reuse=True):\n        q2_a = mlp(tf.concat([x,a], axis=-1), hidden_sizes, 1, last_activation=None)\n\n    return p_means, tf.squeeze(q1_d), tf.squeeze(q1_a), tf.squeeze(q2_d), tf.squeeze(q2_a)\n\nclass ExperiencedBuffer():\n    '''\n    Experienced buffer\n    '''\n    def __init__(self, buffer_size):\n        # Contains up to 'buffer_size' experience\n        self.obs_buf = deque(maxlen=buffer_size)\n        self.rew_buf = deque(maxlen=buffer_size)\n        self.act_buf = deque(maxlen=buffer_size)\n        self.obs2_buf = deque(maxlen=buffer_size)\n        self.done_buf = deque(maxlen=buffer_size)\n\n\n    def add(self, obs, rew, act, obs2, done):\n        '''\n        Add a new transition to the buffers\n        '''\n        self.obs_buf.append(obs)\n        self.rew_buf.append(rew)\n        self.act_buf.append(act)\n        self.obs2_buf.append(obs2)\n        self.done_buf.append(done)\n        \n\n    def sample_minibatch(self, batch_size):\n        '''\n        Sample a mini-batch of size 'batch_size'\n        '''\n        mb_indices = np.random.randint(len(self.obs_buf), size=batch_size)\n\n        mb_obs = [self.obs_buf[i] for i in mb_indices]\n        mb_rew = [self.rew_buf[i] for i in mb_indices]\n        mb_act = [self.act_buf[i] for i in mb_indices]\n        mb_obs2 = [self.obs2_buf[i] for i in mb_indices]\n        mb_done = [self.done_buf[i] for i in mb_indices]\n\n        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done\n\n    def __len__(self):\n        return len(self.obs_buf)\n\ndef test_agent(env_test, agent_op, num_games=10):\n    '''\n    Test an agent 'agent_op', 'num_games' times\n    Return mean and std\n    '''\n    games_r = []\n\n    for _ in range(num_games):\n        d = False\n        game_r = 0\n        o = env_test.reset()\n\n        while not d:\n            a_s = agent_op(o)\n            o, r, d, _ = env_test.step(a_s)\n\n            game_r += r\n\n        games_r.append(game_r)\n\n    return np.mean(games_r), np.std(games_r)\n\n\n\ndef TD3(env_name, hidden_sizes=[32], ac_lr=1e-2, cr_lr=1e-2, num_epochs=2000, buffer_size=5000, discount=0.99, render_cycle=10000, mean_summaries_steps=1000, \n        batch_size=128, min_buffer_size=5000, 
tau=0.005, target_noise=0.2, expl_noise=0.1, policy_update_freq=2):\n\n    # Create an environment for training\n    env = gym.make(env_name)\n    # Create an environment for testing the actor\n    env_test = gym.make(env_name)\n\n    tf.reset_default_graph()\n\n    obs_dim = env.observation_space.shape\n    act_dim = env.action_space.shape\n    print('-- Observation space:', obs_dim, ' Action space:', act_dim, '--')\n\n    # Create some placeholders\n    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')\n    act_ph = tf.placeholder(shape=(None, act_dim[0]), dtype=tf.float32, name='act')\n    y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')\n\n    # Create an online deterministic actor and a double critic \n    with tf.variable_scope('online'):\n        p_onl, qd1_onl, qa1_onl, _, qa2_onl = deterministic_actor_double_critic(obs_ph, act_ph, hidden_sizes, act_dim[0], np.max(env.action_space.high))\n\n    # and a target actor and double critic\n    with tf.variable_scope('target'):\n        p_tar, _, qa1_tar, _, qa2_tar = deterministic_actor_double_critic(obs_ph, act_ph, hidden_sizes, act_dim[0], np.max(env.action_space.high))\n\n    def variables_in_scope(scope):\n        '''\n        Retrieve all the variables in the scope 'scope'\n        '''\n        return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope)\n\n    # Copy all the online variables to the target networks i.e. target = online\n    # Needed only at the beginning\n    init_target = [target_var.assign(online_var) for target_var, online_var in zip(variables_in_scope('target'), variables_in_scope('online'))]\n    init_target_op = tf.group(*init_target)\n\n    # Soft update\n    update_target = [target_var.assign(tau*online_var + (1-tau)*target_var) for target_var, online_var in zip(variables_in_scope('target'), variables_in_scope('online'))]\n    update_target_op = tf.group(*update_target)\n\n    # Critics loss (MSE)\n    q1_loss = tf.reduce_mean((qa1_onl - y_ph)**2) \n    q2_loss = tf.reduce_mean((qa2_onl - y_ph)**2)\n\n    # Actor loss\n    p_loss = -tf.reduce_mean(qd1_onl)\n    \n    # Optimize the critics\n    q1_opt = tf.train.AdamOptimizer(cr_lr).minimize(q1_loss)\n    q2_opt = tf.train.AdamOptimizer(cr_lr).minimize(q2_loss)\n\n    # Optimize the actor\n    p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss, var_list=variables_in_scope('online/p_mlp'))\n\n\n    def add_normal_noise(x, scale, low_lim=-0.5, high_lim=0.5):\n        return x + np.clip(np.random.normal(loc=0.0, scale=scale, size=x.shape), low_lim, high_lim)\n\n    def agent_op(o):\n        ac = np.squeeze(sess.run(p_onl, feed_dict={obs_ph:[o]}))\n        return np.clip(ac, env.action_space.low, env.action_space.high)\n\n    def agent_noisy_op(o, scale):\n        ac = agent_op(o)\n        return np.clip(add_normal_noise(ac, scale, env.action_space.low, env.action_space.high), env.action_space.low, env.action_space.high)\n\n\n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, int(now.second))\n    print('Time:', clock_time)\n\n    # Set TensorBoard\n    tf.summary.scalar('loss/q1', q1_loss)\n    tf.summary.scalar('loss/q2', q2_loss)\n    tf.summary.scalar('loss/p', p_loss)\n    scalar_summary = tf.summary.merge_all()\n\n    hyp_str = '-aclr_'+str(ac_lr)+'-crlr_'+str(cr_lr)+'-tau_'+str(tau)\n\n    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/TD3_'+clock_time+'_'+hyp_str, tf.get_default_graph())\n\n    # Create a session and initialize the 
variables\n    sess = tf.Session()\n    sess.run(tf.global_variables_initializer())\n    sess.run(init_target_op)\n    \n    # Some useful variables..\n    render_the_game = False\n    step_count = 0\n    last_q1_update_loss = []\n    last_q2_update_loss = []\n    last_p_update_loss = []\n    ep_time = current_milli_time()\n    batch_rew = []\n\n    # Reset the environment\n    obs = env.reset()\n    # Initialize the buffer\n    buffer = ExperiencedBuffer(buffer_size)\n\n\n    for ep in range(num_epochs):\n        g_rew = 0\n        done = False\n\n        while not done:\n            # If not gathered enough experience yet, act randomly\n            if len(buffer) < min_buffer_size:\n                act = env.action_space.sample()\n            else:\n                act = agent_noisy_op(obs, expl_noise)\n\n            # Take a step in the environment\n            obs2, rew, done, _ = env.step(act)\n\n            if render_the_game:\n                env.render()\n\n            # Add the transition in the buffer\n            buffer.add(obs.copy(), rew, act, obs2.copy(), done)\n\n            obs = obs2\n            g_rew += rew\n            step_count += 1\n\n            if len(buffer) > min_buffer_size:\n                # sample a mini batch from the buffer\n                mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)\n\n\n                double_actions = sess.run(p_tar, feed_dict={obs_ph:mb_obs2})\n                # Target regularization\n                double_noisy_actions = np.clip(add_normal_noise(double_actions, target_noise), env.action_space.low, env.action_space.high)\n\n                # Clipped Double Q-learning\n                q1_target_mb, q2_target_mb = sess.run([qa1_tar,qa2_tar], feed_dict={obs_ph:mb_obs2, act_ph:double_noisy_actions})\n                q_target_mb = np.min([q1_target_mb, q2_target_mb], axis=0) \n                assert(len(q1_target_mb) == len(q_target_mb))\n\n                # Compute the target values\n                y_r = np.array(mb_rew) + discount*(1-np.array(mb_done))*q_target_mb\n\n                # Optimize the critics\n                train_summary, _, q1_train_loss, _, q2_train_loss = sess.run([scalar_summary, q1_opt, q1_loss, q2_opt, q2_loss], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph: mb_act})\n\n                # Delayed policy update\n                if step_count % policy_update_freq == 0:\n                    # Optimize the policy\n                    _, p_train_loss = sess.run([p_opt, p_loss], feed_dict={obs_ph:mb_obs})\n\n                    # Soft update of the target networks\n                    sess.run(update_target_op)\n\n                    file_writer.add_summary(train_summary, step_count)\n                    last_q1_update_loss.append(q1_train_loss)\n                    last_q2_update_loss.append(q2_train_loss)\n                    last_p_update_loss.append(p_train_loss)\n\n                \n                # some 'mean' summaries to plot more smooth functions\n                if step_count % mean_summaries_steps == 0:\n                    summary = tf.Summary()\n                    summary.value.add(tag='loss/mean_q1', simple_value=np.mean(last_q1_update_loss))\n                    summary.value.add(tag='loss/mean_q2', simple_value=np.mean(last_q2_update_loss))\n                    summary.value.add(tag='loss/mean_p', simple_value=np.mean(last_p_update_loss))\n                    file_writer.add_summary(summary, step_count)\n                    file_writer.flush()\n\n                    
last_q1_update_loss = []\n                    last_q2_update_loss = []\n                    last_p_update_loss = []\n\n\n            if done:\n                obs = env.reset()\n                batch_rew.append(g_rew)\n                g_rew, render_the_game = 0, False\n\n        # Test the actor every 10 epochs\n        if ep % 10 == 0:\n            test_mn_rw, test_std_rw = test_agent(env_test, agent_op)\n            summary = tf.Summary()\n            summary.value.add(tag='test/reward', simple_value=test_mn_rw)\n            file_writer.add_summary(summary, step_count)\n            file_writer.flush()\n\n            ep_sec_time = int((current_milli_time()-ep_time) / 1000)\n            print('Ep:%4d Rew:%4.2f -- Step:%5d -- Test:%4.2f %4.2f -- Time:%d' %  (ep,np.mean(batch_rew), step_count, test_mn_rw, test_std_rw, ep_sec_time))\n\n            ep_time = current_milli_time()\n            batch_rew = []\n                \n        if ep % render_cycle == 0:\n            render_the_game = True\n\n    # close everything\n    file_writer.close()\n    env.close()\n    env_test.close()\n\n\nif __name__ == '__main__':\n    TD3('BipedalWalker-v2', hidden_sizes=[64,64], ac_lr=4e-4, cr_lr=4e-4, buffer_size=200000, mean_summaries_steps=100, batch_size=64, \n        min_buffer_size=10000, tau=0.005, policy_update_freq=2, target_noise=0.1)"
  },
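  {
    "path": "examples/td3_target_sketch.py",
    "content": "# Illustrative companion sketch, not part of the original chapter code (the file\n# name is made up). It isolates, with NumPy only, the two target-side tricks that\n# TD3.py adds on top of DDPG: target policy smoothing (clipped Gaussian noise on\n# the target action) and clipped double Q-learning (minimum of the two target\n# critics).\nimport numpy as np\n\ndef smooth_target_action(act, act_low, act_high, scale=0.2, noise_clip=0.5):\n    # clip the noise first, then clip the perturbed action to the valid range\n    noise = np.clip(np.random.normal(0.0, scale, size=act.shape), -noise_clip, noise_clip)\n    return np.clip(act + noise, act_low, act_high)\n\ndef clipped_double_q_target(rews, dones, q1_next, q2_next, discount=0.99):\n    # pessimistic value estimate: elementwise minimum of the two critics\n    q_min = np.minimum(q1_next, q2_next)\n    return np.array(rews) + discount*(1-np.array(dones))*q_min\n\nif __name__ == '__main__':\n    act = np.array([0.9, -0.2])\n    print(smooth_target_action(act, -1.0, 1.0))\n    print(clipped_double_q_target([1.], [0], np.array([4.]), np.array([6.])))  # [4.96]\n"
  },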
  {
    "path": "Chapter09/ME-TRPO.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nimport roboschool\n\n\ndef mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_layer, activation=last_activation)\n\ndef softmax_entropy(logits):\n    '''\n    Softmax Entropy\n    '''\n    return -tf.reduce_sum(tf.nn.softmax(logits, axis=-1) * tf.nn.log_softmax(logits, axis=-1), axis=-1)\n\ndef gaussian_log_likelihood(ac, mean, log_std):\n    '''\n    Gaussian Log Likelihood \n    '''\n    log_p = ((ac-mean)**2 / (tf.exp(log_std)**2+1e-9) + 2*log_std) + np.log(2*np.pi)\n    return -0.5 * tf.reduce_sum(log_p, axis=-1)\n\ndef conjugate_gradient(A, b, x=None, iters=10):\n    '''\n    Conjugate gradient method: approximate the solution of Ax=b\n    It solve Ax=b without forming the full matrix, just compute the matrix-vector product (The Fisher-vector product)\n    \n    NB: A is not the full matrix but is a useful matrix-vector product between the averaged Fisher information matrix and arbitrary vectors \n    Descibed in Appendix C.1 of the TRPO paper\n    '''\n    if x is None:\n        x = np.zeros_like(b)\n        \n    r = A(x) - b\n    p = -r\n    for _ in range(iters):\n        a = np.dot(r, r) / (np.dot(p, A(p))+1e-8)\n        x += a*p\n        r_n = r + a*A(p)\n        b = np.dot(r_n, r_n) / (np.dot(r, r)+1e-8)\n        p = -r_n + b*p\n        r = r_n\n    return x\n\ndef gaussian_DKL(mu_q, log_std_q, mu_p, log_std_p):\n    '''\n    Gaussian KL divergence in case of a diagonal covariance matrix\n    '''\n    return tf.reduce_mean(tf.reduce_sum(0.5 * (log_std_p - log_std_q + tf.exp(log_std_q - log_std_p) + (mu_q - mu_p)**2 / tf.exp(log_std_p) - 1), axis=1))\n\ndef backtracking_line_search(Dkl, delta, old_loss, p=0.8):\n    '''\n    Backtracking line searc. It look for a coefficient s.t. 
the constraint on the DKL is satisfied\n    It has both to\n     - improve the non-linear objective\n     - satisfy the constraint\n\n    '''\n    ## Explained in Appendix C of the TRPO paper\n    a = 1\n    it = 0\n    \n    new_dkl, new_loss = Dkl(a) \n    while (new_dkl > delta) or (new_loss > old_loss):\n        a *= p\n        it += 1\n        new_dkl, new_loss = Dkl(a)\n\n    return a\n\ndef GAE(rews, v, v_last, gamma=0.99, lam=0.95):\n    '''\n    Generalized Advantage Estimation\n    '''\n    assert len(rews) == len(v)\n    vs = np.append(v, v_last)\n    d = np.array(rews) + gamma*vs[1:] - vs[:-1]\n    gae_advantage = discounted_rewards(d, 0, gamma*lam)\n    return gae_advantage\n\ndef discounted_rewards(rews, last_sv, gamma):\n    '''\n    Discounted reward to go \n\n    Parameters:\n    ----------\n    rews: list of rewards\n    last_sv: value of the last state\n    gamma: discount value \n    '''\n    rtg = np.zeros_like(rews, dtype=np.float32)\n    rtg[-1] = rews[-1] + gamma*last_sv\n    for i in reversed(range(len(rews)-1)):\n        rtg[i] = rews[i] + gamma*rtg[i+1]\n    return rtg\n\ndef flatten_list(tensor_list):\n    '''\n    Flatten a list of tensors\n    '''\n    return tf.concat([flatten(t) for t in tensor_list], axis=0)\n\ndef flatten(tensor):\n    '''\n    Flatten a tensor\n    '''\n    return tf.reshape(tensor, shape=(-1,))\n\n  \ndef test_agent(env_test, agent_op, num_games=10):\n    '''\n    Test an agent 'agent_op', 'num_games' times\n    Return mean and std\n    '''\n    games_r = []\n    for _ in range(num_games):\n        d = False\n        game_r = 0\n        o = env_test.reset()\n\n        while not d:\n            a_s, _ = agent_op([o])\n            o, r, d, _ = env_test.step(a_s)\n            game_r += r\n\n        games_r.append(game_r)\n    return np.mean(games_r), np.std(games_r)\n\nclass Buffer():\n    '''\n    Class to store the experience from a unique policy\n    '''\n    def __init__(self, gamma=0.99, lam=0.95):\n        self.gamma = gamma\n        self.lam = lam\n        self.adv = []\n        self.ob = []\n        self.ac = []\n        self.rtg = []\n\n    def store(self, temp_traj, last_sv):\n        '''\n        Add temp_traj values to the buffers and compute the advantage and reward to go\n\n        Parameters:\n        -----------\n        temp_traj: list where each element is a list that contains: observation, reward, action, state-value\n        last_sv: value of the last state (Used to Bootstrap)\n        '''\n        # store only if there are temporary trajectories\n        if len(temp_traj) > 0:\n            self.ob.extend(temp_traj[:,0])\n            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)\n            self.adv.extend(GAE(temp_traj[:,1], temp_traj[:,3], last_sv, self.gamma, self.lam))\n            self.rtg.extend(rtg)\n            self.ac.extend(temp_traj[:,2])\n\n    def get_batch(self):\n        # standardize the advantage values\n        norm_adv = (self.adv - np.mean(self.adv)) / (np.std(self.adv) + 1e-10)\n        return np.array(self.ob), np.array(np.expand_dims(self.ac,-1)), np.array(norm_adv), np.array(self.rtg)\n\n    def __len__(self):\n        assert(len(self.adv) == len(self.ob) == len(self.ac) == len(self.rtg))\n        return len(self.ob)\n\n\nclass FullBuffer():\n    def __init__(self):\n        self.rew = []\n        self.obs = []\n        self.act = []\n        self.nxt_obs = []\n        self.done = []\n        \n        self.train_idx = []\n        self.valid_idx = []\n        self.idx = 0\n\n      
  \n    def store(self, obs, act, rew, nxt_obs, done):\n        self.rew.append(rew)\n        self.obs.append(obs)\n        self.act.append(act)\n        self.nxt_obs.append(nxt_obs)\n        self.done.append(done)\n          \n        self.idx += 1\n\n    def generate_random_dataset(self):\n        rnd = np.arange(len(self.obs))\n        np.random.shuffle(rnd)\n        self.valid_idx = rnd[ : int(len(self.obs)/3)]\n        self.train_idx = rnd[int(len(self.obs)/3) : ]\n        print('Train set:', len(self.train_idx), 'Valid set:', len(self.valid_idx))  \n      \n    def get_training_batch(self):\n        return np.array(self.obs)[self.train_idx], np.array(np.expand_dims(self.act,-1))[self.train_idx], np.array(self.rew)[self.train_idx], np.array(self.nxt_obs)[self.train_idx], np.array(self.done)[self.train_idx]\n      \n      \n    def get_valid_batch(self):\n        return np.array(self.obs)[self.valid_idx], np.array(np.expand_dims(self.act,-1))[self.valid_idx], np.array(self.rew)[self.valid_idx], np.array(self.nxt_obs)[self.valid_idx], np.array(self.done)[self.valid_idx]\n      \n    def __len__(self):\n        assert(len(self.rew) == len(self.obs) == len(self.act) == len(self.nxt_obs) == len(self.done))\n        return len(self.obs)\n\n      \n      \ndef simulate_environment(env, policy, simulated_steps):\n\n    buffer = Buffer(0.99, 0.95)\n    # lists to store rewards and length of the trajectories completed\n    steps = 0\n    number_episodes = 0\n\n    while steps < simulated_steps:\n        temp_buf = []\n        obs = env.reset()\n        number_episodes += 1\n        done = False\n\n        while not done:\n            act, val = policy([obs])\n\n            obs2, rew, done, _ = env.step([act])\n          \n            temp_buf.append([obs.copy(), rew, np.squeeze(act), np.squeeze(val)])\n\n            obs = obs2.copy()\n            steps += 1\n                \n            if done:\n                buffer.store(np.array(temp_buf), 0)\n                temp_buf = []\n\n            if steps == simulated_steps:\n                break\n\n        buffer.store(np.array(temp_buf), np.squeeze(policy([obs])[1]))\n        \n    print('Sim ep:',number_episodes, end=' ')\n    \n    return buffer.get_batch()\n\n\nclass NetworkEnv(gym.Wrapper):\n    def __init__(self, env, model_func, reward_func, done_func, number_models):\n        gym.Wrapper.__init__(self, env)\n        self.model_func = model_func\n        self.reward_func = reward_func\n        self.done_func = done_func\n        self.number_models = number_models\n        self.len_episode = 0\n\n    def reset(self, **kwargs):\n        self.len_episode = 0\n        self.obs = self.env.reset(**kwargs)\n          \n        return self.obs\n    \n    def step(self, action):\n        # predict the next state on a random model\n        obs = self.model_func(self.obs, [np.squeeze(action)], np.random.randint(0,self.number_models))\n        rew = self.reward_func(self.obs, [np.squeeze(action)])\n        done = self.done_func(obs)\n        \n        self.len_episode += 1\n\n        if self.len_episode >= 990:\n          done = True\n        \n        self.obs = obs\n        \n        return self.obs, rew, done, \"\"\n\nclass StructEnv(gym.Wrapper):\n    '''\n    Gym Wrapper to store information like number of steps and total reward of the last espisode.\n    '''\n    def __init__(self, env):\n        gym.Wrapper.__init__(self, env)\n        self.n_obs = self.env.reset()\n        self.total_rew = 0\n        self.len_episode = 0\n\n    def 
reset(self, **kwargs):\n        self.n_obs = self.env.reset(**kwargs)\n        self.total_rew = 0\n        self.len_episode = 0\n        return self.n_obs.copy()\n        \n    def step(self, action):\n        ob, reward, done, info = self.env.step(action)\n        self.total_rew += reward\n        self.len_episode += 1\n        return ob, reward, done, info\n\n    def get_episode_reward(self):\n        return self.total_rew\n\n    def get_episode_length(self):\n        return self.len_episode\n\ndef pendulum_done(ob):\n  return np.abs(np.arcsin(np.squeeze(ob[3]))) > .2\n\ndef pendulum_reward(ob, ac):\n  return 1\n\n\ndef restore_model(old_model_variables, m_variables):    \n    # variable used as index for restoring the actor's parameters\n    it_v2 = tf.Variable(0, trainable=False)\n    restore_m_params = []\n    \n    for m_v in m_variables:\n        upd_m_rsh = tf.reshape(old_model_variables[it_v2 : it_v2+tf.reduce_prod(m_v.shape)], shape=m_v.shape)\n        restore_m_params.append(m_v.assign(upd_m_rsh)) \n        it_v2 += tf.reduce_prod(m_v.shape)\n        \n    return tf.group(*restore_m_params)\n      \n      \ndef METRPO(env_name, hidden_sizes=[32], cr_lr=5e-3, num_epochs=50, gamma=0.99, lam=0.95, number_envs=1, \n        critic_iter=10, steps_per_env=100, delta=0.002, algorithm='TRPO', conj_iters=10, minibatch_size=1000,\n          mb_lr=0.0001, model_batch_size=512, simulated_steps=300, num_ensemble_models=2, model_iter=15):\n    '''\n    Model Ensemble Trust Region Policy Optimization\n\n    Parameters:\n    -----------\n    env_name: Name of the environment\n    hidden_sizes: list of the number of hidden units for each layer\n    cr_lr: critic learning rate\n    num_epochs: number of training epochs\n    gamma: discount factor\n    lam: lambda parameter for computing the GAE\n    number_envs: number of \"parallel\" synchronous environments\n        # NB: it isn't distributed across multiple CPUs\n    critic_iter: NUmber of SGD iterations on the critic per epoch\n    steps_per_env: number of steps per environment\n            # NB: the total number of steps per epoch will be: steps_per_env*number_envs\n    delta: Maximum KL divergence between two policies. Scalar value\n    algorithm: type of algorithm. 
Either 'TRPO' or 'NPO'\n    conj_iters: number of conjugate gradient iterations\n    minibatch_size: Batch size used to train the critic\n    mb_lr: learning rate of the environment model\n    model_batch_size: batch size of the environment model\n    simulated_steps: number of simulated steps for each policy update\n    num_ensemble_models: number of models\n    model_iter: number of iterations without improvement before stopping training the model\n    '''\n    # TODO: add ME-TRPO hyperparameters\n\n    tf.reset_default_graph()\n\n    # Create a few environments to collect the trajectories\n    envs = [StructEnv(gym.make(env_name)) for _ in range(number_envs)]\n    env_test = gym.make(env_name)\n    #env_test = gym.wrappers.Monitor(env_test, \"VIDEOS/\", force=True, video_callable=lambda x: x%10 == 0)\n\n    low_action_space = envs[0].action_space.low\n    high_action_space = envs[0].action_space.high\n\n    obs_dim = envs[0].observation_space.shape\n    act_dim = envs[0].action_space.shape[0]\n    \n    print(envs[0].action_space, envs[0].observation_space)\n\n    # Placeholders\n    act_ph = tf.placeholder(shape=(None,act_dim), dtype=tf.float32, name='act')\n    obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')\n    # NEW\n    nobs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='nobs')\n    ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')\n    adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='adv')\n    old_p_log_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_p_log')\n    old_mu_ph = tf.placeholder(shape=(None, act_dim), dtype=tf.float32, name='old_mu')\n    old_log_std_ph = tf.placeholder(shape=(act_dim), dtype=tf.float32, name='old_log_std')\n    p_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='p_ph')\n\n    # result of the conjugate gradient algorithm\n    cg_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='cg')\n    \n    #########################################################\n    ######################## POLICY #########################\n    #########################################################\n\n    old_model_variables = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_model_variables')\n        \n    # Neural network that represent the policy\n    with tf.variable_scope('actor_nn'):\n        p_means = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)\n        p_means = tf.clip_by_value(p_means, low_action_space, high_action_space)\n        log_std = tf.get_variable(name='log_std', initializer=np.ones(act_dim, dtype=np.float32))\n\n    # Neural network that represent the value function\n    with tf.variable_scope('critic_nn'):\n        s_values = mlp(obs_ph, hidden_sizes, 1, tf.tanh, last_activation=None)\n        s_values = tf.squeeze(s_values)    \n\n\n    # Add \"noise\" to the predicted mean following the Gaussian distribution with standard deviation e^(log_std)\n    p_noisy = p_means + tf.random_normal(tf.shape(p_means), 0, 1) * tf.exp(log_std)\n    # Clip the noisy actions\n    a_sampl = tf.clip_by_value(p_noisy, low_action_space, high_action_space)\n    # Compute the gaussian log likelihood\n    p_log = gaussian_log_likelihood(act_ph, p_means, log_std)\n\n    # Measure the divergence\n    diverg = tf.reduce_mean(tf.exp(old_p_log_ph - p_log))\n    \n    # ratio\n    ratio_new_old = tf.exp(p_log - old_p_log_ph)\n    # TRPO surrogate loss function\n    p_loss = - tf.reduce_mean(ratio_new_old * adv_ph)\n\n    # MSE loss 
function\n    v_loss = tf.reduce_mean((ret_ph - s_values)**2)\n    # Critic optimization\n    v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)\n\n    def variables_in_scope(scope):\n        # get all trainable variables in 'scope'\n        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)\n    \n    # Gather and flatten the actor parameters\n    p_variables = variables_in_scope('actor_nn')\n    p_var_flatten = flatten_list(p_variables)\n\n    # Gradient of the policy loss with respect to the actor parameters\n    p_grads = tf.gradients(p_loss, p_variables)\n    p_grads_flatten = flatten_list(p_grads)\n\n    ########### RESTORE ACTOR PARAMETERS ###########\n    p_old_variables = tf.placeholder(shape=(None,), dtype=tf.float32, name='p_old_variables')\n    # variable used as index for restoring the actor's parameters\n    it_v1 = tf.Variable(0, trainable=False)\n    restore_params = []\n\n    for p_v in p_variables:\n        upd_rsh = tf.reshape(p_old_variables[it_v1 : it_v1+tf.reduce_prod(p_v.shape)], shape=p_v.shape)\n        restore_params.append(p_v.assign(upd_rsh)) \n        it_v1 += tf.reduce_prod(p_v.shape)\n\n    restore_params = tf.group(*restore_params)\n\n    # gaussian KL divergence of the two policies \n    dkl_diverg = gaussian_DKL(old_mu_ph, old_log_std_ph, p_means, log_std) \n\n    # Jacobian of the KL divergence (Needed for the Fisher matrix-vector product)\n    dkl_diverg_grad = tf.gradients(dkl_diverg, p_variables) \n\n    dkl_matrix_product = tf.reduce_sum(flatten_list(dkl_diverg_grad) * p_ph)\n    print('dkl_matrix_product', dkl_matrix_product.shape)\n    # Fisher vector product\n    # The Fisher-vector product is a way to compute the A matrix without the need of the full A\n    Fx = flatten_list(tf.gradients(dkl_matrix_product, p_variables))\n\n    ## Step length\n    beta_ph = tf.placeholder(shape=(), dtype=tf.float32, name='beta')\n    # NPG update\n    npg_update = beta_ph * cg_ph\n    \n    ## alpha is found through line search\n    alpha = tf.Variable(1., trainable=False)\n    # TRPO update\n    trpo_update = alpha * npg_update\n\n    ####################   POLICY UPDATE  ###################\n    # variable used as an index\n    it_v = tf.Variable(0, trainable=False)\n    p_opt = []\n    # Apply the updates to the policy\n    for p_v in p_variables:\n        upd_rsh = tf.reshape(trpo_update[it_v : it_v+tf.reduce_prod(p_v.shape)], shape=p_v.shape)\n        p_opt.append(p_v.assign_sub(upd_rsh))\n        it_v += tf.reduce_prod(p_v.shape)\n\n    p_opt = tf.group(*p_opt)\n        \n\n    #########################################################\n    ######################### MODEL #########################\n    #########################################################\n     \n    m_opts = []\n    m_losses = []\n    \n    nobs_pred_m = []\n    act_obs = tf.concat([obs_ph, act_ph], 1)\n    # computational graph of N models\n    for i in range(num_ensemble_models):\n        with tf.variable_scope('model_'+str(i)+'_nn'):\n            nobs_pred = mlp(act_obs, [64, 64], obs_dim[0], tf.nn.relu, last_activation=None)\n            nobs_pred_m.append(nobs_pred)\n        \n        m_loss = tf.reduce_mean((nobs_ph - nobs_pred)**2)\n        m_losses.append(m_loss)\n\n        m_opts.append(tf.train.AdamOptimizer(mb_lr).minimize(m_loss))\n\n      \n    ##################### RESTORE MODEL ######################\n    initialize_models = []\n    models_variables = []\n    for i in range(num_ensemble_models):\n      m_variables = 
variables_in_scope('model_'+str(i)+'_nn')\n      initialize_models.append(restore_model(old_model_variables, m_variables))\n\n      models_variables.append(flatten_list(m_variables))\n\n    \n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, now.second)\n    print('Time:', clock_time)\n\n\n    # Set scalars and hisograms for TensorBoard\n    tf.summary.scalar('p_loss', p_loss, collections=['train'])\n    tf.summary.scalar('v_loss', v_loss, collections=['train'])\n    tf.summary.scalar('p_divergence', diverg, collections=['train'])\n    tf.summary.scalar('ratio_new_old',tf.reduce_mean(ratio_new_old), collections=['train'])\n    tf.summary.scalar('dkl_diverg', dkl_diverg, collections=['train'])\n    tf.summary.scalar('alpha', alpha, collections=['train'])\n    tf.summary.scalar('beta', beta_ph, collections=['train'])\n    tf.summary.scalar('p_std_mn', tf.reduce_mean(tf.exp(log_std)), collections=['train'])\n    tf.summary.scalar('s_values_mn', tf.reduce_mean(s_values), collections=['train'])\n    tf.summary.histogram('p_log', p_log, collections=['train'])\n    tf.summary.histogram('p_means', p_means, collections=['train'])\n    tf.summary.histogram('s_values', s_values, collections=['train'])\n    tf.summary.histogram('adv_ph',adv_ph, collections=['train'])\n    tf.summary.histogram('log_std',log_std, collections=['train'])\n    scalar_summary = tf.summary.merge_all('train')\n\n    tf.summary.scalar('old_v_loss', v_loss, collections=['pre_train'])\n    tf.summary.scalar('old_p_loss', p_loss, collections=['pre_train'])\n    pre_scalar_summary = tf.summary.merge_all('pre_train')\n\n    hyp_str = '-spe_'+str(steps_per_env)+'-envs_'+str(number_envs)+'-cr_lr'+str(cr_lr)+'-crit_it_'+str(critic_iter)+'-delta_'+str(delta)+'-conj_iters_'+str(conj_iters)\n    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/'+algorithm+'_'+clock_time+'_'+hyp_str, tf.get_default_graph())\n    \n    # create a session\n    sess = tf.Session()\n    # initialize the variables\n    sess.run(tf.global_variables_initializer())\n    \n    def action_op(o):\n        return sess.run([p_means, s_values], feed_dict={obs_ph:o})\n\n    def action_op_noise(o):\n        return sess.run([a_sampl, s_values], feed_dict={obs_ph:o})\n\n    def model_op(o, a, md_idx):\n        mo = sess.run(nobs_pred_m[md_idx], feed_dict={obs_ph:[o], act_ph:[a]})\n        return np.squeeze(mo)\n      \n    def run_model_loss(model_idx, r_obs, r_act, r_nxt_obs):\n        return sess.run(m_losses[model_idx], feed_dict={obs_ph:r_obs, act_ph:r_act, nobs_ph:r_nxt_obs})\n      \n    def run_model_opt_loss(model_idx, r_obs, r_act, r_nxt_obs):\n        return sess.run([m_opts[model_idx], m_losses[model_idx]], feed_dict={obs_ph:r_obs, act_ph:r_act, nobs_ph:r_nxt_obs})      \n      \n    def model_assign(i, model_variables_to_assign):\n        '''\n        Update the i-th model's parameters\n        '''\n        return sess.run(initialize_models[i], feed_dict={old_model_variables:model_variables_to_assign})\n    \n    def policy_update(obs_batch, act_batch, adv_batch, rtg_batch):\n        # log probabilities, logits and log std of the \"old\" policy\n        # \"old\" policy refer to the policy to optimize and that has been used to sample from the environment\n\n        old_p_log, old_p_means, old_log_std = sess.run([p_log, p_means, log_std], feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})\n        # get also the \"old\" parameters\n        old_actor_params = 
sess.run(p_var_flatten)\n\n        # old_p_loss is later used in the line search\n        # run pre_scalar_summary for a summary before the optimization\n        old_p_loss, summary = sess.run([p_loss,pre_scalar_summary], feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})\n        file_writer.add_summary(summary, step_count)\n\n        def H_f(p):\n            '''\n            Run the Fisher-Vector product on 'p' to approximate the Hessian of the DKL\n            '''\n            return sess.run(Fx, feed_dict={old_mu_ph:old_p_means, old_log_std_ph:old_log_std, p_ph:p, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})\n\n        g_f = sess.run(p_grads_flatten, feed_dict={old_mu_ph:old_p_means,obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})\n        ## Compute the Conjugate Gradient so as to obtain an approximation of H^(-1)*g\n        # Where H in reality isn't the true Hessian of the KL divergence but an approximation of it computed via Fisher-Vector Product (F)\n        conj_grad = conjugate_gradient(H_f, g_f, iters=conj_iters)\n\n        # Compute the step length\n        beta_np = np.sqrt(2*delta / (1e-10 + np.sum(conj_grad * H_f(conj_grad))))\n        \n        def DKL(alpha_v):\n            '''\n            Compute the KL divergence.\n            It applies the candidate update, computes the DKL and the new loss, and then restores the old parameters.\n            '''\n            sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:alpha_v, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})\n            a_res = sess.run([dkl_diverg, p_loss], feed_dict={old_mu_ph:old_p_means, old_log_std_ph:old_log_std, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})\n            sess.run(restore_params, feed_dict={p_old_variables: old_actor_params})\n            return a_res\n\n        # Actor optimization step\n        # Backtracking line search to find the maximum alpha coefficient s.t. 
the constraint is valid\n        best_alpha = backtracking_line_search(DKL, delta, old_p_loss, p=0.8)\n        sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:best_alpha, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})\n\n        lb = len(obs_batch)\n        shuffled_batch = np.arange(lb)\n        np.random.shuffle(shuffled_batch)\n\n        # Value function optimization steps\n        for _ in range(critic_iter):\n            # shuffle the batch on every iteration\n            np.random.shuffle(shuffled_batch)\n            for idx in range(0,lb, minibatch_size):\n                minib = shuffled_batch[idx:min(idx+minibatch_size,lb)]\n                sess.run(v_opt, feed_dict={obs_ph:obs_batch[minib], ret_ph:rtg_batch[minib]})\n\n\n    def train_model(tr_obs, tr_act, tr_nxt_obs, v_obs, v_act, v_nxt_obs, step_count, model_idx):\n\n        # Get validation loss on the old model\n        mb_valid_loss1 = run_model_loss(model_idx, v_obs, v_act, v_nxt_obs)\n\n        # Restore the random weights to have a new, clean neural network\n        model_assign(model_idx, initial_variables_models[model_idx])\n\n        mb_valid_loss = run_model_loss(model_idx, v_obs, v_act, v_nxt_obs)\n\n        acc_m_losses = []\n        last_m_losses = []\n        md_params = sess.run(models_variables[model_idx])\n        best_mb = {'iter':0, 'loss':mb_valid_loss, 'params':md_params}\n        it = 0\n\n\n        lb = len(tr_obs)\n        shuffled_batch = np.arange(lb)\n        np.random.shuffle(shuffled_batch)\n\n        # keep training as long as the validation loss improved during the last model_iter iterations\n        while best_mb['iter'] > it - model_iter:\n            \n            # update the model on each mini-batch\n            last_m_losses = []\n            for idx in range(0, lb, model_batch_size):\n                minib = shuffled_batch[idx:min(idx+model_batch_size,lb)]\n\n                if len(minib) == model_batch_size:\n                    _, ml = run_model_opt_loss(model_idx, tr_obs[minib], tr_act[minib], tr_nxt_obs[minib])\n                    acc_m_losses.append(ml)\n                    last_m_losses.append(ml)\n                else:\n                    print('Warning! Skipping the last mini-batch of size', len(minib))\n\n            # Check if the loss on the validation set has improved\n            mb_valid_loss = run_model_loss(model_idx, v_obs, v_act, v_nxt_obs)\n            if mb_valid_loss < best_mb['loss']:\n                best_mb['loss'] = mb_valid_loss\n                best_mb['iter'] = it\n                best_mb['params'] = sess.run(models_variables[model_idx])\n\n            it += 1\n\n        # Restore the model with the lowest validation loss\n        model_assign(model_idx, best_mb['params'])\n\n        print('Model:{}, iter:{} -- Old Val loss:{:.6f}  New Val loss:{:.6f} -- New Train loss:{:.6f}'.format(model_idx, it, mb_valid_loss1, best_mb['loss'], np.mean(last_m_losses)))\n        summary = tf.Summary()\n        summary.value.add(tag='supplementary/m_loss', simple_value=np.mean(acc_m_losses))\n        summary.value.add(tag='supplementary/iterations', simple_value=it)\n        file_writer.add_summary(summary, step_count)\n        file_writer.flush()\n    \n    # variable to store the total number of steps\n    step_count = 0\n    model_buffer = FullBuffer()\n    print('Env batch size:',steps_per_env, ' Batch size:',steps_per_env*number_envs)\n\n    # Create a simulated environment\n    sim_env = NetworkEnv(gym.make(env_name), model_op, pendulum_reward, pendulum_done, num_ensemble_models)\n    \n    # Get the initial parameters of each model\n    # These are used in later 
epochs when we aim to re-train the models anew with the new dataset\n    initial_variables_models = []\n    for model_var in models_variables:\n        initial_variables_models.append(sess.run(model_var))\n\n    for ep in range(num_epochs):\n        # lists to store the rewards and lengths of the completed trajectories\n        batch_rew = []\n        batch_len = []\n        print('============================', ep, '============================')\n        # Run the environments serially, temporarily storing the trajectories\n        for env in envs:\n            # sample a random standard deviation for the exploration noise\n            init_log_std = np.ones(act_dim) * np.log(np.random.rand())\n            env.reset()\n            \n            # iterate over a fixed number of steps\n            for _ in range(steps_per_env):\n                # run the policy\n                \n                if ep == 0:\n                    # Sample random actions during the first epoch\n                    act = env.action_space.sample()\n                else:\n                    act = sess.run(a_sampl, feed_dict={obs_ph:[env.n_obs], log_std:init_log_std})\n                    \n                    \n                act = np.squeeze(act)\n\n                # take a step in the environment\n                obs2, rew, done, _ = env.step(np.array([act]))\n\n                # add the new transition to the temporary buffer\n                model_buffer.store(env.n_obs.copy(), act, rew, obs2.copy(), done)\n\n                env.n_obs = obs2.copy()\n                step_count += 1\n\n                if done:\n                    batch_rew.append(env.get_episode_reward())\n                    batch_len.append(env.get_episode_length())\n\n                    env.reset()\n                    init_log_std = np.ones(act_dim) * np.log(np.random.rand())\n\n                    \n        print('Ep:%d Rew:%.2f -- Step:%d' % (ep, np.mean(batch_rew), step_count))\n        \n        ############################################################\n        ###################### MODEL LEARNING ######################\n        ############################################################\n        \n        # Randomly generate the training and validation datasets\n        model_buffer.generate_random_dataset()\n\n        # get both datasets\n        train_obs, train_act, _, train_nxt_obs, _ = model_buffer.get_training_batch()\n        valid_obs, valid_act, _, valid_nxt_obs, _ = model_buffer.get_valid_batch()\n            \n        print('Log Std policy:', sess.run(log_std))\n        for i in range(num_ensemble_models):\n            \n            # train the dynamics model on the datasets just sampled\n            train_model(train_obs, train_act, train_nxt_obs, valid_obs, valid_act, valid_nxt_obs, step_count, i)\n\n        ############################################################\n        ###################### POLICY LEARNING ######################\n        ############################################################\n\n        best_sim_test = np.zeros(num_ensemble_models)\n        for it in range(80):\n            print('\\t Policy it', it, end='.. 
')\n            ##################### MODEL SIMULATION #####################\n            obs_batch, act_batch, adv_batch, rtg_batch = simulate_environment(sim_env, action_op_noise, simulated_steps)\n            \n            ################# TRPO UPDATE ################\n            policy_update(obs_batch, act_batch, adv_batch, rtg_batch)\n            \n            # Test the policy on the real environment\n            mn_test = test_agent(env_test, action_op, num_games=10)[0]\n            print(' Test score: ', np.round(mn_test, 2))\n            \n            summary = tf.Summary()\n            summary.value.add(tag='test/performance', simple_value=mn_test)\n            file_writer.add_summary(summary, step_count)\n            file_writer.flush()\n            \n            # Test the policy on the simulated environments\n            if (it+1) % 5 == 0:\n                print('Simulated test:', end=' -- ')\n                sim_rewards = []\n\n                for i in range(num_ensemble_models):\n                    sim_m_env = NetworkEnv(gym.make(env_name), model_op, pendulum_reward, pendulum_done, i+1)\n                    mn_sim_rew, _ = test_agent(sim_m_env, action_op, num_games=5)\n                    sim_rewards.append(mn_sim_rew)\n                    print(mn_sim_rew, end=' -- ')\n\n                print(\"\")\n                sim_rewards = np.array(sim_rewards)\n                # stop training if the policy hasn't improved\n                if (np.sum(best_sim_test >= sim_rewards) > int(num_ensemble_models*0.7)) \\\n                    or (len(sim_rewards[sim_rewards >= 990]) > int(num_ensemble_models*0.7)):\n                    break\n                else:\n                    best_sim_test = sim_rewards\n\n\n    # close the environments\n    for env in envs:\n        env.close()\n    file_writer.close()\n\nif __name__ == '__main__':\n    METRPO('RoboschoolInvertedPendulum-v1', hidden_sizes=[32,32], cr_lr=1e-3, gamma=0.99, lam=0.95, num_epochs=7, steps_per_env=300, \n        number_envs=1, critic_iter=10, delta=0.01, algorithm='TRPO', conj_iters=10, minibatch_size=5000,\n        mb_lr=0.00001, model_batch_size=50, simulated_steps=50000, num_ensemble_models=10, model_iter=15)
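\n\n# The following sanity check is an editor's sketch, not part of the original training code.\ndef _cg_sanity_check():\n    '''\n    Quick check of the conjugate_gradient helper used in policy_update: for a\n    small symmetric positive-definite H, conjugate gradient should recover\n    x = H^(-1)*g, matching np.linalg.solve. It assumes the signature\n    conjugate_gradient(Ax_fn, b, iters) seen in the call above.\n    '''\n    H = np.array([[4.0, 1.0], [1.0, 3.0]])\n    g = np.array([1.0, 2.0])\n    x = conjugate_gradient(lambda p: H @ p, g, iters=10)\n    assert np.allclose(x, np.linalg.solve(H, g), atol=1e-4)"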
  },
  {
    "path": "Chapter10/DAgger.py",
    "content": "import numpy as np \nimport tensorflow as tf\nfrom datetime import datetime\nimport time\nfrom ple.games.flappybird import FlappyBird\nfrom ple import PLE\n\n\ndef mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n    return tf.layers.dense(x, units=output_layer, activation=last_activation)\n\ndef flappy_to_list(fd):\n    '''\n    Return the state dictionary as a list\n    '''\n    return fd['player_y'], fd['player_vel'], fd['next_pipe_dist_to_player'], fd['next_pipe_top_y'], \\\n            fd['next_pipe_bottom_y'], fd['next_next_pipe_dist_to_player'], fd['next_next_pipe_top_y'], \\\n            fd['next_next_pipe_bottom_y']\n\ndef flappy_game_state(bol):\n    '''\n    Normalize the game state\n    '''\n    stat = flappy_to_list(bol.getGameState())\n    stat = (np.array(stat, dtype=np.float32) / 300.0) - 0.5\n    return stat\n\ndef no_op(env, n_act=5):\n    for _ in range(n_act):\n        env.act(119 if np.random.randn() < 0.5 else None)\n\n\ndef expert():\n    '''\n    Load the computational graph and pretarined weights of the expert\n    '''\n    graph = tf.get_default_graph()\n\n    sess_expert = tf.Session(graph=graph)\n\n    saver = tf.train.import_meta_graph('expert/model.ckpt.meta')\n    saver.restore(sess_expert,tf.train.latest_checkpoint('expert/'))\n    \n    p_argmax = graph.get_tensor_by_name('actor_nn/max_act:0') \n    obs_ph = graph.get_tensor_by_name('obs:0') \n\n    def expert_policy(state):\n        act = sess_expert.run(p_argmax, feed_dict={obs_ph:[state]})\n        return np.squeeze(act)\n\n    return expert_policy\n\ndef test_agent(policy, file_writer=None, test_games=10, step=0):\n    game = FlappyBird()\n    env = PLE(game, fps=30, display_screen=False)\n    env.init()\n\n    test_rewards = []\n    for _ in range(test_games):\n        env.reset_game()\n        no_op(env)\n\n        game_rew = 0\n\n        while not env.game_over():\n\n            state = flappy_game_state(env)\n\n            action = 119 if policy(state) == 1 else None\n\n            for _ in range(2):\n                game_rew += env.act(action)\n\n        test_rewards.append(game_rew)\n\n        if file_writer is not None:\n            summary = tf.Summary()\n            summary.value.add(tag='test_performance', simple_value=game_rew)\n            file_writer.add_summary(summary, step)\n            file_writer.flush()\n\n    return test_rewards\n\n\ndef DAgger(hidden_sizes=[32,32], dagger_iterations=20, p_lr=1e-3, step_iterations=1000, batch_size=128, train_epochs=20, obs_dim=8, act_dim=2):\n\n    tf.reset_default_graph()\n\n    ############################## EXPERT ###############################\n    # load the expert and return a function that predict the expert action given a state\n    expert_policy = expert()     \n    print('Expert performance: ', np.mean(test_agent(expert_policy)))\n\n\n    #################### LEARNER COMPUTATIONAL GRAPH ####################\n    obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32, name='obs')\n    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')\n\n    # Multi-layer perceptron\n    p_logits = mlp(obs_ph, hidden_sizes, act_dim, tf.nn.relu, last_activation=None)\n        \n    act_max = tf.math.argmax(p_logits, axis=1)\n    act_onehot = tf.one_hot(act_ph, depth=act_dim)\n\n    # softmax cross entropy loss\n    p_loss = 
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=act_onehot, logits=p_logits))\n    # Adam optimizer\n    p_opt = tf.train.AdamOptimizer(p_lr).minimize(p_loss)\n\n\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, now.second)\n    file_writer = tf.summary.FileWriter('log_dir/FlappyBird/DAgger_'+clock_time, tf.get_default_graph())\n\n    sess = tf.Session()\n    sess.run(tf.global_variables_initializer())\n    \n    def learner_policy(state):\n        action = sess.run(act_max, feed_dict={obs_ph:[state]})\n        return np.squeeze(action)\n\n    X = []\n    y = []\n\n    env = FlappyBird()\n\n    env = PLE(env, fps=30, display_screen=False)\n    env.init()    \n\n    #################### DAgger iterations ####################\n    \n    for it in range(dagger_iterations):\n        # re-initialize the learner so that it is retrained from scratch on the aggregated dataset\n        sess.run(tf.global_variables_initializer())\n        env.reset_game()\n        no_op(env)\n\n        game_rew = 0\n        rewards = []\n\n        ###################### Populate the dataset #####################\n\n        for _ in range(step_iterations):\n            # get the current state from the environment\n            state = flappy_game_state(env)\n\n            # As the iterations proceed, use more and more actions sampled from the learner\n            if np.random.rand() < (1 - it/5):\n                action = expert_policy(state)\n            else:\n                action = learner_policy(state)\n\n            action = 119 if action == 1 else None\n\n            rew = env.act(action)\n            rew += env.act(action)\n\n            # Add the state and the expert action to the dataset\n            X.append(state)\n            y.append(expert_policy(state))\n\n            game_rew += rew\n\n            # Whenever the game stops, reset the environment and reinitialize the variables\n            if env.game_over():\n                env.reset_game()\n                no_op(env)\n\n                rewards.append(game_rew)\n                game_rew = 0\n\n        ##################### Training #####################\n\n        # Calculate the number of minibatches\n        n_batches = int(np.floor(len(X)/batch_size))\n\n        # shuffle the dataset\n        shuffle = np.arange(len(X))\n        np.random.shuffle(shuffle)\n\n        \n        shuffled_X = np.array(X)[shuffle]\n        shuffled_y = np.array(y)[shuffle]\n        \n        \n        for _ in range(train_epochs):\n            ep_loss = []\n            # Train the model on each minibatch in the dataset\n            for b in range(n_batches):\n                p_start = b*batch_size\n\n                # mini-batch training\n                tr_loss, _ = sess.run([p_loss, p_opt], feed_dict={\n                                obs_ph:shuffled_X[p_start:p_start+batch_size], \n                                act_ph:shuffled_y[p_start:p_start+batch_size]})\n\n                ep_loss.append(tr_loss)\n            \n        agent_tests = test_agent(learner_policy, file_writer, step=len(X))\n\n        print('Ep:', it, np.mean(ep_loss), 'Test:', np.mean(agent_tests))\n\n\n    \n\nif __name__ == \"__main__\":\n    DAgger(hidden_sizes=[16,16], dagger_iterations=10, p_lr=1e-4, step_iterations=100, batch_size=50, train_epochs=2000)
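\n\n# The following helper is an editor's sketch (hypothetical, unused above).\ndef _dagger_mixing_probability(it, decay_iters=5):\n    '''\n    Probability of querying the expert at DAgger iteration 'it', mirroring the\n    schedule np.random.rand() < (1 - it/5) used in the data-collection loop:\n    the expert is always queried at iteration 0 and never from iteration 5 on.\n    '''\n    return max(0.0, 1.0 - it / decay_iters)"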
  },
  {
    "path": "Chapter10/expert/checkpoint",
    "content": "model_checkpoint_path: \"model.ckpt\"\nall_model_checkpoint_paths: \"model.ckpt\"\n"
  },
  {
    "path": "Chapter11/ES.py",
    "content": "import numpy as np \nimport tensorflow as tf\nfrom datetime import datetime\nimport time\nimport gym\n\nimport multiprocessing as mp\nimport scipy.stats as ss\nimport contextlib\nimport numpy as np\n\n@contextlib.contextmanager\ndef temp_seed(seed):\n    state = np.random.get_state()\n    np.random.seed(seed)\n    try:\n        yield\n    finally:\n        np.random.set_state(state)\n\ndef mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n        \n    return tf.layers.dense(x, units=output_layer, activation=last_activation)\n\n\ndef test_agent(env_test, agent_op, num_games=1):\n    '''\n    Test an agent 'agent_op', 'num_games' times\n    Return mean and std\n    '''\n    games_r = []\n    steps = 0\n    for _ in range(num_games):\n        d = False\n        game_r = 0\n        o = env_test.reset()\n\n        while not d:\n            a_s = agent_op(o)\n            o, r, d, _ = env_test.step(a_s)\n            game_r += r\n            steps += 1\n\n        games_r.append(game_r)\n    return games_r, steps\n\n\ndef worker(env_name, initial_seed, hidden_sizes, lr, std_noise, indiv_per_worker, worker_name, params_queue, output_queue):\n\n    env = gym.make(env_name)\n    obs_dim = env.observation_space.shape[0]\n    act_dim = env.action_space.shape[0]\n\n    import tensorflow as tf\n\n    # set an initial seed common to all the workers\n    tf.random.set_random_seed(initial_seed)\n    np.random.seed(initial_seed)\n    \n\n    with tf.device(\"/cpu:\" + worker_name):\n        \n        obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32, name='obs_ph')\n        new_weights_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='new_weights_ph')\n        \n        def variables_in_scope(scope):\n            # get all trainable variables in 'scope'\n            return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)\n\n        with tf.variable_scope('nn_' + worker_name):\n            acts = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)\n\n        agent_variables = variables_in_scope('nn_' + worker_name)\n        agent_variables_flatten = flatten_list(agent_variables)\n\n        # Update the agent parameters with new weights new_weights_ph\n        it_v1 = tf.Variable(0, trainable=False)\n        update_weights = []\n        for a_v in agent_variables:\n            upd_rsh = tf.reshape(new_weights_ph[it_v1 : it_v1+tf.reduce_prod(a_v.shape)], shape=a_v.shape)\n            update_weights.append(a_v.assign(upd_rsh))\n            it_v1 += tf.reduce_prod(a_v.shape)\n\n\n        # Reshape the new_weights_ph following the neural network shape\n        it_v2 = tf.Variable(0, trainable=False)\n        vars_grads_list = []\n        for a_v in agent_variables:\n            vars_grads_list.append(tf.reshape(new_weights_ph[it_v2 : it_v2+tf.reduce_prod(a_v.shape)], shape=a_v.shape))\n            it_v2 += tf.reduce_prod(a_v.shape)\n\n        # Create the optimizer\n        opt = tf.train.AdamOptimizer(lr)\n        # Apply the \"gradients\" using Adam\n        apply_g = opt.apply_gradients([(g, v) for g, v in zip(vars_grads_list, agent_variables)])\n        \n    def agent_op(o):\n        a = np.squeeze(sess.run(acts, feed_dict={obs_ph:[o]}))\n        return np.clip(a, env.action_space.low, env.action_space.high)\n\n\n    def evaluation_on_noise(noise):\n        '''\n        Evaluate 
the agent with the noise\n        ''' \n        # Get the original weights that will be restored after the evaluation\n        original_weights = sess.run(agent_variables_flatten)\n\n        # Update the weights of the agent/individual by adding the extra noise noise*STD_NOISE\n        sess.run(update_weights, feed_dict={new_weights_ph:original_weights + noise*std_noise})\n\n        # Test the agent with the new weights\n        rewards, steps = test_agent(env, agent_op)\n\n        # Restore the original weights\n        sess.run(update_weights, feed_dict={new_weights_ph:original_weights})\n\n        return np.mean(rewards), steps\n\n    config_proto = tf.ConfigProto(device_count={'CPU': 4}, allow_soft_placement=True)\n    sess = tf.Session(config=config_proto)\n    sess.run(tf.global_variables_initializer())\n\n\n    agent_flatten_shape = sess.run(agent_variables_flatten).shape\n\n    while True:\n\n        for _ in range(indiv_per_worker):\n            seed = np.random.randint(1e7)\n\n            with temp_seed(seed):\n                # sample, for each weight of the agent, from a normal distribution\n                sampled_noise = np.random.normal(size=agent_flatten_shape)\n            \n            # Mirrored sampling\n            pos_rew, stp1 = evaluation_on_noise(sampled_noise)\n            neg_rew, stp2 = evaluation_on_noise(-sampled_noise)\n\n            # Put the returns and seeds on the queue\n            # Note that here we are just sending the seed (a scalar value), not the complete perturbation sampled_noise\n            output_queue.put([[pos_rew, neg_rew], seed, stp1+stp2])\n\n        # Get all the returns and seeds from the other workers\n        batch_return, batch_seed = params_queue.get()\n\n        batch_noise = []\n        for seed in batch_seed:\n\n            # reconstruct the perturbations from the seed\n            with temp_seed(seed):\n                sampled_noise = np.random.normal(size=agent_flatten_shape)\n\n            batch_noise.append(sampled_noise)\n            batch_noise.append(-sampled_noise)\n            \n\n        # Compute the stochastic gradient estimate\n        vars_grads = np.zeros(agent_flatten_shape)\n        for n, r in zip(batch_noise, batch_return):\n            vars_grads += n * r\n        vars_grads /= len(batch_noise) * std_noise\n\n        # run an Adam optimization step on the gradient estimate just computed\n        # (negated, because we want to ascend the return while Adam minimizes)\n        sess.run(apply_g, feed_dict={new_weights_ph:-vars_grads})\n\n\ndef normalized_rank(rewards):\n    '''\n    Rank the rewards and normalize them.\n    '''\n    ranked = ss.rankdata(rewards)\n    norm = (ranked - 1) / (len(ranked) - 1)\n    norm -= 0.5\n    return norm\n\n\ndef flatten(tensor):\n    '''\n    Flatten a tensor\n    '''\n    return tf.reshape(tensor, shape=(-1,))\n\ndef flatten_list(tensor_list):\n    '''\n    Flatten a list of tensors\n    '''\n    return tf.concat([flatten(t) for t in tensor_list], axis=0)\n\n\n\ndef ES(env_name, hidden_sizes=[8,8], number_iter=1000, num_workers=4, lr=0.01, indiv_per_worker=10, std_noise=0.01):\n\n\n    initial_seed = np.random.randint(1e7)\n\n    # Create a queue for the output values (individual returns and seed values)\n    output_queue = mp.Queue(maxsize=num_workers*indiv_per_worker)\n    # Create a queue for the input parameters (batch returns and batch seeds)\n    params_queue = mp.Queue(maxsize=num_workers)\n\n\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, now.second)\n    hyp_str = 
'-numworkers_'+str(num_workers)+'-lr_'+str(lr)\n    file_writer = tf.summary.FileWriter('log_dir/'+env_name+'/'+clock_time+'_'+hyp_str, tf.get_default_graph())\n    \n    processes = []\n    # Create a parallel process for each worker\n    for widx in range(num_workers):\n        p = mp.Process(target=worker, args=(env_name, initial_seed, hidden_sizes, lr, std_noise, indiv_per_worker, str(widx), params_queue, output_queue))\n        p.start()\n        processes.append(p)\n\n    tot_steps = 0\n    # Iterate over all the training iterations\n    for n_iter in range(number_iter):\n\n        batch_seed = []\n        batch_return = []\n        \n        # Wait until enough candidate individuals have been evaluated\n        for _ in range(num_workers*indiv_per_worker):\n            p_rews, p_seed, p_steps = output_queue.get()\n\n            batch_seed.append(p_seed)\n            batch_return.extend(p_rews)\n            tot_steps += p_steps\n\n        print('Iter: {} Reward: {:.2f}'.format(n_iter, np.mean(batch_return)))\n\n        # Save the population's performance\n        summary = tf.Summary()\n        for r in batch_return:\n            summary.value.add(tag='performance', simple_value=r)\n        file_writer.add_summary(summary, tot_steps)\n        file_writer.flush()\n\n        # Rank and normalize the returns\n        batch_return = normalized_rank(batch_return)\n\n        # Put all the returns and seeds on the queue so that each worker can optimize its copy of the neural network\n        for _ in range(num_workers):\n            params_queue.put([batch_return, batch_seed])\n    \n    # terminate all the workers\n    for p in processes:\n        p.terminate()\n\n\n        \nif __name__ == '__main__':\n    ES('LunarLanderContinuous-v2', hidden_sizes=[32,32], number_iter=200, num_workers=4, lr=0.02, indiv_per_worker=12, std_noise=0.05)
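\n\n# The following demo is an editor's sketch, not called by the training code.\ndef _seed_reconstruction_demo():\n    '''\n    The workers exchange only scalar seeds instead of full perturbation\n    vectors. This shows why that works: re-seeding NumPy inside temp_seed\n    reproduces exactly the same noise vector.\n    '''\n    seed = 12345\n    with temp_seed(seed):\n        noise_a = np.random.normal(size=10)\n    with temp_seed(seed):\n        noise_b = np.random.normal(size=10)\n    assert np.array_equal(noise_a, noise_b)\n"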
  },
  {
    "path": "Chapter12/ESBAS.py",
    "content": "import numpy as np \nimport tensorflow as tf\nimport gym\nfrom datetime import datetime\nfrom collections import deque\nimport time\nimport sys\n\n\ngym.logger.set_level(40)\n\ncurrent_milli_time = lambda: int(round(time.time() * 1000))\n    \n\ndef mlp(x, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):\n    '''\n    Multi-layer perceptron\n    '''\n    for l in hidden_layers:\n        x = tf.layers.dense(x, units=l, activation=activation)\n        \n    return tf.layers.dense(x, units=output_layer, activation=last_activation)\n\nclass ExperienceBuffer():\n    '''\n    Experience Replay Buffer\n    '''\n    def __init__(self, buffer_size):\n        self.obs_buf = deque(maxlen=buffer_size)\n        self.rew_buf = deque(maxlen=buffer_size)\n        self.act_buf = deque(maxlen=buffer_size)\n        self.obs2_buf = deque(maxlen=buffer_size)\n        self.done_buf = deque(maxlen=buffer_size)\n\n\n    def add(self, obs, rew, act, obs2, done):\n        # Add a new transition to the buffers\n        self.obs_buf.append(obs)\n        self.rew_buf.append(rew)\n        self.act_buf.append(act)\n        self.obs2_buf.append(obs2)\n        self.done_buf.append(done)\n        \n\n    def sample_minibatch(self, batch_size):\n        # Sample a minibatch of size batch_size\n        mb_indices = np.random.randint(len(self.obs_buf), size=batch_size)\n\n        mb_obs = [self.obs_buf[i] for i in mb_indices]\n        mb_rew = [self.rew_buf[i] for i in mb_indices]\n        mb_act = [self.act_buf[i] for i in mb_indices]\n        mb_obs2 = [self.obs2_buf[i] for i in mb_indices]\n        mb_done = [self.done_buf[i] for i in mb_indices]\n\n        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done\n\n    def __len__(self):\n        return len(self.obs_buf)\n\n\ndef q_target_values(mini_batch_rw, mini_batch_done, av, discounted_value):   \n    '''\n    Calculate the target value y for each transition\n    '''\n    max_av = np.max(av, axis=1)\n    \n    # if episode terminate, y take value r\n    # otherwise, q-learning step\n    ys = []\n    for r, d, av in zip(mini_batch_rw, mini_batch_done, max_av):\n        if d:\n            ys.append(r)\n        else:\n            q_step = r + discounted_value * av\n            ys.append(q_step)\n    \n    assert len(ys) == len(mini_batch_rw)\n    return ys\n\ndef greedy(action_values):\n    '''\n    Greedy policy\n    '''\n    return np.argmax(action_values)\n\ndef eps_greedy(action_values, eps=0.1):\n    '''\n    Eps-greedy policy\n    '''\n    if np.random.uniform(0,1) < eps:\n        # Choose a uniform random action\n        return np.random.randint(len(action_values))\n    else:\n        # Choose the greedy action\n        return np.argmax(action_values)\n\ndef test_agent(env_test, agent_op, num_games=20, summary=None):\n    '''\n    Test an agent\n    '''\n    games_r = []\n\n    for _ in range(num_games):\n        d = False\n        game_r = 0\n        o = env_test.reset()\n\n        while not d:\n            a = greedy(np.squeeze(agent_op(o)))\n            o, r, d, _ = env_test.step(a)\n\n            game_r += r\n\n        if summary is not None:\n            summary.value.add(tag='test_performance', simple_value=game_r)\n\n        games_r.append(game_r)\n\n    return games_r\n\n\nclass DQN_optimization:\n    def __init__(self, obs_dim, act_dim, hidden_layers, lr, discount):\n        self.obs_dim = obs_dim\n        self.act_dim = act_dim\n        self.hidden_layers = hidden_layers\n        self.lr = lr\n        self.discount = 
discount\n\n        self.__build_graph()\n\n\n    def __build_graph(self):\n        \n        self.g = tf.Graph()\n        with self.g.as_default():\n            # Create all the placeholders\n            self.obs_ph = tf.placeholder(shape=(None, self.obs_dim[0]), dtype=tf.float32, name='obs')\n            self.act_ph = tf.placeholder(shape=(None,), dtype=tf.int32, name='act')\n            self.y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')\n\n            # Create the target network\n            with tf.variable_scope('target_network'):\n                self.target_qv = mlp(self.obs_ph, self.hidden_layers, self.act_dim, tf.nn.relu, last_activation=None)\n            target_vars = tf.trainable_variables()\n\n            # Create the online network (i.e. the behavior policy)\n            with tf.variable_scope('online_network'):\n                self.online_qv = mlp(self.obs_ph, self.hidden_layers, self.act_dim, tf.nn.relu, last_activation=None)\n            train_vars = tf.trainable_variables()\n\n            # Update the target network by assigning to it the variables of the online network\n            # Note that the target network and the online network have the exact same architecture\n            update_target = [train_vars[i].assign(train_vars[i+len(target_vars)]) for i in range(len(train_vars) - len(target_vars))]\n            self.update_target_op = tf.group(*update_target)\n\n            # One-hot encoding of the action\n            act_onehot = tf.one_hot(self.act_ph, depth=self.act_dim)\n            # We are interested only in the Q-values of those actions\n            q_values = tf.reduce_sum(act_onehot * self.online_qv, axis=1)\n            \n            # MSE loss function\n            self.v_loss = tf.reduce_mean((self.y_ph - q_values)**2)\n            # Adam optimizer that minimizes the loss v_loss\n            self.v_opt = tf.train.AdamOptimizer(self.lr).minimize(self.v_loss)\n\n            self.__create_session()\n\n            # Copy the online network into the target network\n            self.sess.run(self.update_target_op)\n\n    def __create_session(self):\n        # open a session\n        self.sess = tf.Session(graph=self.g)\n        # and initialize all the variables\n        self.sess.run(tf.global_variables_initializer())\n    \n\n    def act(self, o):\n        '''\n        Forward pass through the online network to obtain the Q-values of a single observation\n        '''\n        return self.sess.run(self.online_qv, feed_dict={self.obs_ph:[o]})\n\n    def optimize(self, mb_obs, mb_rew, mb_act, mb_obs2, mb_done):\n        mb_trg_qv = self.sess.run(self.target_qv, feed_dict={self.obs_ph:mb_obs2})\n        y_r = q_target_values(mb_rew, mb_done, mb_trg_qv, self.discount)\n\n        # training step on the minibatch\n        self.sess.run(self.v_opt, feed_dict={self.obs_ph:mb_obs, self.y_ph:y_r, self.act_ph: mb_act})\n\n    def update_target_network(self):\n        # run the session to update the target network\n        self.sess.run(self.update_target_op)\n\n\nclass UCB1:\n    def __init__(self, algos, epsilon):\n        self.n = 0\n        self.epsilon = epsilon\n        self.algos = algos\n\n        self.nk = np.zeros(len(algos))\n        self.xk = np.zeros(len(algos))\n\n    def choose_algorithm(self):\n        # make sure that every algorithm has been tried at least a few times\n        for i in range(len(self.algos)):\n            if self.nk[i] < 5:\n                return np.random.randint(len(self.algos))\n\n        # otherwise take the best algorithm following UCB1\n        return np.argmax([self.xk[i] + np.sqrt(self.epsilon * np.log(self.n) / self.nk[i]) for i in range(len(self.algos))])\n\n    def update(self, idx_algo, traj_return):\n        # Update the mean RL return \n        self.xk[idx_algo] = (self.nk[idx_algo] * self.xk[idx_algo] + traj_return) / (self.nk[idx_algo] + 1)\n        # increase the number of trajectories run\n        self.nk[idx_algo] += 1\n        self.n += 1\n\n\ndef ESBAS(env_name, hidden_sizes=[32], lr=1e-2, num_epochs=2000, buffer_size=100000, discount=0.99, render_cycle=100, update_target_net=1000, \n        batch_size=64, update_freq=4, min_buffer_size=5000, test_frequency=20, start_explor=1, end_explor=0.1, explor_steps=100000,\n        xi=1):\n\n    # reset the default graph\n    tf.reset_default_graph()\n\n    # Create the environments for both training and testing\n    env = gym.make(env_name)\n    # Add a monitor to the test env to store the videos\n    env_test = gym.wrappers.Monitor(gym.make(env_name), \"VIDEOS/TEST_VIDEOS\"+env_name+str(current_milli_time()),force=True, video_callable=lambda x: x%20==0)\n\n    dqns = []\n    for l in hidden_sizes:\n        dqns.append(DQN_optimization(env.observation_space.shape, env.action_space.n, l, lr, discount))\n\n    # Time\n    now = datetime.now()\n    clock_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, int(now.second))\n    print('Time:', clock_time)\n\n    LOG_DIR = 'log_dir/'+env_name\n    hyp_str = \"-lr_{}-upTN_{}-upF_{}-xi_{}\" .format(lr, update_target_net, update_freq, xi)\n\n    # initialize the File Writer for writing TensorBoard summaries\n    file_writer = tf.summary.FileWriter(LOG_DIR+'/ESBAS_'+clock_time+'_'+hyp_str, tf.get_default_graph())\n\n    def DQNs_update(step_counter):\n        # If it's time to train the networks:\n        if len(buffer) > min_buffer_size and (step_counter % update_freq == 0):\n        \n            # sample a minibatch from the buffer\n            mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)\n\n            for dqn in dqns:\n                dqn.optimize(mb_obs, mb_rew, mb_act, mb_obs2, mb_done)\n\n        # Every update_target_net steps, update the target network\n        if len(buffer) > min_buffer_size and (step_counter % update_target_net == 0):\n\n            for dqn in dqns:\n                dqn.update_target_network()\n    \n\n    step_count = 0\n    episode = 0\n    beta = 1\n\n    # Initialize the experience buffer\n    buffer = ExperienceBuffer(buffer_size)\n\n    obs = env.reset()\n\n    # policy exploration initialization\n    eps = start_explor\n    eps_decay = (start_explor - end_explor) / explor_steps\n\n\n    for ep in range(num_epochs):\n\n        # Policies' training\n        for i in range(2**(beta-1), 2**beta):\n            DQNs_update(i)\n\n        ucb1 = UCB1(dqns, xi)\n        list_bests = []\n        ep_rew = []\n        beta += 1\n\n        while step_count < 2**beta:\n\n            # Choose the algorithm that will run the next trajectory\n            best_dqn = ucb1.choose_algorithm()\n            list_bests.append(best_dqn)\n\n            summary = tf.Summary()\n            summary.value.add(tag='algorithm_selected', simple_value=best_dqn)\n            file_writer.add_summary(summary, step_count)\n            file_writer.flush()\n\n            g_rew = 0\n            done = False\n                \n            while not done:\n                # Epsilon decay\n                if eps > 
end_explor:\n                    eps -= eps_decay\n                \n\n                # Choose an eps-greedy action \n                act = eps_greedy(np.squeeze(dqns[best_dqn].act(obs)), eps=eps)\n\n                # execute the action in the environment\n                obs2, rew, done, _ = env.step(act)\n\n                # Add the transition to the replay buffer\n                buffer.add(obs, rew, act, obs2, done)\n\n                obs = obs2\n                g_rew += rew\n                step_count += 1\n            \n\n            # Update the UCB parameters of the algorithm just used\n            ucb1.update(best_dqn, g_rew)\n\n            # The episode has ended: reset the environment and reinitialize the variables\n            obs = env.reset()\n            ep_rew.append(g_rew)\n            g_rew = 0\n            episode += 1\n\n\n            # Print some stats and test the best policy\n            summary = tf.Summary()\n            summary.value.add(tag='train_performance', simple_value=np.mean(ep_rew))\n\n            if episode % 10 == 0:\n                unique, counts = np.unique(list_bests, return_counts=True)\n                print(dict(zip(unique, counts)))\n\n                test_agent_results = test_agent(env_test, dqns[best_dqn].act, num_games=10, summary=summary)\n                print('Epoch:%4d Episode:%4d Rew:%4.2f, Eps:%2.2f -- Step:%5d -- Test:%4.2f Best:%2d Last:%2d' % (ep,episode,np.mean(ep_rew), eps, step_count, np.mean(test_agent_results), best_dqn, g_rew))\n\n            file_writer.add_summary(summary, step_count)\n            file_writer.flush()\n\n\n    file_writer.close()\n    env.close()\n\n\nif __name__ == '__main__':\n\n    #ESBAS('Acrobot-v1', hidden_sizes=[[64, 64]], lr=4e-4, buffer_size=100000, update_target_net=100, batch_size=32, \n    #    update_freq=4, min_buffer_size=100, render_cycle=10000, explor_steps=50000, num_epochs=20000, end_explor=0.1)\n\n    ESBAS('Acrobot-v1', hidden_sizes=[[64], [16, 16], [64, 64]], lr=4e-4, buffer_size=100000, update_target_net=100, batch_size=32, \n        update_freq=4, min_buffer_size=100, render_cycle=10000, explor_steps=50000, num_epochs=20000, end_explor=0.1,\n        xi=1./4)
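\n\n# The following demo is an editor's sketch, not used by ESBAS above.\ndef _ucb1_demo():\n    '''\n    Exercise the UCB1 selector on a toy two-armed bandit where arm 1 always\n    pays 1 and arm 0 always pays 0. After the warm-up and a few hundred\n    updates, choose_algorithm should settle on index 1.\n    '''\n    ucb = UCB1(algos=[None, None], epsilon=0.5)  # 'algos' is only used for its length\n    # warm up both arms so that the UCB1 formula is well defined\n    for arm in (0, 1):\n        for _ in range(5):\n            ucb.update(arm, 1.0 if arm == 1 else 0.0)\n    for _ in range(100):\n        arm = ucb.choose_algorithm()\n        ucb.update(arm, 1.0 if arm == 1 else 0.0)\n    assert ucb.choose_algorithm() == 1"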
  },
  {
    "path": "README.md",
    "content": "\n\n\n# Reinforcement Learning Algorithms with Python\n\n<a href=\"https://www.packtpub.com/data/hands-on-reinforcement-learning-algorithms-with-python\"><img src=\"https://www.packtpub.com/media/catalog/product/cache/ecd051e9670bd57df35c8f0b122d8aea/9/7/9781789131116-original.jpeg\" alt=\"Reinforcement Learning Algorithms with Python\" height=\"256px\" align=\"right\"></a>\n\nThis is the code repository for [Reinforcement Learning Algorithms with Python](https://www.packtpub.com/data/hands-on-reinforcement-learning-algorithms-with-python), published by Packt.\n\n**Learn, understand, and develop smart algorithms for addressing AI challenges**\n\n## What is this book about?\nReinforcement Learning (RL) is a popular and promising branch of AI that involves making smarter models and agents that can automatically determine ideal behavior based on changing requirements. This book will help you master RL algorithms and understand their implementation as you build self-learning agents.\nStarting with an introduction to the tools, libraries, and setup needed to work in the RL environment, this book covers the building blocks of RL and delves into value-based methods, such as the application of Q-learning and SARSA algorithms. You'll learn how to use a combination of Q-learning and neural networks to solve complex problems. Furthermore, you'll study the policy gradient methods, TRPO, and PPO, to improve performance and stability, before moving on to the DDPG and TD3 deterministic algorithms. This book also covers how imitation learning techniques work and how Dagger can teach an agent to drive. You'll discover evolutionary strategies and black-box optimization techniques, and see how they can improve RL algorithms. Finally, you'll get to grips with exploration approaches, such as UCB and UCB1, and develop a meta-algorithm called ESBAS.\nBy the end of the book, you'll have worked with key RL algorithms to overcome challenges in real-world applications, and be part of the RL research community.\n\n\nThis book covers the following exciting features:\n* Develop an agent to play CartPole using the OpenAI Gym interface\n* Discover the model-based reinforcement learning paradigm\n* Solve the Frozen Lake problem with dynamic programming\n* Explore Q-learning and SARSA with a view to playing a taxi game\n* Apply Deep Q-Networks (DQNs) to Atari games using Gym\n* Study policy gradient algorithms, including Actor-Critic and REINFORCE\n* Understand and apply PPO and TRPO in continuous locomotion environments\n* Get to grips with evolution strategies for solving the lunar lander problem\n\nIf you feel this book is for you, get your [copy](https://www.amazon.com/Reinforcement-Learning-Algorithms-Python-understand/dp/1789131111/) today!\n\n<a href=\"https://www.packtpub.com/?utm_source=github&utm_medium=banner&utm_campaign=GitHubBanner\"><img src=\"https://raw.githubusercontent.com/PacktPublishing/GitHub/master/GitHub.png\" \nalt=\"https://www.packtpub.com/\" border=\"5\" /></a>\n\n## Instructions and Navigations\nAll of the code is organized into folders. 
For example, Chapter02.\n\nThe code will look like the following:\n```\nimport gym\n\n# create the environment \nenv = gym.make(\"CartPole-v1\")\n# reset the environment before starting\nenv.reset()\n\n# loop 10 times\nfor i in range(10):\n    # take a random action\n    env.step(env.action_space.sample())\n    # render the game\n    env.render()\n\n# close the environment\nenv.close()\n```\n\n**Following is what you need for this book:**\nIf you are an AI researcher, deep learning user, or anyone who wants to learn reinforcement learning from scratch, this book is for you. You’ll also find this reinforcement learning book useful if you want to learn about the advancements in the field. Working knowledge of Python is necessary.\n\n\nWith the following software and hardware list you can run all code files present in the book (Chapter 1-11).\n### Software and Hardware List\n| Chapter | Software required | OS required |\n| -------- | ------------------------------------ | ----------------------------------- |\n| All | Python 3.6 or higher | Windows, Mac OS X, and Linux (Any) |\n| All | TensorFlow 1.14 or higher | Windows, Mac OS X, and Linux (Any) |\n\nWe also provide a PDF file that has color images of the screenshots/diagrams used in this book. [Click here to download it](http://www.packtpub.com/sites/default/files/downloads/9781789131116_ColorImages.pdf).\n\n### Related products\n* Hands-On Reinforcement Learning with Python [[Packt]](https://www.packtpub.com/big-data-and-business-intelligence/hands-reinforcement-learning-python) [[Amazon]](https://www.amazon.com/Hands-Reinforcement-Learning-Python-reinforcement-ebook/dp/B079Q3WLM4/)\n\n* Python Reinforcement Learning Projects [[Packt]](https://www.packtpub.com/big-data-and-business-intelligence/python-reinforcement-learning-projects) [[Amazon]](https://www.amazon.com/Python-Reinforcement-Learning-Projects-hands-ebook/dp/B07F2S82W3/)\n\n## Get to Know the Author\n**Andrea Lonza** is a deep learning engineer with a great passion for artificial intelligence and a desire to create machines that act intelligently. He has acquired expert knowledge in reinforcement learning, natural language processing, and computer vision through academic and industrial machine learning projects. He has also participated in several Kaggle competitions, achieving high results. He is always looking for compelling challenges and loves to prove himself.\n\n\n\n### Suggestions and Feedback\n[Click here](https://docs.google.com/forms/d/e/1FAIpQLSdy7dATC6QmEL81FIUuymZ0Wy9vH1jHkvpY57OiMeKGqib_Ow/viewform) if you have any feedback or suggestions.\n\n\n### Download a free PDF\n\n <i>If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.<br>Simply click on the link to claim your free PDF.</i>\n<p align=\"center\"> <a href=\"https://packt.link/free-ebook/9781789131116\">https://packt.link/free-ebook/9781789131116 </a> </p>"
  }
]