[
  {
    "path": ".gitignore",
    "content": ".idea\n__pycache__\n"
  },
  {
    "path": "LICENCE",
    "content": "MIT License\n\nCopyright (c) 2017\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE."
  },
  {
    "path": "README.md",
    "content": "<p align=\"center\">\n    <a href=\"https://www.youtube.com/watch?v=pieI7rOXELI&list=PLXO45tsB95cIplu-fLMpUEEZTwrDNh6Ba\" target=\"_blank\">\n    <img width=\"60%\" src=\"https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/RL_cover.jpg\" style=\"max-width:100%;\">\n    </a>\n</p>\n\n\n<br>\n\n# Reinforcement Learning Methods and Tutorials\n\nThese reinforcement learning tutorials cover everything from basic RL algorithms to advanced algorithms developed in recent years.\n\n**If you speak Chinese, visit [莫烦 Python](https://mofanpy.com) or my [YouTube channel](https://www.youtube.com/channel/UCdyjiB5H8Pu7aDTNVXTTpcg) for more.**\n\n**In response to many requests, these tutorials are also available in English in this playlist:** ([https://www.youtube.com/playlist?list=PLXO45tsB95cIplu-fLMpUEEZTwrDNh6Ba](https://www.youtube.com/playlist?list=PLXO45tsB95cIplu-fLMpUEEZTwrDNh6Ba))\n\n# Table of Contents\n\n* Tutorials\n    * [Simple entry example](contents/1_command_line_reinforcement_learning)\n    * [Q-learning](contents/2_Q_Learning_maze)\n    * [Sarsa](contents/3_Sarsa_maze)\n    * [Sarsa(lambda)](contents/4_Sarsa_lambda_maze)\n    * [Deep Q Network (DQN)](contents/5_Deep_Q_Network)\n    * [Using OpenAI Gym](contents/6_OpenAI_gym)\n    * [Double DQN](contents/5.1_Double_DQN)\n    * [DQN with Prioritized Experience Replay](contents/5.2_Prioritized_Replay_DQN)\n    * [Dueling DQN](contents/5.3_Dueling_DQN)\n    * [Policy Gradients](contents/7_Policy_gradient_softmax)\n    * [Actor-Critic](contents/8_Actor_Critic_Advantage)\n    * [Deep Deterministic Policy Gradient (DDPG)](contents/9_Deep_Deterministic_Policy_Gradient_DDPG)\n    * [A3C](contents/10_A3C)\n    * [Dyna-Q](contents/11_Dyna_Q)\n    * [Proximal Policy Optimization (PPO)](contents/12_Proximal_Policy_Optimization)\n    * [Curiosity Model](/contents/Curiosity_Model), [Random Network Distillation 
(RND)](/contents/Curiosity_Model/Random_Network_Distillation.py)\n* [Some of my experiments](experiments)\n    * [2D Car](experiments/2D_car)\n    * [Robot arm](experiments/Robot_arm)\n    * [BipedalWalker](experiments/Solve_BipedalWalker)\n    * [LunarLander](experiments/Solve_LunarLander)\n\n# Some RL Networks\n### [Deep Q Network](contents/5_Deep_Q_Network)\n\n<a href=\"contents/5_Deep_Q_Network\">\n    <img class=\"course-image\" src=\"https://mofanpy.com/static/results/reinforcement-learning/4-3-2.png\">\n</a>\n\n### [Double DQN](contents/5.1_Double_DQN)\n\n<a href=\"contents/5.1_Double_DQN\">\n    <img class=\"course-image\" src=\"https://mofanpy.com/static/results/reinforcement-learning/4-5-3.png\">\n</a>\n\n### [Dueling DQN](contents/5.3_Dueling_DQN)\n\n<a href=\"contents/5.3_Dueling_DQN\">\n    <img class=\"course-image\" src=\"https://mofanpy.com/static/results/reinforcement-learning/4-7-4.png\">\n</a>\n\n### [Actor Critic](contents/8_Actor_Critic_Advantage)\n\n<a href=\"contents/8_Actor_Critic_Advantage\">\n    <img class=\"course-image\" src=\"https://mofanpy.com/static/results/reinforcement-learning/6-1-1.png\">\n</a>\n\n### [Deep Deterministic Policy Gradient](contents/9_Deep_Deterministic_Policy_Gradient_DDPG)\n\n<a href=\"contents/9_Deep_Deterministic_Policy_Gradient_DDPG\">\n    <img class=\"course-image\" src=\"https://mofanpy.com/static/results/reinforcement-learning/6-2-2.png\">\n</a>\n\n### [A3C](contents/10_A3C)\n\n<a href=\"contents/10_A3C\">\n    <img class=\"course-image\" src=\"https://mofanpy.com/static/results/reinforcement-learning/6-3-2.png\">\n</a>\n\n### [Proximal Policy Optimization (PPO)](contents/12_Proximal_Policy_Optimization)\n\n<a href=\"contents/12_Proximal_Policy_Optimization\">\n    <img class=\"course-image\" src=\"https://mofanpy.com/static/results/reinforcement-learning/6-4-3.png\">\n</a>\n\n### [Curiosity Model](/contents/Curiosity_Model)\n\n<a href=\"/contents/Curiosity_Model\">\n    <img class=\"course-image\" 
src=\"/contents/Curiosity_Model/Curiosity.png\">\n</a>\n\n# Donation\n\n*If this does help you, please consider donating to support me for better tutorials. Any contribution is greatly appreciated!*\n\n<div >\n  <a href=\"https://www.paypal.com/cgi-bin/webscr?cmd=_donations&amp;business=morvanzhou%40gmail%2ecom&amp;lc=C2&amp;item_name=MorvanPython&amp;currency_code=AUD&amp;bn=PP%2dDonationsBF%3abtn_donateCC_LG%2egif%3aNonHosted\">\n    <img style=\"border-radius: 20px;  box-shadow: 0px 0px 10px 1px  #888888;\"\n         src=\"https://www.paypalobjects.com/webstatic/en_US/i/btn/png/silver-pill-paypal-44px.png\"\n         alt=\"Paypal\"\n         height=\"auto\" ></a>\n</div>\n\n<div>\n  <a href=\"https://www.patreon.com/morvan\">\n    <img src=\"https://mofanpy.com/static/img/support/patreon.jpg\"\n         alt=\"Patreon\"\n         height=120></a>\n</div>\n"
  },
  {
    "path": "contents/10_A3C/A3C_RNN.py",
    "content": "\"\"\"\nAsynchronous Advantage Actor Critic (A3C) + RNN with continuous action space, Reinforcement Learning.\n\nThe Pendulum example.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow 1.8.0\ngym 0.10.5\n\"\"\"\n\nimport multiprocessing\nimport threading\nimport tensorflow as tf\nimport numpy as np\nimport gym\nimport os\nimport shutil\nimport matplotlib.pyplot as plt\n\nGAME = 'Pendulum-v0'\nOUTPUT_GRAPH = True\nLOG_DIR = './log'\nN_WORKERS = multiprocessing.cpu_count()\nMAX_EP_STEP = 200\nMAX_GLOBAL_EP = 1500\nGLOBAL_NET_SCOPE = 'Global_Net'\nUPDATE_GLOBAL_ITER = 5\nGAMMA = 0.9\nENTROPY_BETA = 0.01\nLR_A = 0.0001    # learning rate for actor\nLR_C = 0.001    # learning rate for critic\nGLOBAL_RUNNING_R = []\nGLOBAL_EP = 0\n\nenv = gym.make(GAME)\n\nN_S = env.observation_space.shape[0]\nN_A = env.action_space.shape[0]\nA_BOUND = [env.action_space.low, env.action_space.high]\n\n\nclass ACNet(object):\n    def __init__(self, scope, globalAC=None):\n\n        if scope == GLOBAL_NET_SCOPE:   # get global network\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_params, self.c_params = self._build_net(scope)[-2:]\n        else:   # local net, calculate losses\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_his = tf.placeholder(tf.float32, [None, N_A], 'A')\n                self.v_target = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\n\n                mu, sigma, self.v, self.a_params, self.c_params = self._build_net(scope)\n\n                td = tf.subtract(self.v_target, self.v, name='TD_error')\n                with tf.name_scope('c_loss'):\n                    self.c_loss = tf.reduce_mean(tf.square(td))\n\n                with tf.name_scope('wrap_a_out'):\n                    mu, sigma = mu * A_BOUND[1], sigma + 1e-4\n\n    
            normal_dist = tf.distributions.Normal(mu, sigma)\n\n                with tf.name_scope('a_loss'):\n                    log_prob = normal_dist.log_prob(self.a_his)\n                    exp_v = log_prob * tf.stop_gradient(td)\n                    entropy = normal_dist.entropy()  # encourage exploration\n                    self.exp_v = ENTROPY_BETA * entropy + exp_v\n                    self.a_loss = tf.reduce_mean(-self.exp_v)\n\n                with tf.name_scope('choose_a'):  # use local params to choose action\n                    self.A = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=[0, 1]), A_BOUND[0], A_BOUND[1])\n                with tf.name_scope('local_grad'):\n                    self.a_grads = tf.gradients(self.a_loss, self.a_params)\n                    self.c_grads = tf.gradients(self.c_loss, self.c_params)\n\n            with tf.name_scope('sync'):\n                with tf.name_scope('pull'):\n                    self.pull_a_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.a_params, globalAC.a_params)]\n                    self.pull_c_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.c_params, globalAC.c_params)]\n                with tf.name_scope('push'):\n                    self.update_a_op = OPT_A.apply_gradients(zip(self.a_grads, globalAC.a_params))\n                    self.update_c_op = OPT_C.apply_gradients(zip(self.c_grads, globalAC.c_params))\n\n    def _build_net(self, scope):\n        w_init = tf.random_normal_initializer(0., .1)\n        with tf.variable_scope('critic'):   # only critic controls the rnn update\n            cell_size = 64\n            s = tf.expand_dims(self.s, axis=1,\n                               name='timely_input')  # [time_step, feature] => [time_step, batch, feature]\n            rnn_cell = tf.contrib.rnn.BasicRNNCell(cell_size)\n            self.init_state = rnn_cell.zero_state(batch_size=1, dtype=tf.float32)\n            outputs, self.final_state = tf.nn.dynamic_rnn(\n              
  cell=rnn_cell, inputs=s, initial_state=self.init_state, time_major=True)\n            cell_out = tf.reshape(outputs, [-1, cell_size], name='flatten_rnn_outputs')  # joined state representation\n            l_c = tf.layers.dense(cell_out, 50, tf.nn.relu6, kernel_initializer=w_init, name='lc')\n            v = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='v')  # state value\n\n        with tf.variable_scope('actor'):  # state representation is based on critic\n            l_a = tf.layers.dense(cell_out, 80, tf.nn.relu6, kernel_initializer=w_init, name='la')\n            mu = tf.layers.dense(l_a, N_A, tf.nn.tanh, kernel_initializer=w_init, name='mu')\n            sigma = tf.layers.dense(l_a, N_A, tf.nn.softplus, kernel_initializer=w_init, name='sigma')\n        a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n        c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n        return mu, sigma, v, a_params, c_params\n\n    def update_global(self, feed_dict):  # run by a local\n        SESS.run([self.update_a_op, self.update_c_op], feed_dict)  # local grads applies to global net\n\n    def pull_global(self):  # run by a local\n        SESS.run([self.pull_a_params_op, self.pull_c_params_op])\n\n    def choose_action(self, s, cell_state):  # run by a local\n        s = s[np.newaxis, :]\n        a, cell_state = SESS.run([self.A, self.final_state], {self.s: s, self.init_state: cell_state})\n        return a, cell_state\n\n\nclass Worker(object):\n    def __init__(self, name, globalAC):\n        self.env = gym.make(GAME).unwrapped\n        self.name = name\n        self.AC = ACNet(name, globalAC)\n\n    def work(self):\n        global GLOBAL_RUNNING_R, GLOBAL_EP\n        total_step = 1\n        buffer_s, buffer_a, buffer_r = [], [], []\n        while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP:\n            s = self.env.reset()\n            ep_r = 0\n            rnn_state = 
SESS.run(self.AC.init_state)    # zero rnn state at beginning\n            keep_state = rnn_state.copy()       # keep rnn state for updating global net\n            for ep_t in range(MAX_EP_STEP):\n                if self.name == 'W_0':\n                    self.env.render()\n\n                a, rnn_state_ = self.AC.choose_action(s, rnn_state)  # get the action and next rnn state\n                s_, r, done, info = self.env.step(a)\n                done = True if ep_t == MAX_EP_STEP - 1 else False\n\n                ep_r += r\n                buffer_s.append(s)\n                buffer_a.append(a)\n                buffer_r.append((r+8)/8)    # normalize\n\n                if total_step % UPDATE_GLOBAL_ITER == 0 or done:   # update global and assign to local net\n                    if done:\n                        v_s_ = 0   # terminal\n                    else:\n                        v_s_ = SESS.run(self.AC.v, {self.AC.s: s_[np.newaxis, :], self.AC.init_state: rnn_state_})[0, 0]\n                    buffer_v_target = []\n                    for r in buffer_r[::-1]:    # reverse buffer r\n                        v_s_ = r + GAMMA * v_s_\n                        buffer_v_target.append(v_s_)\n                    buffer_v_target.reverse()\n\n                    buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.vstack(buffer_a), np.vstack(buffer_v_target)\n\n                    feed_dict = {\n                        self.AC.s: buffer_s,\n                        self.AC.a_his: buffer_a,\n                        self.AC.v_target: buffer_v_target,\n                        self.AC.init_state: keep_state,\n                    }\n\n                    self.AC.update_global(feed_dict)\n                    buffer_s, buffer_a, buffer_r = [], [], []\n                    self.AC.pull_global()\n                    keep_state = rnn_state_.copy()   # replace the keep_state as the new initial rnn state_\n\n                s = s_\n                rnn_state = rnn_state_  
# renew rnn state\n                total_step += 1\n\n                if done:\n                    if len(GLOBAL_RUNNING_R) == 0:  # record running episode reward\n                        GLOBAL_RUNNING_R.append(ep_r)\n                    else:\n                        GLOBAL_RUNNING_R.append(0.9 * GLOBAL_RUNNING_R[-1] + 0.1 * ep_r)\n                    print(\n                        self.name,\n                        \"Ep:\", GLOBAL_EP,\n                        \"| Ep_r: %i\" % GLOBAL_RUNNING_R[-1],\n                          )\n                    GLOBAL_EP += 1\n                    break\n\nif __name__ == \"__main__\":\n    SESS = tf.Session()\n\n    with tf.device(\"/cpu:0\"):\n        OPT_A = tf.train.RMSPropOptimizer(LR_A, name='RMSPropA')\n        OPT_C = tf.train.RMSPropOptimizer(LR_C, name='RMSPropC')\n        GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE)  # we only need its params\n        workers = []\n        # Create worker\n        for i in range(N_WORKERS):\n            i_name = 'W_%i' % i   # worker name\n            workers.append(Worker(i_name, GLOBAL_AC))\n\n    COORD = tf.train.Coordinator()\n    SESS.run(tf.global_variables_initializer())\n\n    if OUTPUT_GRAPH:\n        if os.path.exists(LOG_DIR):\n            shutil.rmtree(LOG_DIR)\n        tf.summary.FileWriter(LOG_DIR, SESS.graph)\n\n    worker_threads = []\n    for worker in workers:\n        # pass the bound method directly; a lambda would look up `worker` late and could run the wrong worker\n        t = threading.Thread(target=worker.work)\n        t.start()\n        worker_threads.append(t)\n    COORD.join(worker_threads)\n\n    plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R)\n    plt.xlabel('step')\n    plt.ylabel('Total moving reward')\n    plt.show()\n\n"
  },
  {
    "path": "contents/10_A3C/A3C_continuous_action.py",
    "content": "\"\"\"\nAsynchronous Advantage Actor Critic (A3C) with continuous action space, Reinforcement Learning.\n\nThe Pendulum example.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow 1.8.0\ngym 0.10.5\n\"\"\"\n\nimport multiprocessing\nimport threading\nimport tensorflow as tf\nimport numpy as np\nimport gym\nimport os\nimport shutil\nimport matplotlib.pyplot as plt\n\nGAME = 'Pendulum-v0'\nOUTPUT_GRAPH = True\nLOG_DIR = './log'\nN_WORKERS = multiprocessing.cpu_count()\nMAX_EP_STEP = 200\nMAX_GLOBAL_EP = 2000\nGLOBAL_NET_SCOPE = 'Global_Net'\nUPDATE_GLOBAL_ITER = 10\nGAMMA = 0.9\nENTROPY_BETA = 0.01\nLR_A = 0.0001    # learning rate for actor\nLR_C = 0.001    # learning rate for critic\nGLOBAL_RUNNING_R = []\nGLOBAL_EP = 0\n\nenv = gym.make(GAME)\n\nN_S = env.observation_space.shape[0]\nN_A = env.action_space.shape[0]\nA_BOUND = [env.action_space.low, env.action_space.high]\n\n\nclass ACNet(object):\n    def __init__(self, scope, globalAC=None):\n\n        if scope == GLOBAL_NET_SCOPE:   # get global network\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_params, self.c_params = self._build_net(scope)[-2:]\n        else:   # local net, calculate losses\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_his = tf.placeholder(tf.float32, [None, N_A], 'A')\n                self.v_target = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\n\n                mu, sigma, self.v, self.a_params, self.c_params = self._build_net(scope)\n\n                td = tf.subtract(self.v_target, self.v, name='TD_error')\n                with tf.name_scope('c_loss'):\n                    self.c_loss = tf.reduce_mean(tf.square(td))\n\n                with tf.name_scope('wrap_a_out'):\n                    mu, sigma = mu * A_BOUND[1], sigma + 1e-4\n\n         
       normal_dist = tf.distributions.Normal(mu, sigma)\n\n                with tf.name_scope('a_loss'):\n                    log_prob = normal_dist.log_prob(self.a_his)\n                    exp_v = log_prob * tf.stop_gradient(td)\n                    entropy = normal_dist.entropy()  # encourage exploration\n                    self.exp_v = ENTROPY_BETA * entropy + exp_v\n                    self.a_loss = tf.reduce_mean(-self.exp_v)\n\n                with tf.name_scope('choose_a'):  # use local params to choose action\n                    self.A = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=[0, 1]), A_BOUND[0], A_BOUND[1])\n                with tf.name_scope('local_grad'):\n                    self.a_grads = tf.gradients(self.a_loss, self.a_params)\n                    self.c_grads = tf.gradients(self.c_loss, self.c_params)\n\n            with tf.name_scope('sync'):\n                with tf.name_scope('pull'):\n                    self.pull_a_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.a_params, globalAC.a_params)]\n                    self.pull_c_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.c_params, globalAC.c_params)]\n                with tf.name_scope('push'):\n                    self.update_a_op = OPT_A.apply_gradients(zip(self.a_grads, globalAC.a_params))\n                    self.update_c_op = OPT_C.apply_gradients(zip(self.c_grads, globalAC.c_params))\n\n    def _build_net(self, scope):\n        w_init = tf.random_normal_initializer(0., .1)\n        with tf.variable_scope('actor'):\n            l_a = tf.layers.dense(self.s, 200, tf.nn.relu6, kernel_initializer=w_init, name='la')\n            mu = tf.layers.dense(l_a, N_A, tf.nn.tanh, kernel_initializer=w_init, name='mu')\n            sigma = tf.layers.dense(l_a, N_A, tf.nn.softplus, kernel_initializer=w_init, name='sigma')\n        with tf.variable_scope('critic'):\n            l_c = tf.layers.dense(self.s, 100, tf.nn.relu6, kernel_initializer=w_init, name='lc')\n           
 v = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='v')  # state value\n        a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n        c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n        return mu, sigma, v, a_params, c_params\n\n    def update_global(self, feed_dict):  # run by a local\n        SESS.run([self.update_a_op, self.update_c_op], feed_dict)  # local grads applies to global net\n\n    def pull_global(self):  # run by a local\n        SESS.run([self.pull_a_params_op, self.pull_c_params_op])\n\n    def choose_action(self, s):  # run by a local\n        s = s[np.newaxis, :]\n        return SESS.run(self.A, {self.s: s})\n\n\nclass Worker(object):\n    def __init__(self, name, globalAC):\n        self.env = gym.make(GAME).unwrapped\n        self.name = name\n        self.AC = ACNet(name, globalAC)\n\n    def work(self):\n        global GLOBAL_RUNNING_R, GLOBAL_EP\n        total_step = 1\n        buffer_s, buffer_a, buffer_r = [], [], []\n        while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP:\n            s = self.env.reset()\n            ep_r = 0\n            for ep_t in range(MAX_EP_STEP):\n                # if self.name == 'W_0':\n                #     self.env.render()\n                a = self.AC.choose_action(s)\n                s_, r, done, info = self.env.step(a)\n                done = True if ep_t == MAX_EP_STEP - 1 else False\n\n                ep_r += r\n                buffer_s.append(s)\n                buffer_a.append(a)\n                buffer_r.append((r+8)/8)    # normalize\n\n                if total_step % UPDATE_GLOBAL_ITER == 0 or done:   # update global and assign to local net\n                    if done:\n                        v_s_ = 0   # terminal\n                    else:\n                        v_s_ = SESS.run(self.AC.v, {self.AC.s: s_[np.newaxis, :]})[0, 0]\n                    buffer_v_target = []\n                    
for r in buffer_r[::-1]:    # reverse buffer r\n                        v_s_ = r + GAMMA * v_s_\n                        buffer_v_target.append(v_s_)\n                    buffer_v_target.reverse()\n\n                    buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.vstack(buffer_a), np.vstack(buffer_v_target)\n                    feed_dict = {\n                        self.AC.s: buffer_s,\n                        self.AC.a_his: buffer_a,\n                        self.AC.v_target: buffer_v_target,\n                    }\n                    self.AC.update_global(feed_dict)\n                    buffer_s, buffer_a, buffer_r = [], [], []\n                    self.AC.pull_global()\n\n                s = s_\n                total_step += 1\n                if done:\n                    if len(GLOBAL_RUNNING_R) == 0:  # record running episode reward\n                        GLOBAL_RUNNING_R.append(ep_r)\n                    else:\n                        GLOBAL_RUNNING_R.append(0.9 * GLOBAL_RUNNING_R[-1] + 0.1 * ep_r)\n                    print(\n                        self.name,\n                        \"Ep:\", GLOBAL_EP,\n                        \"| Ep_r: %i\" % GLOBAL_RUNNING_R[-1],\n                          )\n                    GLOBAL_EP += 1\n                    break\n\nif __name__ == \"__main__\":\n    SESS = tf.Session()\n\n    with tf.device(\"/cpu:0\"):\n        OPT_A = tf.train.RMSPropOptimizer(LR_A, name='RMSPropA')\n        OPT_C = tf.train.RMSPropOptimizer(LR_C, name='RMSPropC')\n        GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE)  # we only need its params\n        workers = []\n        # Create worker\n        for i in range(N_WORKERS):\n            i_name = 'W_%i' % i   # worker name\n            workers.append(Worker(i_name, GLOBAL_AC))\n\n    COORD = tf.train.Coordinator()\n    SESS.run(tf.global_variables_initializer())\n\n    if OUTPUT_GRAPH:\n        if os.path.exists(LOG_DIR):\n            shutil.rmtree(LOG_DIR)\n        
tf.summary.FileWriter(LOG_DIR, SESS.graph)\n\n    worker_threads = []\n    for worker in workers:\n        # pass the bound method directly; a lambda would look up `worker` late and could run the wrong worker\n        t = threading.Thread(target=worker.work)\n        t.start()\n        worker_threads.append(t)\n    COORD.join(worker_threads)\n\n    plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R)\n    plt.xlabel('step')\n    plt.ylabel('Total moving reward')\n    plt.show()\n\n"
  },
  {
    "path": "contents/10_A3C/A3C_discrete_action.py",
    "content": "\"\"\"\nAsynchronous Advantage Actor Critic (A3C) with discrete action space, Reinforcement Learning.\n\nThe Cartpole example.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow 1.8.0\ngym 0.10.5\n\"\"\"\n\nimport multiprocessing\nimport threading\nimport tensorflow as tf\nimport numpy as np\nimport gym\nimport os\nimport shutil\nimport matplotlib.pyplot as plt\n\n\nGAME = 'CartPole-v0'\nOUTPUT_GRAPH = True\nLOG_DIR = './log'\nN_WORKERS = multiprocessing.cpu_count()\nMAX_GLOBAL_EP = 1000\nGLOBAL_NET_SCOPE = 'Global_Net'\nUPDATE_GLOBAL_ITER = 10\nGAMMA = 0.9\nENTROPY_BETA = 0.001\nLR_A = 0.001    # learning rate for actor\nLR_C = 0.001    # learning rate for critic\nGLOBAL_RUNNING_R = []\nGLOBAL_EP = 0\n\nenv = gym.make(GAME)\nN_S = env.observation_space.shape[0]\nN_A = env.action_space.n\n\n\nclass ACNet(object):\n    def __init__(self, scope, globalAC=None):\n\n        if scope == GLOBAL_NET_SCOPE:   # get global network\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_params, self.c_params = self._build_net(scope)[-2:]\n        else:   # local net, calculate losses\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_his = tf.placeholder(tf.int32, [None, ], 'A')\n                self.v_target = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\n\n                self.a_prob, self.v, self.a_params, self.c_params = self._build_net(scope)\n\n                td = tf.subtract(self.v_target, self.v, name='TD_error')\n                with tf.name_scope('c_loss'):\n                    self.c_loss = tf.reduce_mean(tf.square(td))\n\n                with tf.name_scope('a_loss'):\n                    log_prob = tf.reduce_sum(tf.log(self.a_prob + 1e-5) * tf.one_hot(self.a_his, N_A, dtype=tf.float32), axis=1, keep_dims=True)\n                    
exp_v = log_prob * tf.stop_gradient(td)\n                    entropy = -tf.reduce_sum(self.a_prob * tf.log(self.a_prob + 1e-5),\n                                             axis=1, keep_dims=True)  # encourage exploration\n                    self.exp_v = ENTROPY_BETA * entropy + exp_v\n                    self.a_loss = tf.reduce_mean(-self.exp_v)\n\n                with tf.name_scope('local_grad'):\n                    self.a_grads = tf.gradients(self.a_loss, self.a_params)\n                    self.c_grads = tf.gradients(self.c_loss, self.c_params)\n\n            with tf.name_scope('sync'):\n                with tf.name_scope('pull'):\n                    self.pull_a_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.a_params, globalAC.a_params)]\n                    self.pull_c_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.c_params, globalAC.c_params)]\n                with tf.name_scope('push'):\n                    self.update_a_op = OPT_A.apply_gradients(zip(self.a_grads, globalAC.a_params))\n                    self.update_c_op = OPT_C.apply_gradients(zip(self.c_grads, globalAC.c_params))\n\n    def _build_net(self, scope):\n        w_init = tf.random_normal_initializer(0., .1)\n        with tf.variable_scope('actor'):\n            l_a = tf.layers.dense(self.s, 200, tf.nn.relu6, kernel_initializer=w_init, name='la')\n            a_prob = tf.layers.dense(l_a, N_A, tf.nn.softmax, kernel_initializer=w_init, name='ap')\n        with tf.variable_scope('critic'):\n            l_c = tf.layers.dense(self.s, 100, tf.nn.relu6, kernel_initializer=w_init, name='lc')\n            v = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='v')  # state value\n        a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n        c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n        return a_prob, v, a_params, c_params\n\n    def update_global(self, feed_dict):  # run by a local\n      
  SESS.run([self.update_a_op, self.update_c_op], feed_dict)  # local grads applies to global net\n\n    def pull_global(self):  # run by a local\n        SESS.run([self.pull_a_params_op, self.pull_c_params_op])\n\n    def choose_action(self, s):  # run by a local\n        prob_weights = SESS.run(self.a_prob, feed_dict={self.s: s[np.newaxis, :]})\n        action = np.random.choice(range(prob_weights.shape[1]),\n                                  p=prob_weights.ravel())  # select action w.r.t the actions prob\n        return action\n\n\nclass Worker(object):\n    def __init__(self, name, globalAC):\n        self.env = gym.make(GAME).unwrapped\n        self.name = name\n        self.AC = ACNet(name, globalAC)\n\n    def work(self):\n        global GLOBAL_RUNNING_R, GLOBAL_EP\n        total_step = 1\n        buffer_s, buffer_a, buffer_r = [], [], []\n        while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP:\n            s = self.env.reset()\n            ep_r = 0\n            while True:\n                # if self.name == 'W_0':\n                #     self.env.render()\n                a = self.AC.choose_action(s)\n                s_, r, done, info = self.env.step(a)\n                if done: r = -5\n                ep_r += r\n                buffer_s.append(s)\n                buffer_a.append(a)\n                buffer_r.append(r)\n\n                if total_step % UPDATE_GLOBAL_ITER == 0 or done:   # update global and assign to local net\n                    if done:\n                        v_s_ = 0   # terminal\n                    else:\n                        v_s_ = SESS.run(self.AC.v, {self.AC.s: s_[np.newaxis, :]})[0, 0]\n                    buffer_v_target = []\n                    for r in buffer_r[::-1]:    # reverse buffer r\n                        v_s_ = r + GAMMA * v_s_\n                        buffer_v_target.append(v_s_)\n                    buffer_v_target.reverse()\n\n                    buffer_s, buffer_a, buffer_v_target = 
np.vstack(buffer_s), np.array(buffer_a), np.vstack(buffer_v_target)\n                    feed_dict = {\n                        self.AC.s: buffer_s,\n                        self.AC.a_his: buffer_a,\n                        self.AC.v_target: buffer_v_target,\n                    }\n                    self.AC.update_global(feed_dict)\n\n                    buffer_s, buffer_a, buffer_r = [], [], []\n                    self.AC.pull_global()\n\n                s = s_\n                total_step += 1\n                if done:\n                    if len(GLOBAL_RUNNING_R) == 0:  # record running episode reward\n                        GLOBAL_RUNNING_R.append(ep_r)\n                    else:\n                        GLOBAL_RUNNING_R.append(0.99 * GLOBAL_RUNNING_R[-1] + 0.01 * ep_r)\n                    print(\n                        self.name,\n                        \"Ep:\", GLOBAL_EP,\n                        \"| Ep_r: %i\" % GLOBAL_RUNNING_R[-1],\n                          )\n                    GLOBAL_EP += 1\n                    break\n\nif __name__ == \"__main__\":\n    SESS = tf.Session()\n\n    with tf.device(\"/cpu:0\"):\n        OPT_A = tf.train.RMSPropOptimizer(LR_A, name='RMSPropA')\n        OPT_C = tf.train.RMSPropOptimizer(LR_C, name='RMSPropC')\n        GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE)  # we only need its params\n        workers = []\n        # Create worker\n        for i in range(N_WORKERS):\n            i_name = 'W_%i' % i   # worker name\n            workers.append(Worker(i_name, GLOBAL_AC))\n\n    COORD = tf.train.Coordinator()\n    SESS.run(tf.global_variables_initializer())\n\n    if OUTPUT_GRAPH:\n        if os.path.exists(LOG_DIR):\n            shutil.rmtree(LOG_DIR)\n        tf.summary.FileWriter(LOG_DIR, SESS.graph)\n\n    worker_threads = []\n    for worker in workers:\n        # pass the bound method directly; a lambda would look up `worker` late and could run the wrong worker\n        t = threading.Thread(target=worker.work)\n        t.start()\n        worker_threads.append(t)\n    COORD.join(worker_threads)\n\n    
plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R)\n    plt.xlabel('Episode')\n    plt.ylabel('Total moving reward')\n    plt.show()\n"
  },
  {
    "path": "contents/10_A3C/A3C_distributed_tf.py",
    "content": "\"\"\"\nAsynchronous Advantage Actor Critic (A3C) with discrete action space, Reinforcement Learning.\n\nThe Cartpole example using distributed tensorflow + multiprocessing.\n\nView more on my tutorial page: https://morvanzhou.github.io/\n\n\"\"\"\n\nimport multiprocessing as mp\nimport tensorflow as tf\nimport numpy as np\nimport gym, time\nimport matplotlib.pyplot as plt\n\n\nUPDATE_GLOBAL_ITER = 10\nGAMMA = 0.9\nENTROPY_BETA = 0.001\nLR_A = 0.001    # learning rate for actor\nLR_C = 0.001    # learning rate for critic\n\nenv = gym.make('CartPole-v0')\nN_S = env.observation_space.shape[0]\nN_A = env.action_space.n\n\n\nclass ACNet(object):\n    sess = None\n\n    def __init__(self, scope, opt_a=None, opt_c=None, global_net=None):\n        if scope == 'global_net':  # get global network\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_params, self.c_params = self._build_net(scope)[-2:]\n        else:\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_his = tf.placeholder(tf.int32, [None, ], 'A')\n                self.v_target = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\n\n                self.a_prob, self.v, self.a_params, self.c_params = self._build_net(scope)\n\n                td = tf.subtract(self.v_target, self.v, name='TD_error')\n                with tf.name_scope('c_loss'):\n                    self.c_loss = tf.reduce_mean(tf.square(td))\n\n                with tf.name_scope('a_loss'):\n                    log_prob = tf.reduce_sum(\n                        tf.log(self.a_prob) * tf.one_hot(self.a_his, N_A, dtype=tf.float32),\n                        axis=1, keep_dims=True)\n                    exp_v = log_prob * tf.stop_gradient(td)\n                    entropy = -tf.reduce_sum(self.a_prob * tf.log(self.a_prob + 1e-5),\n                                         
    axis=1, keep_dims=True)  # encourage exploration\n                    self.exp_v = ENTROPY_BETA * entropy + exp_v\n                    self.a_loss = tf.reduce_mean(-self.exp_v)\n\n                with tf.name_scope('local_grad'):\n                    self.a_grads = tf.gradients(self.a_loss, self.a_params)\n                    self.c_grads = tf.gradients(self.c_loss, self.c_params)\n\n            self.global_step = tf.train.get_or_create_global_step()\n            with tf.name_scope('sync'):\n                with tf.name_scope('pull'):\n                    self.pull_a_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.a_params, global_net.a_params)]\n                    self.pull_c_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.c_params, global_net.c_params)]\n                with tf.name_scope('push'):\n                    self.update_a_op = opt_a.apply_gradients(zip(self.a_grads, global_net.a_params), global_step=self.global_step)\n                    self.update_c_op = opt_c.apply_gradients(zip(self.c_grads, global_net.c_params))\n\n    def _build_net(self, scope):\n        w_init = tf.random_normal_initializer(0., .1)\n        with tf.variable_scope('actor'):\n            l_a = tf.layers.dense(self.s, 200, tf.nn.relu6, kernel_initializer=w_init, name='la')\n            a_prob = tf.layers.dense(l_a, N_A, tf.nn.softmax, kernel_initializer=w_init, name='ap')\n        with tf.variable_scope('critic'):\n            l_c = tf.layers.dense(self.s, 100, tf.nn.relu6, kernel_initializer=w_init, name='lc')\n            v = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='v')  # state value\n        a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n        c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n        return a_prob, v, a_params, c_params\n\n    def choose_action(self, s):  # run by a local\n        prob_weights = self.sess.run(self.a_prob, feed_dict={self.s: 
s[np.newaxis, :]})\n        action = np.random.choice(range(prob_weights.shape[1]),\n                                  p=prob_weights.ravel())  # sample an action according to the action probabilities\n        return action\n\n    def update_global(self, feed_dict):  # run by a local\n        self.sess.run([self.update_a_op, self.update_c_op], feed_dict)  # apply local grads to the global net\n\n    def pull_global(self):  # run by a local\n        self.sess.run([self.pull_a_params_op, self.pull_c_params_op])\n\n\ndef work(job_name, task_index, global_ep, lock, r_queue, global_running_r):\n    # set each task's ip:port\n    cluster = tf.train.ClusterSpec({\n        \"ps\": ['localhost:2220', 'localhost:2221',],\n        \"worker\": ['localhost:2222', 'localhost:2223', 'localhost:2224', 'localhost:2225',]\n    })\n    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)\n    if job_name == 'ps':\n        print('Start Parameter Server: ', task_index)\n        server.join()\n    else:\n        t1 = time.time()\n        env = gym.make('CartPole-v0').unwrapped\n        print('Start Worker: ', task_index)\n        with tf.device(tf.train.replica_device_setter(\n                worker_device=\"/job:worker/task:%d\" % task_index,\n                cluster=cluster)):\n            opt_a = tf.train.RMSPropOptimizer(LR_A, name='opt_a')\n            opt_c = tf.train.RMSPropOptimizer(LR_C, name='opt_c')\n            global_net = ACNet('global_net')\n\n        local_net = ACNet('local_ac%d' % task_index, opt_a, opt_c, global_net)\n        # set training steps\n        hooks = [tf.train.StopAtStepHook(last_step=100000)]\n        with tf.train.MonitoredTrainingSession(master=server.target,\n                                               is_chief=True,\n                                               hooks=hooks,) as sess:\n            print('Start Worker Session: ', task_index)\n            local_net.sess = sess\n            total_step = 1\n            buffer_s, buffer_a, buffer_r = [], 
[], []\n            while (not sess.should_stop()) and (global_ep.value < 1000):\n                s = env.reset()\n                ep_r = 0\n                while True:\n                    # if task_index:\n                    #     env.render()\n                    a = local_net.choose_action(s)\n                    s_, r, done, info = env.step(a)\n                    if done: r = -5.\n                    ep_r += r\n                    buffer_s.append(s)\n                    buffer_a.append(a)\n                    buffer_r.append(r)\n\n                    if total_step % UPDATE_GLOBAL_ITER == 0 or done:  # update global and assign to local net\n                        if done:\n                            v_s_ = 0  # terminal\n                        else:\n                            v_s_ = sess.run(local_net.v, {local_net.s: s_[np.newaxis, :]})[0, 0]\n                        buffer_v_target = []\n                        for r in buffer_r[::-1]:  # reverse buffer r\n                            v_s_ = r + GAMMA * v_s_\n                            buffer_v_target.append(v_s_)\n                        buffer_v_target.reverse()\n\n                        buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.array(buffer_a), np.vstack(\n                            buffer_v_target)\n                        feed_dict = {\n                            local_net.s: buffer_s,\n                            local_net.a_his: buffer_a,\n                            local_net.v_target: buffer_v_target,\n                        }\n                        local_net.update_global(feed_dict)\n                        buffer_s, buffer_a, buffer_r = [], [], []\n                        local_net.pull_global()\n                    s = s_\n                    total_step += 1\n                    if done:\n                        if r_queue.empty():  # record running episode reward\n                            global_running_r.value = ep_r\n                        else:\n            
                global_running_r.value = .99 * global_running_r.value + 0.01 * ep_r\n                        r_queue.put(global_running_r.value)\n\n                        print(\n                            \"Task: %i\" % task_index,\n                            \"| Ep: %i\" % global_ep.value,\n                            \"| Ep_r: %i\" % global_running_r.value,\n                            \"| Global_step: %i\" % sess.run(local_net.global_step),\n                        )\n                        with lock:\n                            global_ep.value += 1\n                        break\n\n        print('Worker Done: ', task_index, time.time()-t1)\n\n\nif __name__ == \"__main__\":\n    # use multiprocessing to create a local cluster with 2 parameter servers and 4 workers\n    global_ep = mp.Value('i', 0)\n    lock = mp.Lock()\n    r_queue = mp.Queue()\n    global_running_r = mp.Value('d', 0)\n\n    jobs = [\n        ('ps', 0), ('ps', 1),\n        ('worker', 0), ('worker', 1), ('worker', 2), ('worker', 3)\n    ]\n    ps = [mp.Process(target=work, args=(j, i, global_ep, lock, r_queue, global_running_r), ) for j, i in jobs]\n    [p.start() for p in ps]\n    [p.join() for p in ps[2:]]   # wait for the workers only; the parameter servers block in server.join()\n\n    ep_r = []\n    while not r_queue.empty():\n        ep_r.append(r_queue.get())\n    plt.plot(np.arange(len(ep_r)), ep_r)\n    plt.title('Distributed training')\n    plt.xlabel('Episode')\n    plt.ylabel('Total moving reward')\n    plt.show()\n\n\n\n"
  },
  {
    "path": "contents/11_Dyna_Q/RL_brain.py",
    "content": "\"\"\"\nThis module is the Dyna-Q learning brain of the agent.\nAll decisions and learning happen here.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\nimport numpy as np\nimport pandas as pd\n\n\nclass QLearningTable:\n    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):\n        self.actions = actions  # a list\n        self.lr = learning_rate\n        self.gamma = reward_decay\n        self.epsilon = e_greedy\n\n        # use a float dtype so that taking the max over a row does not raise a type error\n        self.q_table = pd.DataFrame(columns=self.actions).astype('float32')\n\n    def choose_action(self, observation):\n        self.check_state_exist(observation)\n        # action selection\n        if np.random.uniform() < self.epsilon:\n            # choose best action\n            state_action = self.q_table.loc[observation, :]             # label indexing\n            # shuffle first so that ties between equal-valued actions are broken at random\n            state_action = state_action.reindex(np.random.permutation(state_action.index))\n            action = state_action.idxmax()   # idxmax returns the action label; argmax returns a position in recent pandas\n        else:\n            # choose random action\n            action = np.random.choice(self.actions)\n        return action\n\n    def learn(self, s, a, r, s_):\n        self.check_state_exist(s_)\n\n        q_predict = self.q_table.loc[s, a]\n        if s_ != 'terminal':\n            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal\n        else:\n            q_target = r  # next state is terminal\n        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update\n\n    def check_state_exist(self, state):\n        if state not in self.q_table.index:\n            # append new state to q table (DataFrame.append was removed in pandas 2.0)\n            self.q_table = pd.concat([\n                self.q_table,\n                pd.DataFrame(\n                    [[0.] * len(self.actions)],\n                    
columns=self.q_table.columns,\n                    index=[state],\n                    dtype='float32',\n                )\n            ])\n\n\nclass EnvModel:\n    \"\"\"Similar to the memory buffer in DQN, you can store past experiences in here.\n    Alternatively, the model can generate next state and reward signal accurately.\"\"\"\n    def __init__(self, actions):\n        # in the simplest case, the model is just a memory that holds all past transitions\n        self.actions = actions\n        self.database = pd.DataFrame(columns=actions, dtype=object)   # np.object was removed in NumPy 1.24\n\n    def store_transition(self, s, a, r, s_):\n        if s not in self.database.index:\n            self.database = pd.concat([\n                self.database,\n                pd.DataFrame(\n                    [[None] * len(self.actions)],\n                    columns=self.database.columns,\n                    index=[s],\n                    dtype=object,\n                )\n            ])\n        self.database.at[s, a] = (r, s_)   # DataFrame.set_value was removed in pandas 1.0\n\n    def sample_s_a(self):\n        s = np.random.choice(self.database.index)\n        a = np.random.choice(self.database.loc[s].dropna().index)    # filter out the None value\n        return s, a\n\n    def get_r_s_(self, s, a):\n        r, s_ = self.database.loc[s, a]\n        return r, s_\n"
  },
  {
    "path": "contents/11_Dyna_Q/maze_env.py",
    "content": "\"\"\"\nReinforcement learning maze example.\n\nRed rectangle:          explorer.\nBlack rectangles:       hells       [reward = -1].\nYellow bin circle:      paradise    [reward = +1].\nAll other states:       ground      [reward = 0].\n\nThis script is the environment part of this example. The RL is in RL_brain.py.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\n\nimport numpy as np\nnp.random.seed(1)\nimport tkinter as tk\nimport time\n\n\nUNIT = 40   # pixels\nMAZE_H = 4  # grid height\nMAZE_W = 4  # grid width\n\n\nclass Maze(tk.Tk, object):\n    def __init__(self):\n        super(Maze, self).__init__()\n        self.action_space = ['u', 'd', 'r', 'l']    # same order as the action indices handled in step()\n        self.n_actions = len(self.action_space)\n        self.title('maze')\n        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))    # Tk geometry is 'widthxheight'\n        self._build_maze()\n\n    def _build_maze(self):\n        self.canvas = tk.Canvas(self, bg='white',\n                           height=MAZE_H * UNIT,\n                           width=MAZE_W * UNIT)\n\n        # create grids\n        for c in range(0, MAZE_W * UNIT, UNIT):\n            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT\n            self.canvas.create_line(x0, y0, x1, y1)\n        for r in range(0, MAZE_H * UNIT, UNIT):\n            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r    # horizontal lines span the full width\n            self.canvas.create_line(x0, y0, x1, y1)\n\n        # create origin\n        origin = np.array([20, 20])\n\n        # hell\n        hell1_center = origin + np.array([UNIT * 2, UNIT])\n        self.hell1 = self.canvas.create_rectangle(\n            hell1_center[0] - 15, hell1_center[1] - 15,\n            hell1_center[0] + 15, hell1_center[1] + 15,\n            fill='black')\n        # hell\n        hell2_center = origin + np.array([UNIT, UNIT * 2])\n        self.hell2 = self.canvas.create_rectangle(\n            hell2_center[0] - 15, hell2_center[1] - 15,\n            hell2_center[0] + 15, hell2_center[1] + 15,\n           
 fill='black')\n\n        # create oval\n        oval_center = origin + UNIT * 2\n        self.oval = self.canvas.create_oval(\n            oval_center[0] - 15, oval_center[1] - 15,\n            oval_center[0] + 15, oval_center[1] + 15,\n            fill='yellow')\n\n        # create red rect\n        self.rect = self.canvas.create_rectangle(\n            origin[0] - 15, origin[1] - 15,\n            origin[0] + 15, origin[1] + 15,\n            fill='red')\n\n        # pack all\n        self.canvas.pack()\n\n    def reset(self):\n        self.update()\n        time.sleep(0.5)\n        self.canvas.delete(self.rect)\n        origin = np.array([20, 20])\n        self.rect = self.canvas.create_rectangle(\n            origin[0] - 15, origin[1] - 15,\n            origin[0] + 15, origin[1] + 15,\n            fill='red')\n        # return observation\n        return self.canvas.coords(self.rect)\n\n    def step(self, action):\n        s = self.canvas.coords(self.rect)\n        base_action = np.array([0, 0])\n        if action == 0:   # up\n            if s[1] > UNIT:\n                base_action[1] -= UNIT\n        elif action == 1:   # down\n            if s[1] < (MAZE_H - 1) * UNIT:\n                base_action[1] += UNIT\n        elif action == 2:   # right\n            if s[0] < (MAZE_W - 1) * UNIT:\n                base_action[0] += UNIT\n        elif action == 3:   # left\n            if s[0] > UNIT:\n                base_action[0] -= UNIT\n\n        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent\n\n        s_ = self.canvas.coords(self.rect)  # next state\n\n        # reward function\n        if s_ == self.canvas.coords(self.oval):\n            reward = 1\n            done = True\n        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:\n            reward = -1\n            done = True\n        else:\n            reward = 0\n            done = False\n\n        return s_, reward, done\n\n    def render(self):\n     
   # time.sleep(0.1)\n        self.update()\n\n\n"
  },
  {
    "path": "contents/11_Dyna_Q/run_this.py",
    "content": "\"\"\"\nSimplest model-based RL, Dyna-Q.\n\nRed rectangle:          explorer.\nBlack rectangles:       hells       [reward = -1].\nYellow bin circle:      paradise    [reward = +1].\nAll other states:       ground      [reward = 0].\n\nThis script is the main part which controls the update method of this example.\nThe RL is in RL_brain.py.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\nfrom maze_env import Maze\nfrom RL_brain import QLearningTable, EnvModel\n\n\ndef update():\n    for episode in range(40):\n        s = env.reset()\n        while True:\n            env.render()\n            a = RL.choose_action(str(s))\n            s_, r, done = env.step(a)\n            RL.learn(str(s), a, r, str(s_))\n\n            # use a model to output (r, s_) by inputting (s, a)\n            # the model in dyna Q version is just like a memory replay buffer\n            env_model.store_transition(str(s), a, r, s_)\n            for n in range(10):     # learn 10 more times using the env_model\n                ms, ma = env_model.sample_s_a()  # ms in here is a str\n                mr, ms_ = env_model.get_r_s_(ms, ma)\n                RL.learn(ms, ma, mr, str(ms_))\n\n            s = s_\n            if done:\n                break\n\n    # end of game\n    print('game over')\n    env.destroy()\n\n\nif __name__ == \"__main__\":\n    env = Maze()\n    RL = QLearningTable(actions=list(range(env.n_actions)))\n    env_model = EnvModel(actions=list(range(env.n_actions)))\n\n    env.after(0, update)\n    env.mainloop()"
  },
  {
    "path": "contents/12_Proximal_Policy_Optimization/DPPO.py",
    "content": "\"\"\"\nA simple version of OpenAI's Proximal Policy Optimization (PPO). [https://arxiv.org/abs/1707.06347]\n\nDistributing workers in parallel to collect data, then stop worker's roll-out and train PPO on collected data.\nRestart workers once PPO is updated.\n\nThe global PPO updating rule is adopted from DeepMind's paper (DPPO):\nEmergence of Locomotion Behaviours in Rich Environments (Google Deepmind): [https://arxiv.org/abs/1707.02286]\n\nView more on my tutorial website: https://morvanzhou.github.io/tutorials\n\nDependencies:\ntensorflow r1.3\ngym 0.9.2\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport gym, threading, queue\n\nEP_MAX = 1000\nEP_LEN = 200\nN_WORKER = 4                # parallel workers\nGAMMA = 0.9                 # reward discount factor\nA_LR = 0.0001               # learning rate for actor\nC_LR = 0.0002               # learning rate for critic\nMIN_BATCH_SIZE = 64         # minimum batch size for updating PPO\nUPDATE_STEP = 10            # loop update operation n-steps\nEPSILON = 0.2               # for clipping surrogate objective\nGAME = 'Pendulum-v0'\nS_DIM, A_DIM = 3, 1         # state and action dimension\n\n\nclass PPO(object):\n    def __init__(self):\n        self.sess = tf.Session()\n        self.tfs = tf.placeholder(tf.float32, [None, S_DIM], 'state')\n\n        # critic\n        l1 = tf.layers.dense(self.tfs, 100, tf.nn.relu)\n        self.v = tf.layers.dense(l1, 1)\n        self.tfdc_r = tf.placeholder(tf.float32, [None, 1], 'discounted_r')\n        self.advantage = self.tfdc_r - self.v\n        self.closs = tf.reduce_mean(tf.square(self.advantage))\n        self.ctrain_op = tf.train.AdamOptimizer(C_LR).minimize(self.closs)\n\n        # actor\n        pi, pi_params = self._build_anet('pi', trainable=True)\n        oldpi, oldpi_params = self._build_anet('oldpi', trainable=False)\n        self.sample_op = tf.squeeze(pi.sample(1), axis=0)  # operation of choosing 
action\n        self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]\n\n        self.tfa = tf.placeholder(tf.float32, [None, A_DIM], 'action')\n        self.tfadv = tf.placeholder(tf.float32, [None, 1], 'advantage')\n        # ratio = tf.exp(pi.log_prob(self.tfa) - oldpi.log_prob(self.tfa))\n        ratio = pi.prob(self.tfa) / (oldpi.prob(self.tfa) + 1e-5)\n        surr = ratio * self.tfadv                       # surrogate loss\n\n        self.aloss = -tf.reduce_mean(tf.minimum(        # clipped surrogate objective\n            surr,\n            tf.clip_by_value(ratio, 1. - EPSILON, 1. + EPSILON) * self.tfadv))\n\n        self.atrain_op = tf.train.AdamOptimizer(A_LR).minimize(self.aloss)\n        self.sess.run(tf.global_variables_initializer())\n\n    def update(self):\n        global GLOBAL_UPDATE_COUNTER\n        while not COORD.should_stop():\n            if GLOBAL_EP < EP_MAX:\n                UPDATE_EVENT.wait()                     # wait until get batch of data\n                self.sess.run(self.update_oldpi_op)     # copy pi to old pi\n                data = [QUEUE.get() for _ in range(QUEUE.qsize())]      # collect data from all workers\n                data = np.vstack(data)\n                s, a, r = data[:, :S_DIM], data[:, S_DIM: S_DIM + A_DIM], data[:, -1:]\n                adv = self.sess.run(self.advantage, {self.tfs: s, self.tfdc_r: r})\n                # update actor and critic in a update loop\n                [self.sess.run(self.atrain_op, {self.tfs: s, self.tfa: a, self.tfadv: adv}) for _ in range(UPDATE_STEP)]\n                [self.sess.run(self.ctrain_op, {self.tfs: s, self.tfdc_r: r}) for _ in range(UPDATE_STEP)]\n                UPDATE_EVENT.clear()        # updating finished\n                GLOBAL_UPDATE_COUNTER = 0   # reset counter\n                ROLLING_EVENT.set()         # set roll-out available\n\n    def _build_anet(self, name, trainable):\n        with tf.variable_scope(name):\n            l1 = 
tf.layers.dense(self.tfs, 200, tf.nn.relu, trainable=trainable)\n            mu = 2 * tf.layers.dense(l1, A_DIM, tf.nn.tanh, trainable=trainable)\n            sigma = tf.layers.dense(l1, A_DIM, tf.nn.softplus, trainable=trainable)\n            norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)\n        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)\n        return norm_dist, params\n\n    def choose_action(self, s):\n        s = s[np.newaxis, :]\n        a = self.sess.run(self.sample_op, {self.tfs: s})[0]\n        return np.clip(a, -2, 2)\n\n    def get_v(self, s):\n        if s.ndim < 2: s = s[np.newaxis, :]\n        return self.sess.run(self.v, {self.tfs: s})[0, 0]\n\n\nclass Worker(object):\n    def __init__(self, wid):\n        self.wid = wid\n        self.env = gym.make(GAME).unwrapped\n        self.ppo = GLOBAL_PPO\n\n    def work(self):\n        global GLOBAL_EP, GLOBAL_RUNNING_R, GLOBAL_UPDATE_COUNTER\n        while not COORD.should_stop():\n            s = self.env.reset()\n            ep_r = 0\n            buffer_s, buffer_a, buffer_r = [], [], []\n            for t in range(EP_LEN):\n                if not ROLLING_EVENT.is_set():                  # while global PPO is updating\n                    ROLLING_EVENT.wait()                        # wait until PPO is updated\n                    buffer_s, buffer_a, buffer_r = [], [], []   # clear history buffer, use new policy to collect data\n                a = self.ppo.choose_action(s)\n                s_, r, done, _ = self.env.step(a)\n                buffer_s.append(s)\n                buffer_a.append(a)\n                buffer_r.append((r + 8) / 8)                    # normalize reward, find to be useful\n                s = s_\n                ep_r += r\n\n                GLOBAL_UPDATE_COUNTER += 1                      # count to minimum batch size, no need to wait other workers\n                if t == EP_LEN - 1 or GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE:\n                  
  v_s_ = self.ppo.get_v(s_)\n                    discounted_r = []                           # compute discounted reward\n                    for r in buffer_r[::-1]:\n                        v_s_ = r + GAMMA * v_s_\n                        discounted_r.append(v_s_)\n                    discounted_r.reverse()\n\n                    bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis]\n                    buffer_s, buffer_a, buffer_r = [], [], []\n                    QUEUE.put(np.hstack((bs, ba, br)))          # put data in the queue\n                    if GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE:\n                        ROLLING_EVENT.clear()       # stop collecting data\n                        UPDATE_EVENT.set()          # globalPPO update\n\n                    if GLOBAL_EP >= EP_MAX:         # stop training\n                        COORD.request_stop()\n                        break\n\n            # record reward changes, plot later\n            if len(GLOBAL_RUNNING_R) == 0: GLOBAL_RUNNING_R.append(ep_r)\n            else: GLOBAL_RUNNING_R.append(GLOBAL_RUNNING_R[-1]*0.9+ep_r*0.1)\n            GLOBAL_EP += 1\n            print('{0:.1f}%'.format(GLOBAL_EP/EP_MAX*100), '|W%i' % self.wid,  '|Ep_r: %.2f' % ep_r,)\n\n\nif __name__ == '__main__':\n    GLOBAL_PPO = PPO()\n    UPDATE_EVENT, ROLLING_EVENT = threading.Event(), threading.Event()\n    UPDATE_EVENT.clear()            # not update now\n    ROLLING_EVENT.set()             # start to roll out\n    workers = [Worker(wid=i) for i in range(N_WORKER)]\n    \n    GLOBAL_UPDATE_COUNTER, GLOBAL_EP = 0, 0\n    GLOBAL_RUNNING_R = []\n    COORD = tf.train.Coordinator()\n    QUEUE = queue.Queue()           # workers putting data in this queue\n    threads = []\n    for worker in workers:          # worker threads\n        t = threading.Thread(target=worker.work, args=())\n        t.start()                   # training\n        threads.append(t)\n    # add a PPO updating thread\n   
 threads.append(threading.Thread(target=GLOBAL_PPO.update,))\n    threads[-1].start()\n    COORD.join(threads)\n\n    # plot reward change and test\n    plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R)\n    plt.xlabel('Episode'); plt.ylabel('Moving reward'); plt.ion(); plt.show()\n    env = gym.make('Pendulum-v0')\n    while True:\n        s = env.reset()\n        for t in range(300):\n            env.render()\n            s = env.step(GLOBAL_PPO.choose_action(s))[0]"
  },
  {
    "path": "contents/12_Proximal_Policy_Optimization/discrete_DPPO.py",
    "content": "\"\"\"\nA simple version of OpenAI's Proximal Policy Optimization (PPO). [https://arxiv.org/abs/1707.06347]\n\nDistributing workers in parallel to collect data, then stop worker's roll-out and train PPO on collected data.\nRestart workers once PPO is updated.\n\nThe global PPO updating rule is adopted from DeepMind's paper (DPPO):\nEmergence of Locomotion Behaviours in Rich Environments (Google Deepmind): [https://arxiv.org/abs/1707.02286]\n\nView more on my tutorial website: https://morvanzhou.github.io/tutorials\n\nDependencies:\ntensorflow 1.8.0\ngym 0.9.2\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport gym, threading, queue\n\nEP_MAX = 1000\nEP_LEN = 500\nN_WORKER = 4                # parallel workers\nGAMMA = 0.9                 # reward discount factor\nA_LR = 0.0001               # learning rate for actor\nC_LR = 0.0001               # learning rate for critic\nMIN_BATCH_SIZE = 64         # minimum batch size for updating PPO\nUPDATE_STEP = 15            # loop update operation n-steps\nEPSILON = 0.2               # for clipping surrogate objective\nGAME = 'CartPole-v0'\n\nenv = gym.make(GAME)\nS_DIM = env.observation_space.shape[0]\nA_DIM = env.action_space.n\n\n\nclass PPONet(object):\n    def __init__(self):\n        self.sess = tf.Session()\n        self.tfs = tf.placeholder(tf.float32, [None, S_DIM], 'state')\n\n        # critic\n        w_init = tf.random_normal_initializer(0., .1)\n        lc = tf.layers.dense(self.tfs, 200, tf.nn.relu, kernel_initializer=w_init, name='lc')\n        self.v = tf.layers.dense(lc, 1)\n        self.tfdc_r = tf.placeholder(tf.float32, [None, 1], 'discounted_r')\n        self.advantage = self.tfdc_r - self.v\n        self.closs = tf.reduce_mean(tf.square(self.advantage))\n        self.ctrain_op = tf.train.AdamOptimizer(C_LR).minimize(self.closs)\n\n        # actor\n        self.pi, pi_params = self._build_anet('pi', trainable=True)\n        oldpi, oldpi_params 
= self._build_anet('oldpi', trainable=False)\n        \n        self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]\n\n        self.tfa = tf.placeholder(tf.int32, [None, ], 'action')\n        self.tfadv = tf.placeholder(tf.float32, [None, 1], 'advantage')\n\n        a_indices = tf.stack([tf.range(tf.shape(self.tfa)[0], dtype=tf.int32), self.tfa], axis=1)\n        pi_prob = tf.gather_nd(params=self.pi, indices=a_indices)   # shape=(None, )\n        oldpi_prob = tf.gather_nd(params=oldpi, indices=a_indices)  # shape=(None, )\n        # expand to shape=(None, 1) so the ratio multiplies tfadv element-wise instead of broadcasting to (None, None)\n        ratio = tf.expand_dims(pi_prob / (oldpi_prob + 1e-5), axis=1)\n        surr = ratio * self.tfadv                       # surrogate loss\n\n        self.aloss = -tf.reduce_mean(tf.minimum(        # clipped surrogate objective\n            surr,\n            tf.clip_by_value(ratio, 1. - EPSILON, 1. + EPSILON) * self.tfadv))\n\n        self.atrain_op = tf.train.AdamOptimizer(A_LR).minimize(self.aloss)\n        self.sess.run(tf.global_variables_initializer())\n\n    def update(self):\n        global GLOBAL_UPDATE_COUNTER\n        while not COORD.should_stop():\n            if GLOBAL_EP < EP_MAX:\n                UPDATE_EVENT.wait()                     # wait until a batch of data is available\n                self.sess.run(self.update_oldpi_op)     # copy pi to old pi\n                data = [QUEUE.get() for _ in range(QUEUE.qsize())]      # collect data from all workers\n                data = np.vstack(data)\n                s, a, r = data[:, :S_DIM], data[:, S_DIM: S_DIM + 1].ravel(), data[:, -1:]\n                adv = self.sess.run(self.advantage, {self.tfs: s, self.tfdc_r: r})\n                # update actor and critic in an update loop\n                [self.sess.run(self.atrain_op, {self.tfs: s, self.tfa: a, self.tfadv: adv}) for _ in range(UPDATE_STEP)]\n                [self.sess.run(self.ctrain_op, {self.tfs: s, self.tfdc_r: r}) for _ in range(UPDATE_STEP)]\n                UPDATE_EVENT.clear()        # updating finished\n               
 GLOBAL_UPDATE_COUNTER = 0   # reset counter\n                ROLLING_EVENT.set()         # set roll-out available\n\n    def _build_anet(self, name, trainable):\n        with tf.variable_scope(name):\n            l_a = tf.layers.dense(self.tfs, 200, tf.nn.relu, trainable=trainable)\n            a_prob = tf.layers.dense(l_a, A_DIM, tf.nn.softmax, trainable=trainable)\n        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)\n        return a_prob, params\n\n    def choose_action(self, s):  # run by a local\n        prob_weights = self.sess.run(self.pi, feed_dict={self.tfs: s[None, :]})\n        action = np.random.choice(range(prob_weights.shape[1]),\n                                      p=prob_weights.ravel())  # select action w.r.t the actions prob\n        return action\n    \n    def get_v(self, s):\n        if s.ndim < 2: s = s[np.newaxis, :]\n        return self.sess.run(self.v, {self.tfs: s})[0, 0]\n\n\nclass Worker(object):\n    def __init__(self, wid):\n        self.wid = wid\n        self.env = gym.make(GAME).unwrapped\n        self.ppo = GLOBAL_PPO\n\n    def work(self):\n        global GLOBAL_EP, GLOBAL_RUNNING_R, GLOBAL_UPDATE_COUNTER\n        while not COORD.should_stop():\n            s = self.env.reset()\n            ep_r = 0\n            buffer_s, buffer_a, buffer_r = [], [], []\n            for t in range(EP_LEN):\n                if not ROLLING_EVENT.is_set():                  # while global PPO is updating\n                    ROLLING_EVENT.wait()                        # wait until PPO is updated\n                    buffer_s, buffer_a, buffer_r = [], [], []   # clear history buffer, use new policy to collect data\n                a = self.ppo.choose_action(s)\n                s_, r, done, _ = self.env.step(a)\n                if done: r = -10\n                buffer_s.append(s)\n                buffer_a.append(a)\n                buffer_r.append(r-1)                            # 0 for not down, -11 for down. 
Reward engineering\n                s = s_\n                ep_r += r\n\n                GLOBAL_UPDATE_COUNTER += 1                      # count to minimum batch size, no need to wait other workers\n                if t == EP_LEN - 1 or GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE or done:\n                    if done:\n                        v_s_ = 0                                # end of episode\n                    else:\n                        v_s_ = self.ppo.get_v(s_)\n                    \n                    discounted_r = []                           # compute discounted reward\n                    for r in buffer_r[::-1]:\n                        v_s_ = r + GAMMA * v_s_\n                        discounted_r.append(v_s_)\n                    discounted_r.reverse()\n\n                    bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, None]\n                    buffer_s, buffer_a, buffer_r = [], [], []\n                    QUEUE.put(np.hstack((bs, ba, br)))          # put data in the queue\n                    if GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE:\n                        ROLLING_EVENT.clear()       # stop collecting data\n                        UPDATE_EVENT.set()          # globalPPO update\n\n                    if GLOBAL_EP >= EP_MAX:         # stop training\n                        COORD.request_stop()\n                        break\n        \n                    if done: break\n\n            # record reward changes, plot later\n            if len(GLOBAL_RUNNING_R) == 0: GLOBAL_RUNNING_R.append(ep_r)\n            else: GLOBAL_RUNNING_R.append(GLOBAL_RUNNING_R[-1]*0.9+ep_r*0.1)\n            GLOBAL_EP += 1\n            print('{0:.1f}%'.format(GLOBAL_EP/EP_MAX*100), '|W%i' % self.wid,  '|Ep_r: %.2f' % ep_r,)\n\n\nif __name__ == '__main__':\n    GLOBAL_PPO = PPONet()\n    UPDATE_EVENT, ROLLING_EVENT = threading.Event(), threading.Event()\n    UPDATE_EVENT.clear()            # not update now\n    ROLLING_EVENT.set()       
      # start to roll out\n    workers = [Worker(wid=i) for i in range(N_WORKER)]\n\n    GLOBAL_UPDATE_COUNTER, GLOBAL_EP = 0, 0\n    GLOBAL_RUNNING_R = []\n    COORD = tf.train.Coordinator()\n    QUEUE = queue.Queue()           # workers putting data in this queue\n    threads = []\n    for worker in workers:          # worker threads\n        t = threading.Thread(target=worker.work, args=())\n        t.start()                   # training\n        threads.append(t)\n    # add a PPO updating thread\n    threads.append(threading.Thread(target=GLOBAL_PPO.update,))\n    threads[-1].start()\n    COORD.join(threads)\n\n    # plot reward change and test\n    plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R)\n    plt.xlabel('Episode'); plt.ylabel('Moving reward'); plt.ion(); plt.show()\n    env = gym.make('CartPole-v0')\n    while True:\n        s = env.reset()\n        for t in range(1000):\n            env.render()\n            s, r, done, info = env.step(GLOBAL_PPO.choose_action(s))\n            if done:\n                break\n\n"
  },
  {
    "path": "contents/12_Proximal_Policy_Optimization/simply_PPO.py",
    "content": "\"\"\"\nA simple version of Proximal Policy Optimization (PPO) using single thread.\n\nBased on:\n1. Emergence of Locomotion Behaviours in Rich Environments (Google Deepmind): [https://arxiv.org/abs/1707.02286]\n2. Proximal Policy Optimization Algorithms (OpenAI): [https://arxiv.org/abs/1707.06347]\n\nView more on my tutorial website: https://morvanzhou.github.io/tutorials\n\nDependencies:\ntensorflow r1.2\ngym 0.9.2\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport gym\n\nEP_MAX = 1000\nEP_LEN = 200\nGAMMA = 0.9\nA_LR = 0.0001\nC_LR = 0.0002\nBATCH = 32\nA_UPDATE_STEPS = 10\nC_UPDATE_STEPS = 10\nS_DIM, A_DIM = 3, 1\nMETHOD = [\n    dict(name='kl_pen', kl_target=0.01, lam=0.5),   # KL penalty\n    dict(name='clip', epsilon=0.2),                 # Clipped surrogate objective, find this is better\n][1]        # choose the method for optimization\n\n\nclass PPO(object):\n\n    def __init__(self):\n        self.sess = tf.Session()\n        self.tfs = tf.placeholder(tf.float32, [None, S_DIM], 'state')\n\n        # critic\n        with tf.variable_scope('critic'):\n            l1 = tf.layers.dense(self.tfs, 100, tf.nn.relu)\n            self.v = tf.layers.dense(l1, 1)\n            self.tfdc_r = tf.placeholder(tf.float32, [None, 1], 'discounted_r')\n            self.advantage = self.tfdc_r - self.v\n            self.closs = tf.reduce_mean(tf.square(self.advantage))\n            self.ctrain_op = tf.train.AdamOptimizer(C_LR).minimize(self.closs)\n\n        # actor\n        pi, pi_params = self._build_anet('pi', trainable=True)\n        oldpi, oldpi_params = self._build_anet('oldpi', trainable=False)\n        with tf.variable_scope('sample_action'):\n            self.sample_op = tf.squeeze(pi.sample(1), axis=0)       # choosing action\n        with tf.variable_scope('update_oldpi'):\n            self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]\n\n        self.tfa = 
tf.placeholder(tf.float32, [None, A_DIM], 'action')\n        self.tfadv = tf.placeholder(tf.float32, [None, 1], 'advantage')\n        with tf.variable_scope('loss'):\n            with tf.variable_scope('surrogate'):\n                # ratio = tf.exp(pi.log_prob(self.tfa) - oldpi.log_prob(self.tfa))\n                ratio = pi.prob(self.tfa) / (oldpi.prob(self.tfa) + 1e-5)\n                surr = ratio * self.tfadv\n            if METHOD['name'] == 'kl_pen':\n                self.tflam = tf.placeholder(tf.float32, None, 'lambda')\n                kl = tf.distributions.kl_divergence(oldpi, pi)\n                self.kl_mean = tf.reduce_mean(kl)\n                self.aloss = -(tf.reduce_mean(surr - self.tflam * kl))\n            else:   # clipping method, found to work better\n                self.aloss = -tf.reduce_mean(tf.minimum(\n                    surr,\n                    tf.clip_by_value(ratio, 1.-METHOD['epsilon'], 1.+METHOD['epsilon'])*self.tfadv))\n\n        with tf.variable_scope('atrain'):\n            self.atrain_op = tf.train.AdamOptimizer(A_LR).minimize(self.aloss)\n\n        tf.summary.FileWriter(\"log/\", self.sess.graph)\n\n        self.sess.run(tf.global_variables_initializer())\n\n    def update(self, s, a, r):\n        self.sess.run(self.update_oldpi_op)\n        adv = self.sess.run(self.advantage, {self.tfs: s, self.tfdc_r: r})\n        # adv = (adv - adv.mean())/(adv.std()+1e-6)     # sometimes helpful\n\n        # update actor\n        if METHOD['name'] == 'kl_pen':\n            for _ in range(A_UPDATE_STEPS):\n                _, kl = self.sess.run(\n                    [self.atrain_op, self.kl_mean],\n                    {self.tfs: s, self.tfa: a, self.tfadv: adv, self.tflam: METHOD['lam']})\n                if kl > 4*METHOD['kl_target']:  # early stop, this is in Google's paper\n                    break\n            if kl < METHOD['kl_target'] / 1.5:  # adaptive lambda, this is in OpenAI's paper\n                METHOD['lam'] /= 2\n            elif kl > 
METHOD['kl_target'] * 1.5:\n                METHOD['lam'] *= 2\n            METHOD['lam'] = np.clip(METHOD['lam'], 1e-4, 10)    # sometimes explode, this clipping is my solution\n        else:   # clipping method, find this is better (OpenAI's paper)\n            [self.sess.run(self.atrain_op, {self.tfs: s, self.tfa: a, self.tfadv: adv}) for _ in range(A_UPDATE_STEPS)]\n\n        # update critic\n        [self.sess.run(self.ctrain_op, {self.tfs: s, self.tfdc_r: r}) for _ in range(C_UPDATE_STEPS)]\n\n    def _build_anet(self, name, trainable):\n        with tf.variable_scope(name):\n            l1 = tf.layers.dense(self.tfs, 100, tf.nn.relu, trainable=trainable)\n            mu = 2 * tf.layers.dense(l1, A_DIM, tf.nn.tanh, trainable=trainable)\n            sigma = tf.layers.dense(l1, A_DIM, tf.nn.softplus, trainable=trainable)\n            norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)\n        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)\n        return norm_dist, params\n\n    def choose_action(self, s):\n        s = s[np.newaxis, :]\n        a = self.sess.run(self.sample_op, {self.tfs: s})[0]\n        return np.clip(a, -2, 2)\n\n    def get_v(self, s):\n        if s.ndim < 2: s = s[np.newaxis, :]\n        return self.sess.run(self.v, {self.tfs: s})[0, 0]\n\nenv = gym.make('Pendulum-v0').unwrapped\nppo = PPO()\nall_ep_r = []\n\nfor ep in range(EP_MAX):\n    s = env.reset()\n    buffer_s, buffer_a, buffer_r = [], [], []\n    ep_r = 0\n    for t in range(EP_LEN):    # in one episode\n        env.render()\n        a = ppo.choose_action(s)\n        s_, r, done, _ = env.step(a)\n        buffer_s.append(s)\n        buffer_a.append(a)\n        buffer_r.append((r+8)/8)    # normalize reward, find to be useful\n        s = s_\n        ep_r += r\n\n        # update ppo\n        if (t+1) % BATCH == 0 or t == EP_LEN-1:\n            v_s_ = ppo.get_v(s_)\n            discounted_r = []\n            for r in buffer_r[::-1]:\n                v_s_ 
= r + GAMMA * v_s_\n                discounted_r.append(v_s_)\n            discounted_r.reverse()\n\n            bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis]\n            buffer_s, buffer_a, buffer_r = [], [], []\n            ppo.update(bs, ba, br)\n    if ep == 0: all_ep_r.append(ep_r)\n    else: all_ep_r.append(all_ep_r[-1]*0.9 + ep_r*0.1)\n    print(\n        'Ep: %i' % ep,\n        \"|Ep_r: %i\" % ep_r,\n        (\"|Lam: %.4f\" % METHOD['lam']) if METHOD['name'] == 'kl_pen' else '',\n    )\n\nplt.plot(np.arange(len(all_ep_r)), all_ep_r)\nplt.xlabel('Episode');plt.ylabel('Moving averaged episode reward');plt.show()"
  },
  {
    "path": "contents/1_command_line_reinforcement_learning/treasure_on_right.py",
    "content": "\"\"\"\nA simple example for Reinforcement Learning using table lookup Q-learning method.\nAn agent \"o\" is on the left of a 1 dimensional world, the treasure is on the rightmost location.\nRun this program and to see how the agent will improve its strategy of finding the treasure.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\nimport numpy as np\nimport pandas as pd\nimport time\n\nnp.random.seed(2)  # reproducible\n\n\nN_STATES = 6   # the length of the 1 dimensional world\nACTIONS = ['left', 'right']     # available actions\nEPSILON = 0.9   # greedy police\nALPHA = 0.1     # learning rate\nGAMMA = 0.9    # discount factor\nMAX_EPISODES = 13   # maximum episodes\nFRESH_TIME = 0.3    # fresh time for one move\n\n\ndef build_q_table(n_states, actions):\n    table = pd.DataFrame(\n        np.zeros((n_states, len(actions))),     # q_table initial values\n        columns=actions,    # actions's name\n    )\n    # print(table)    # show table\n    return table\n\n\ndef choose_action(state, q_table):\n    # This is how to choose an action\n    state_actions = q_table.iloc[state, :]\n    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):  # act non-greedy or state-action have no value\n        action_name = np.random.choice(ACTIONS)\n    else:   # act greedy\n        action_name = state_actions.idxmax()    # replace argmax to idxmax as argmax means a different function in newer version of pandas\n    return action_name\n\n\ndef get_env_feedback(S, A):\n    # This is how agent will interact with the environment\n    if A == 'right':    # move right\n        if S == N_STATES - 2:   # terminate\n            S_ = 'terminal'\n            R = 1\n        else:\n            S_ = S + 1\n            R = 0\n    else:   # move left\n        R = 0\n        if S == 0:\n            S_ = S  # reach the wall\n        else:\n            S_ = S - 1\n    return S_, R\n\n\ndef update_env(S, episode, step_counter):\n    # 
This is how the environment is updated\n    env_list = ['-']*(N_STATES-1) + ['T']   # '-----T' our environment (for N_STATES = 6)\n    if S == 'terminal':\n        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)\n        print('\\r{}'.format(interaction), end='')\n        time.sleep(2)\n        print('\\r                                ', end='')\n    else:\n        env_list[S] = 'o'\n        interaction = ''.join(env_list)\n        print('\\r{}'.format(interaction), end='')\n        time.sleep(FRESH_TIME)\n\n\ndef rl():\n    # main part of RL loop\n    q_table = build_q_table(N_STATES, ACTIONS)\n    for episode in range(MAX_EPISODES):\n        step_counter = 0\n        S = 0\n        is_terminated = False\n        update_env(S, episode, step_counter)\n        while not is_terminated:\n\n            A = choose_action(S, q_table)\n            S_, R = get_env_feedback(S, A)  # take action & get next state and reward\n            q_predict = q_table.loc[S, A]\n            if S_ != 'terminal':\n                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal\n            else:\n                q_target = R     # next state is terminal\n                is_terminated = True    # terminate this episode\n\n            q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # update\n            S = S_  # move to next state\n\n            update_env(S, episode, step_counter+1)\n            step_counter += 1\n    return q_table\n\n\nif __name__ == \"__main__\":\n    q_table = rl()\n    print('\\r\\nQ-table:\\n')\n    print(q_table)\n"
  },
  {
    "path": "contents/2_Q_Learning_maze/RL_brain.py",
    "content": "\"\"\"\nThis part of code is the Q learning brain, which is a brain of the agent.\nAll decisions are made in here.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\nimport numpy as np\nimport pandas as pd\n\n\nclass QLearningTable:\n    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):\n        self.actions = actions  # a list\n        self.lr = learning_rate\n        self.gamma = reward_decay\n        self.epsilon = e_greedy\n        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)\n\n    def choose_action(self, observation):\n        self.check_state_exist(observation)\n        # action selection\n        if np.random.uniform() < self.epsilon:\n            # choose best action\n            state_action = self.q_table.loc[observation, :]\n            # some actions may have the same value, randomly choose on in these actions\n            action = np.random.choice(state_action[state_action == np.max(state_action)].index)\n        else:\n            # choose random action\n            action = np.random.choice(self.actions)\n        return action\n\n    def learn(self, s, a, r, s_):\n        self.check_state_exist(s_)\n        q_predict = self.q_table.loc[s, a]\n        if s_ != 'terminal':\n            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal\n        else:\n            q_target = r  # next state is terminal\n        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update\n\n    def check_state_exist(self, state):\n        if state not in self.q_table.index:\n            # append new state to q table\n            self.q_table = self.q_table.append(\n                pd.Series(\n                    [0]*len(self.actions),\n                    index=self.q_table.columns,\n                    name=state,\n                )\n            )"
  },
  {
    "path": "contents/2_Q_Learning_maze/maze_env.py",
    "content": "\"\"\"\nReinforcement learning maze example.\n\nRed rectangle:          explorer.\nBlack rectangles:       hells       [reward = -1].\nYellow bin circle:      paradise    [reward = +1].\nAll other states:       ground      [reward = 0].\n\nThis script is the environment part of this example. The RL is in RL_brain.py.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\n\nimport numpy as np\nimport time\nimport sys\nif sys.version_info.major == 2:\n    import Tkinter as tk\nelse:\n    import tkinter as tk\n\n\nUNIT = 40   # pixels\nMAZE_H = 4  # grid height\nMAZE_W = 4  # grid width\n\n\nclass Maze(tk.Tk, object):\n    def __init__(self):\n        super(Maze, self).__init__()\n        self.action_space = ['u', 'd', 'l', 'r']\n        self.n_actions = len(self.action_space)\n        self.title('maze')\n        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))\n        self._build_maze()\n\n    def _build_maze(self):\n        self.canvas = tk.Canvas(self, bg='white',\n                           height=MAZE_H * UNIT,\n                           width=MAZE_W * UNIT)\n\n        # create grids\n        for c in range(0, MAZE_W * UNIT, UNIT):\n            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT\n            self.canvas.create_line(x0, y0, x1, y1)\n        for r in range(0, MAZE_H * UNIT, UNIT):\n            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r\n            self.canvas.create_line(x0, y0, x1, y1)\n\n        # create origin\n        origin = np.array([20, 20])\n\n        # hell\n        hell1_center = origin + np.array([UNIT * 2, UNIT])\n        self.hell1 = self.canvas.create_rectangle(\n            hell1_center[0] - 15, hell1_center[1] - 15,\n            hell1_center[0] + 15, hell1_center[1] + 15,\n            fill='black')\n        # hell\n        hell2_center = origin + np.array([UNIT, UNIT * 2])\n        self.hell2 = self.canvas.create_rectangle(\n            hell2_center[0] - 15, hell2_center[1] - 15,\n     
       hell2_center[0] + 15, hell2_center[1] + 15,\n            fill='black')\n\n        # create oval\n        oval_center = origin + UNIT * 2\n        self.oval = self.canvas.create_oval(\n            oval_center[0] - 15, oval_center[1] - 15,\n            oval_center[0] + 15, oval_center[1] + 15,\n            fill='yellow')\n\n        # create red rect\n        self.rect = self.canvas.create_rectangle(\n            origin[0] - 15, origin[1] - 15,\n            origin[0] + 15, origin[1] + 15,\n            fill='red')\n\n        # pack all\n        self.canvas.pack()\n\n    def reset(self):\n        self.update()\n        time.sleep(0.5)\n        self.canvas.delete(self.rect)\n        origin = np.array([20, 20])\n        self.rect = self.canvas.create_rectangle(\n            origin[0] - 15, origin[1] - 15,\n            origin[0] + 15, origin[1] + 15,\n            fill='red')\n        # return observation\n        return self.canvas.coords(self.rect)\n\n    def step(self, action):\n        s = self.canvas.coords(self.rect)\n        base_action = np.array([0, 0])\n        if action == 0:   # up\n            if s[1] > UNIT:\n                base_action[1] -= UNIT\n        elif action == 1:   # down\n            if s[1] < (MAZE_H - 1) * UNIT:\n                base_action[1] += UNIT\n        elif action == 2:   # right\n            if s[0] < (MAZE_W - 1) * UNIT:\n                base_action[0] += UNIT\n        elif action == 3:   # left\n            if s[0] > UNIT:\n                base_action[0] -= UNIT\n\n        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent\n\n        s_ = self.canvas.coords(self.rect)  # next state\n\n        # reward function\n        if s_ == self.canvas.coords(self.oval):\n            reward = 1\n            done = True\n            s_ = 'terminal'\n        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:\n            reward = -1\n            done = True\n            s_ = 'terminal'\n        
else:\n            reward = 0\n            done = False\n\n        return s_, reward, done\n\n    def render(self):\n        time.sleep(0.1)\n        self.update()\n\n\ndef update():\n    for t in range(10):\n        s = env.reset()\n        while True:\n            env.render()\n            a = 1\n            s, r, done = env.step(a)\n            if done:\n                break\n\nif __name__ == '__main__':\n    env = Maze()\n    env.after(100, update)\n    env.mainloop()"
  },
  {
    "path": "contents/2_Q_Learning_maze/run_this.py",
    "content": "\"\"\"\nReinforcement learning maze example.\n\nRed rectangle:          explorer.\nBlack rectangles:       hells       [reward = -1].\nYellow bin circle:      paradise    [reward = +1].\nAll other states:       ground      [reward = 0].\n\nThis script is the main part which controls the update method of this example.\nThe RL is in RL_brain.py.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\nfrom maze_env import Maze\nfrom RL_brain import QLearningTable\n\n\ndef update():\n    for episode in range(100):\n        # initial observation\n        observation = env.reset()\n\n        while True:\n            # fresh env\n            env.render()\n\n            # RL choose action based on observation\n            action = RL.choose_action(str(observation))\n\n            # RL take action and get next observation and reward\n            observation_, reward, done = env.step(action)\n\n            # RL learn from this transition\n            RL.learn(str(observation), action, reward, str(observation_))\n\n            # swap observation\n            observation = observation_\n\n            # break while loop when end of this episode\n            if done:\n                break\n\n    # end of game\n    print('game over')\n    env.destroy()\n\nif __name__ == \"__main__\":\n    env = Maze()\n    RL = QLearningTable(actions=list(range(env.n_actions)))\n\n    env.after(100, update)\n    env.mainloop()"
  },
  {
    "path": "contents/3_Sarsa_maze/RL_brain.py",
    "content": "\"\"\"\nThis part of code is the Q learning brain, which is a brain of the agent.\nAll decisions are made in here.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\nimport numpy as np\nimport pandas as pd\n\n\nclass RL(object):\n    def __init__(self, action_space, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):\n        self.actions = action_space  # a list\n        self.lr = learning_rate\n        self.gamma = reward_decay\n        self.epsilon = e_greedy\n\n        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)\n\n    def check_state_exist(self, state):\n        if state not in self.q_table.index:\n            # append new state to q table\n            self.q_table = self.q_table.append(\n                pd.Series(\n                    [0]*len(self.actions),\n                    index=self.q_table.columns,\n                    name=state,\n                )\n            )\n\n    def choose_action(self, observation):\n        self.check_state_exist(observation)\n        # action selection\n        if np.random.rand() < self.epsilon:\n            # choose best action\n            state_action = self.q_table.loc[observation, :]\n            # some actions may have the same value, randomly choose on in these actions\n            action = np.random.choice(state_action[state_action == np.max(state_action)].index)\n        else:\n            # choose random action\n            action = np.random.choice(self.actions)\n        return action\n\n    def learn(self, *args):\n        pass\n\n\n# off-policy\nclass QLearningTable(RL):\n    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):\n        super(QLearningTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)\n\n    def learn(self, s, a, r, s_):\n        self.check_state_exist(s_)\n        q_predict = self.q_table.loc[s, a]\n        if s_ != 'terminal':\n            q_target = r + self.gamma * 
self.q_table.loc[s_, :].max()  # next state is not terminal\n        else:\n            q_target = r  # next state is terminal\n        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update\n\n\n# on-policy\nclass SarsaTable(RL):\n\n    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):\n        super(SarsaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)\n\n    def learn(self, s, a, r, s_, a_):\n        self.check_state_exist(s_)\n        q_predict = self.q_table.loc[s, a]\n        if s_ != 'terminal':\n            q_target = r + self.gamma * self.q_table.loc[s_, a_]  # next state is not terminal\n        else:\n            q_target = r  # next state is terminal\n        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update\n"
  },
  {
    "path": "contents/3_Sarsa_maze/maze_env.py",
    "content": "\"\"\"\nReinforcement learning maze example.\n\nRed rectangle:          explorer.\nBlack rectangles:       hells       [reward = -1].\nYellow bin circle:      paradise    [reward = +1].\nAll other states:       ground      [reward = 0].\n\nThis script is the environment part of this example.\nThe RL is in RL_brain.py.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\n\nimport numpy as np\nimport time\nimport sys\nif sys.version_info.major == 2:\n    import Tkinter as tk\nelse:\n    import tkinter as tk\n\n\nUNIT = 40   # pixels\nMAZE_H = 4  # grid height\nMAZE_W = 4  # grid width\n\n\nclass Maze(tk.Tk, object):\n    def __init__(self):\n        super(Maze, self).__init__()\n        self.action_space = ['u', 'd', 'l', 'r']\n        self.n_actions = len(self.action_space)\n        self.title('maze')\n        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))\n        self._build_maze()\n\n    def _build_maze(self):\n        self.canvas = tk.Canvas(self, bg='white',\n                           height=MAZE_H * UNIT,\n                           width=MAZE_W * UNIT)\n\n        # create grids\n        for c in range(0, MAZE_W * UNIT, UNIT):\n            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT\n            self.canvas.create_line(x0, y0, x1, y1)\n        for r in range(0, MAZE_H * UNIT, UNIT):\n            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r\n            self.canvas.create_line(x0, y0, x1, y1)\n\n        # create origin\n        origin = np.array([20, 20])\n\n        # hell\n        hell1_center = origin + np.array([UNIT * 2, UNIT])\n        self.hell1 = self.canvas.create_rectangle(\n            hell1_center[0] - 15, hell1_center[1] - 15,\n            hell1_center[0] + 15, hell1_center[1] + 15,\n            fill='black')\n        # hell\n        hell2_center = origin + np.array([UNIT, UNIT * 2])\n        self.hell2 = self.canvas.create_rectangle(\n            hell2_center[0] - 15, hell2_center[1] - 15,\n    
        hell2_center[0] + 15, hell2_center[1] + 15,\n            fill='black')\n\n        # create oval\n        oval_center = origin + UNIT * 2\n        self.oval = self.canvas.create_oval(\n            oval_center[0] - 15, oval_center[1] - 15,\n            oval_center[0] + 15, oval_center[1] + 15,\n            fill='yellow')\n\n        # create red rect\n        self.rect = self.canvas.create_rectangle(\n            origin[0] - 15, origin[1] - 15,\n            origin[0] + 15, origin[1] + 15,\n            fill='red')\n\n        # pack all\n        self.canvas.pack()\n\n    def reset(self):\n        self.update()\n        time.sleep(0.5)\n        self.canvas.delete(self.rect)\n        origin = np.array([20, 20])\n        self.rect = self.canvas.create_rectangle(\n            origin[0] - 15, origin[1] - 15,\n            origin[0] + 15, origin[1] + 15,\n            fill='red')\n        # return observation\n        return self.canvas.coords(self.rect)\n\n    def step(self, action):\n        s = self.canvas.coords(self.rect)\n        base_action = np.array([0, 0])\n        if action == 0:   # up\n            if s[1] > UNIT:\n                base_action[1] -= UNIT\n        elif action == 1:   # down\n            if s[1] < (MAZE_H - 1) * UNIT:\n                base_action[1] += UNIT\n        elif action == 2:   # right\n            if s[0] < (MAZE_W - 1) * UNIT:\n                base_action[0] += UNIT\n        elif action == 3:   # left\n            if s[0] > UNIT:\n                base_action[0] -= UNIT\n\n        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent\n\n        s_ = self.canvas.coords(self.rect)  # next state\n\n        # reward function\n        if s_ == self.canvas.coords(self.oval):\n            reward = 1\n            done = True\n            s_ = 'terminal'\n        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:\n            reward = -1\n            done = True\n            s_ = 'terminal'\n        
else:\n            reward = 0\n            done = False\n\n        return s_, reward, done\n\n    def render(self):\n        time.sleep(0.1)\n        self.update()\n\n\n"
  },
  {
    "path": "contents/3_Sarsa_maze/run_this.py",
    "content": "\"\"\"\nSarsa is a online updating method for Reinforcement learning.\n\nUnlike Q learning which is a offline updating method, Sarsa is updating while in the current trajectory.\n\nYou will see the sarsa is more coward when punishment is close because it cares about all behaviours,\nwhile q learning is more brave because it only cares about maximum behaviour.\n\"\"\"\n\nfrom maze_env import Maze\nfrom RL_brain import SarsaTable\n\n\ndef update():\n    for episode in range(100):\n        # initial observation\n        observation = env.reset()\n\n        # RL choose action based on observation\n        action = RL.choose_action(str(observation))\n\n        while True:\n            # fresh env\n            env.render()\n\n            # RL take action and get next observation and reward\n            observation_, reward, done = env.step(action)\n\n            # RL choose action based on next observation\n            action_ = RL.choose_action(str(observation_))\n\n            # RL learn from this transition (s, a, r, s, a) ==> Sarsa\n            RL.learn(str(observation), action, reward, str(observation_), action_)\n\n            # swap observation and action\n            observation = observation_\n            action = action_\n\n            # break while loop when end of this episode\n            if done:\n                break\n\n    # end of game\n    print('game over')\n    env.destroy()\n\nif __name__ == \"__main__\":\n    env = Maze()\n    RL = SarsaTable(actions=list(range(env.n_actions)))\n\n    env.after(100, update)\n    env.mainloop()"
  },
  {
    "path": "contents/4_Sarsa_lambda_maze/RL_brain.py",
    "content": "\"\"\"\nThis part of code is the Q learning brain, which is a brain of the agent.\nAll decisions are made in here.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\nimport numpy as np\nimport pandas as pd\n\n\nclass RL(object):\n    def __init__(self, action_space, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):\n        self.actions = action_space  # a list\n        self.lr = learning_rate\n        self.gamma = reward_decay\n        self.epsilon = e_greedy\n\n        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)\n\n    def check_state_exist(self, state):\n        if state not in self.q_table.index:\n            # append new state to q table\n            self.q_table = self.q_table.append(\n                pd.Series(\n                    [0]*len(self.actions),\n                    index=self.q_table.columns,\n                    name=state,\n                )\n            )\n\n    def choose_action(self, observation):\n        self.check_state_exist(observation)\n        # action selection\n        if np.random.rand() < self.epsilon:\n            # choose best action\n            state_action = self.q_table.loc[observation, :]\n            # some actions may have the same value, randomly choose on in these actions\n            action = np.random.choice(state_action[state_action == np.max(state_action)].index)\n        else:\n            # choose random action\n            action = np.random.choice(self.actions)\n        return action\n\n    def learn(self, *args):\n        pass\n\n\n# backward eligibility traces\nclass SarsaLambdaTable(RL):\n    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):\n        super(SarsaLambdaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)\n\n        # backward view, eligibility trace.\n        self.lambda_ = trace_decay\n        self.eligibility_trace = self.q_table.copy()\n\n    def 
check_state_exist(self, state):\n        if state not in self.q_table.index:\n            # add the new state as an all-zero row\n            # (DataFrame.append was removed in pandas 2.0, so enlarge via .loc instead)\n            new_row = [0.0] * len(self.actions)\n            self.q_table.loc[state] = new_row\n\n            # keep the eligibility trace aligned with the q table\n            self.eligibility_trace.loc[state] = new_row\n\n    def learn(self, s, a, r, s_, a_):\n        self.check_state_exist(s_)\n        q_predict = self.q_table.loc[s, a]\n        if s_ != 'terminal':\n            q_target = r + self.gamma * self.q_table.loc[s_, a_]  # next state is not terminal\n        else:\n            q_target = r  # next state is terminal\n        error = q_target - q_predict\n\n        # increase trace amount for visited state-action pair\n\n        # Method 1 (accumulating trace):\n        # self.eligibility_trace.loc[s, a] += 1\n\n        # Method 2 (replacing trace):\n        self.eligibility_trace.loc[s, :] *= 0\n        self.eligibility_trace.loc[s, a] = 1\n\n        # Q update\n        self.q_table += self.lr * error * self.eligibility_trace\n\n        # decay eligibility trace after update\n        self.eligibility_trace *= self.gamma*self.lambda_\n"
  },
  {
    "path": "contents/4_Sarsa_lambda_maze/maze_env.py",
    "content": "\"\"\"\nReinforcement learning maze example.\n\nRed rectangle:          explorer.\nBlack rectangles:       hells       [reward = -1].\nYellow bin circle:      paradise    [reward = +1].\nAll other states:       ground      [reward = 0].\n\nThis script is the environment part of this example.\nThe RL is in RL_brain.py.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\n\n\nimport numpy as np\nimport time\nimport sys\nif sys.version_info.major == 2:\n    import Tkinter as tk\nelse:\n    import tkinter as tk\n\n\nUNIT = 40   # pixels\nMAZE_H = 4  # grid height\nMAZE_W = 4  # grid width\n\n\nclass Maze(tk.Tk, object):\n    def __init__(self):\n        super(Maze, self).__init__()\n        self.action_space = ['u', 'd', 'l', 'r']\n        self.n_actions = len(self.action_space)\n        self.title('maze')\n        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))\n        self._build_maze()\n\n    def _build_maze(self):\n        self.canvas = tk.Canvas(self, bg='white',\n                           height=MAZE_H * UNIT,\n                           width=MAZE_W * UNIT)\n\n        # create grids\n        for c in range(0, MAZE_W * UNIT, UNIT):\n            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT\n            self.canvas.create_line(x0, y0, x1, y1)\n        for r in range(0, MAZE_H * UNIT, UNIT):\n            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r\n            self.canvas.create_line(x0, y0, x1, y1)\n\n        # create origin\n        origin = np.array([20, 20])\n\n        # hell\n        hell1_center = origin + np.array([UNIT * 2, UNIT])\n        self.hell1 = self.canvas.create_rectangle(\n            hell1_center[0] - 15, hell1_center[1] - 15,\n            hell1_center[0] + 15, hell1_center[1] + 15,\n            fill='black')\n        # hell\n        hell2_center = origin + np.array([UNIT, UNIT * 2])\n        self.hell2 = self.canvas.create_rectangle(\n            hell2_center[0] - 15, hell2_center[1] - 15,\n    
        hell2_center[0] + 15, hell2_center[1] + 15,\n            fill='black')\n\n        # create oval\n        oval_center = origin + UNIT * 2\n        self.oval = self.canvas.create_oval(\n            oval_center[0] - 15, oval_center[1] - 15,\n            oval_center[0] + 15, oval_center[1] + 15,\n            fill='yellow')\n\n        # create red rect\n        self.rect = self.canvas.create_rectangle(\n            origin[0] - 15, origin[1] - 15,\n            origin[0] + 15, origin[1] + 15,\n            fill='red')\n\n        # pack all\n        self.canvas.pack()\n\n    def reset(self):\n        self.update()\n        time.sleep(0.5)\n        self.canvas.delete(self.rect)\n        origin = np.array([20, 20])\n        self.rect = self.canvas.create_rectangle(\n            origin[0] - 15, origin[1] - 15,\n            origin[0] + 15, origin[1] + 15,\n            fill='red')\n        # return observation\n        return self.canvas.coords(self.rect)\n\n    def step(self, action):\n        s = self.canvas.coords(self.rect)\n        base_action = np.array([0, 0])\n        if action == 0:   # up\n            if s[1] > UNIT:\n                base_action[1] -= UNIT\n        elif action == 1:   # down\n            if s[1] < (MAZE_H - 1) * UNIT:\n                base_action[1] += UNIT\n        elif action == 2:   # right\n            if s[0] < (MAZE_W - 1) * UNIT:\n                base_action[0] += UNIT\n        elif action == 3:   # left\n            if s[0] > UNIT:\n                base_action[0] -= UNIT\n\n        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent\n\n        s_ = self.canvas.coords(self.rect)  # next state\n\n        # reward function\n        if s_ == self.canvas.coords(self.oval):\n            reward = 1\n            done = True\n            s_ = 'terminal'\n        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:\n            reward = -1\n            done = True\n            s_ = 'terminal'\n        
else:\n            reward = 0\n            done = False\n\n        return s_, reward, done\n\n    def render(self):\n        time.sleep(0.05)\n        self.update()\n\n\n"
  },
  {
    "path": "contents/4_Sarsa_lambda_maze/run_this.py",
    "content": "\"\"\"\nSarsa is a online updating method for Reinforcement learning.\n\nUnlike Q learning which is a offline updating method, Sarsa is updating while in the current trajectory.\n\nYou will see the sarsa is more coward when punishment is close because it cares about all behaviours,\nwhile q learning is more brave because it only cares about maximum behaviour.\n\"\"\"\n\nfrom maze_env import Maze\nfrom RL_brain import SarsaLambdaTable\n\n\ndef update():\n    for episode in range(100):\n        # initial observation\n        observation = env.reset()\n\n        # RL choose action based on observation\n        action = RL.choose_action(str(observation))\n\n        # initial all zero eligibility trace\n        RL.eligibility_trace *= 0\n\n        while True:\n            # fresh env\n            env.render()\n\n            # RL take action and get next observation and reward\n            observation_, reward, done = env.step(action)\n\n            # RL choose action based on next observation\n            action_ = RL.choose_action(str(observation_))\n\n            # RL learn from this transition (s, a, r, s, a) ==> Sarsa\n            RL.learn(str(observation), action, reward, str(observation_), action_)\n\n            # swap observation and action\n            observation = observation_\n            action = action_\n\n            # break while loop when end of this episode\n            if done:\n                break\n\n    # end of game\n    print('game over')\n    env.destroy()\n\nif __name__ == \"__main__\":\n    env = Maze()\n    RL = SarsaLambdaTable(actions=list(range(env.n_actions)))\n\n    env.after(100, update)\n    env.mainloop()"
  },
  {
    "path": "contents/5.1_Double_DQN/RL_brain.py",
    "content": "\"\"\"\nThe double DQN based on this paper: https://arxiv.org/abs/1509.06461\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\n\nnp.random.seed(1)\ntf.set_random_seed(1)\n\n\nclass DoubleDQN:\n    def __init__(\n            self,\n            n_actions,\n            n_features,\n            learning_rate=0.005,\n            reward_decay=0.9,\n            e_greedy=0.9,\n            replace_target_iter=200,\n            memory_size=3000,\n            batch_size=32,\n            e_greedy_increment=None,\n            output_graph=False,\n            double_q=True,\n            sess=None,\n    ):\n        self.n_actions = n_actions\n        self.n_features = n_features\n        self.lr = learning_rate\n        self.gamma = reward_decay\n        self.epsilon_max = e_greedy\n        self.replace_target_iter = replace_target_iter\n        self.memory_size = memory_size\n        self.batch_size = batch_size\n        self.epsilon_increment = e_greedy_increment\n        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max\n\n        self.double_q = double_q    # decide to use double q or not\n\n        self.learn_step_counter = 0\n        self.memory = np.zeros((self.memory_size, n_features*2+2))\n        self._build_net()\n        t_params = tf.get_collection('target_net_params')\n        e_params = tf.get_collection('eval_net_params')\n        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]\n\n        if sess is None:\n            self.sess = tf.Session()\n            self.sess.run(tf.global_variables_initializer())\n        else:\n            self.sess = sess\n        if output_graph:\n            tf.summary.FileWriter(\"logs/\", self.sess.graph)\n        self.cost_his = []\n\n    def _build_net(self):\n        def build_layers(s, c_names, n_l1, w_initializer, b_initializer):\n            
with tf.variable_scope('l1'):\n                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)\n                l1 = tf.nn.relu(tf.matmul(s, w1) + b1)\n\n            with tf.variable_scope('l2'):\n                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)\n                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)\n                out = tf.matmul(l1, w2) + b2\n            return out\n        # ------------------ build evaluate_net ------------------\n        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input\n        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss\n\n        with tf.variable_scope('eval_net'):\n            c_names, n_l1, w_initializer, b_initializer = \\\n                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 20, \\\n                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers\n\n            self.q_eval = build_layers(self.s, c_names, n_l1, w_initializer, b_initializer)\n\n        with tf.variable_scope('loss'):\n            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))\n        with tf.variable_scope('train'):\n            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)\n\n        # ------------------ build target_net ------------------\n        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')    # input\n        with tf.variable_scope('target_net'):\n            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]\n\n            self.q_next = build_layers(self.s_, c_names, n_l1, w_initializer, b_initializer)\n\n    def store_transition(self, s, a, r, s_):\n 
       if not hasattr(self, 'memory_counter'):\n            self.memory_counter = 0\n        transition = np.hstack((s, [a, r], s_))\n        index = self.memory_counter % self.memory_size\n        self.memory[index, :] = transition\n        self.memory_counter += 1\n\n    def choose_action(self, observation):\n        observation = observation[np.newaxis, :]\n        actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})\n        action = np.argmax(actions_value)\n\n        if not hasattr(self, 'q'):  # record action value it gets\n            self.q = []\n            self.running_q = 0\n        self.running_q = self.running_q*0.99 + 0.01 * np.max(actions_value)\n        self.q.append(self.running_q)\n\n        if np.random.uniform() > self.epsilon:  # choosing action\n            action = np.random.randint(0, self.n_actions)\n        return action\n\n    def learn(self):\n        if self.learn_step_counter % self.replace_target_iter == 0:\n            self.sess.run(self.replace_target_op)\n            print('\\ntarget_params_replaced\\n')\n\n        if self.memory_counter > self.memory_size:\n            sample_index = np.random.choice(self.memory_size, size=self.batch_size)\n        else:\n            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)\n        batch_memory = self.memory[sample_index, :]\n\n        q_next, q_eval4next = self.sess.run(\n            [self.q_next, self.q_eval],\n            feed_dict={self.s_: batch_memory[:, -self.n_features:],    # next observation\n                       self.s: batch_memory[:, -self.n_features:]})    # next observation\n        q_eval = self.sess.run(self.q_eval, {self.s: batch_memory[:, :self.n_features]})\n\n        q_target = q_eval.copy()\n\n        batch_index = np.arange(self.batch_size, dtype=np.int32)\n        eval_act_index = batch_memory[:, self.n_features].astype(int)\n        reward = batch_memory[:, self.n_features + 1]\n\n        if self.double_q:\n      
      max_act4next = np.argmax(q_eval4next, axis=1)        # the action that brings the highest value is evaluated by q_eval\n            selected_q_next = q_next[batch_index, max_act4next]  # Double DQN, select q_next depending on above actions\n        else:\n            selected_q_next = np.max(q_next, axis=1)    # the natural DQN\n\n        q_target[batch_index, eval_act_index] = reward + self.gamma * selected_q_next\n\n        _, self.cost = self.sess.run([self._train_op, self.loss],\n                                     feed_dict={self.s: batch_memory[:, :self.n_features],\n                                                self.q_target: q_target})\n        self.cost_his.append(self.cost)\n\n        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max\n        self.learn_step_counter += 1\n\n\n\n\n"
  },
  {
    "path": "contents/5.1_Double_DQN/run_Pendulum.py",
    "content": "\"\"\"\nDouble DQN & Natural DQN comparison,\nThe Pendulum example.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\n\nimport gym\nfrom RL_brain import DoubleDQN\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\n\n\nenv = gym.make('Pendulum-v0')\nenv = env.unwrapped\nenv.seed(1)\nMEMORY_SIZE = 3000\nACTION_SPACE = 11\n\nsess = tf.Session()\nwith tf.variable_scope('Natural_DQN'):\n    natural_DQN = DoubleDQN(\n        n_actions=ACTION_SPACE, n_features=3, memory_size=MEMORY_SIZE,\n        e_greedy_increment=0.001, double_q=False, sess=sess\n    )\n\nwith tf.variable_scope('Double_DQN'):\n    double_DQN = DoubleDQN(\n        n_actions=ACTION_SPACE, n_features=3, memory_size=MEMORY_SIZE,\n        e_greedy_increment=0.001, double_q=True, sess=sess, output_graph=True)\n\nsess.run(tf.global_variables_initializer())\n\n\ndef train(RL):\n    total_steps = 0\n    observation = env.reset()\n    while True:\n        # if total_steps - MEMORY_SIZE > 8000: env.render()\n\n        action = RL.choose_action(observation)\n\n        f_action = (action-(ACTION_SPACE-1)/2)/((ACTION_SPACE-1)/4)   # convert to [-2 ~ 2] float actions\n        observation_, reward, done, info = env.step(np.array([f_action]))\n\n        reward /= 10     # normalize to a range of (-1, 0). r = 0 when get upright\n        # the Q target at upright state will be 0, because Q_target = r + gamma * Qmax(s', a') = 0 + gamma * 0\n        # so when Q at this state is greater than 0, the agent overestimates the Q. 
Please refer to the final result.\n\n        RL.store_transition(observation, action, reward, observation_)\n\n        if total_steps > MEMORY_SIZE:   # learning\n            RL.learn()\n\n        if total_steps - MEMORY_SIZE > 20000:   # stop game\n            break\n\n        observation = observation_\n        total_steps += 1\n    return RL.q\n\nq_natural = train(natural_DQN)\nq_double = train(double_DQN)\n\nplt.plot(np.array(q_natural), c='r', label='natural')\nplt.plot(np.array(q_double), c='b', label='double')\nplt.legend(loc='best')\nplt.ylabel('Q eval')\nplt.xlabel('training steps')\nplt.grid()\nplt.show()\n"
  },
  {
    "path": "contents/5.2_Prioritized_Replay_DQN/RL_brain.py",
    "content": "\"\"\"\nThe DQN improvement: Prioritized Experience Replay (based on https://arxiv.org/abs/1511.05952)\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\n\nnp.random.seed(1)\ntf.set_random_seed(1)\n\n\nclass SumTree(object):\n    \"\"\"\n    This SumTree code is a modified version and the original code is from:\n    https://github.com/jaara/AI-blog/blob/master/SumTree.py\n\n    Story data with its priority in the tree.\n    \"\"\"\n    data_pointer = 0\n\n    def __init__(self, capacity):\n        self.capacity = capacity  # for all priority values\n        self.tree = np.zeros(2 * capacity - 1)\n        # [--------------Parent nodes-------------][-------leaves to recode priority-------]\n        #             size: capacity - 1                       size: capacity\n        self.data = np.zeros(capacity, dtype=object)  # for all transitions\n        # [--------------data frame-------------]\n        #             size: capacity\n\n    def add(self, p, data):\n        tree_idx = self.data_pointer + self.capacity - 1\n        self.data[self.data_pointer] = data  # update data_frame\n        self.update(tree_idx, p)  # update tree_frame\n\n        self.data_pointer += 1\n        if self.data_pointer >= self.capacity:  # replace when exceed the capacity\n            self.data_pointer = 0\n\n    def update(self, tree_idx, p):\n        change = p - self.tree[tree_idx]\n        self.tree[tree_idx] = p\n        # then propagate the change through tree\n        while tree_idx != 0:    # this method is faster than the recursive loop in the reference code\n            tree_idx = (tree_idx - 1) // 2\n            self.tree[tree_idx] += change\n\n    def get_leaf(self, v):\n        \"\"\"\n        Tree structure and array storage:\n\n        Tree index:\n             0         -> storing priority sum\n            / \\\n          1     2\n         
/ \\   / \\\n        3   4 5   6    -> storing priority for transitions\n\n        Array type for storing:\n        [0,1,2,3,4,5,6]\n        \"\"\"\n        parent_idx = 0\n        while True:     # the while loop is faster than the method in the reference code\n            cl_idx = 2 * parent_idx + 1         # this leaf's left and right kids\n            cr_idx = cl_idx + 1\n            if cl_idx >= len(self.tree):        # reach bottom, end search\n                leaf_idx = parent_idx\n                break\n            else:       # downward search, always search for a higher priority node\n                if v <= self.tree[cl_idx]:\n                    parent_idx = cl_idx\n                else:\n                    v -= self.tree[cl_idx]\n                    parent_idx = cr_idx\n\n        data_idx = leaf_idx - self.capacity + 1\n        return leaf_idx, self.tree[leaf_idx], self.data[data_idx]\n\n    @property\n    def total_p(self):\n        return self.tree[0]  # the root\n\n\nclass Memory(object):  # stored as ( s, a, r, s_ ) in SumTree\n    \"\"\"\n    This Memory class is modified based on the original code from:\n    https://github.com/jaara/AI-blog/blob/master/Seaquest-DDQN-PER.py\n    \"\"\"\n    epsilon = 0.01  # small amount to avoid zero priority\n    alpha = 0.6  # [0~1] convert the importance of TD error to priority\n    beta = 0.4  # importance-sampling, from initial value increasing to 1\n    beta_increment_per_sampling = 0.001\n    abs_err_upper = 1.  
# clipped abs error\n\n    def __init__(self, capacity):\n        self.tree = SumTree(capacity)\n\n    def store(self, transition):\n        max_p = np.max(self.tree.tree[-self.tree.capacity:])\n        if max_p == 0:\n            max_p = self.abs_err_upper\n        self.tree.add(max_p, transition)   # set the max p for new p\n\n    def sample(self, n):\n        b_idx, b_memory, ISWeights = np.empty((n,), dtype=np.int32), np.empty((n, self.tree.data[0].size)), np.empty((n, 1))\n        pri_seg = self.tree.total_p / n       # priority segment\n        self.beta = np.min([1., self.beta + self.beta_increment_per_sampling])  # max = 1\n\n        min_prob = np.min(self.tree.tree[-self.tree.capacity:]) / self.tree.total_p     # for later calculate ISweight\n        for i in range(n):\n            a, b = pri_seg * i, pri_seg * (i + 1)\n            v = np.random.uniform(a, b)\n            idx, p, data = self.tree.get_leaf(v)\n            prob = p / self.tree.total_p\n            ISWeights[i, 0] = np.power(prob/min_prob, -self.beta)\n            b_idx[i], b_memory[i, :] = idx, data\n        return b_idx, b_memory, ISWeights\n\n    def batch_update(self, tree_idx, abs_errors):\n        abs_errors += self.epsilon  # convert to abs and avoid 0\n        clipped_errors = np.minimum(abs_errors, self.abs_err_upper)\n        ps = np.power(clipped_errors, self.alpha)\n        for ti, p in zip(tree_idx, ps):\n            self.tree.update(ti, p)\n\n\nclass DQNPrioritizedReplay:\n    def __init__(\n            self,\n            n_actions,\n            n_features,\n            learning_rate=0.005,\n            reward_decay=0.9,\n            e_greedy=0.9,\n            replace_target_iter=500,\n            memory_size=10000,\n            batch_size=32,\n            e_greedy_increment=None,\n            output_graph=False,\n            prioritized=True,\n            sess=None,\n    ):\n        self.n_actions = n_actions\n        self.n_features = n_features\n        self.lr = 
learning_rate\n        self.gamma = reward_decay\n        self.epsilon_max = e_greedy\n        self.replace_target_iter = replace_target_iter\n        self.memory_size = memory_size\n        self.batch_size = batch_size\n        self.epsilon_increment = e_greedy_increment\n        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max\n\n        self.prioritized = prioritized    # whether to use prioritized replay\n\n        self.learn_step_counter = 0\n\n        self._build_net()\n        t_params = tf.get_collection('target_net_params')\n        e_params = tf.get_collection('eval_net_params')\n        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]\n\n        if self.prioritized:\n            self.memory = Memory(capacity=memory_size)\n        else:\n            self.memory = np.zeros((self.memory_size, n_features*2+2))\n\n        if sess is None:\n            self.sess = tf.Session()\n            self.sess.run(tf.global_variables_initializer())\n        else:\n            self.sess = sess\n\n        if output_graph:\n            tf.summary.FileWriter(\"logs/\", self.sess.graph)\n\n        self.cost_his = []\n\n    def _build_net(self):\n        def build_layers(s, c_names, n_l1, w_initializer, b_initializer, trainable):\n            with tf.variable_scope('l1'):\n                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names, trainable=trainable)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names, trainable=trainable)\n                l1 = tf.nn.relu(tf.matmul(s, w1) + b1)\n\n            with tf.variable_scope('l2'):\n                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names, trainable=trainable)\n                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names, trainable=trainable)\n                out = 
tf.matmul(l1, w2) + b2\n            return out\n\n        # ------------------ build evaluate_net ------------------\n        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input\n        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss\n        if self.prioritized:\n            self.ISWeights = tf.placeholder(tf.float32, [None, 1], name='IS_weights')\n        with tf.variable_scope('eval_net'):\n            c_names, n_l1, w_initializer, b_initializer = \\\n                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 20, \\\n                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers\n\n            self.q_eval = build_layers(self.s, c_names, n_l1, w_initializer, b_initializer, True)\n\n        with tf.variable_scope('loss'):\n            if self.prioritized:\n                self.abs_errors = tf.reduce_sum(tf.abs(self.q_target - self.q_eval), axis=1)    # for updating Sumtree\n                self.loss = tf.reduce_mean(self.ISWeights * tf.squared_difference(self.q_target, self.q_eval))\n            else:\n                self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))\n        with tf.variable_scope('train'):\n            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)\n\n        # ------------------ build target_net ------------------\n        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')    # input\n        with tf.variable_scope('target_net'):\n            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]\n            self.q_next = build_layers(self.s_, c_names, n_l1, w_initializer, b_initializer, False)\n\n    def store_transition(self, s, a, r, s_):\n        if self.prioritized:    # prioritized replay\n            transition = np.hstack((s, [a, r], s_))\n            self.memory.store(transition)    # have high priority for newly 
arrived transition\n        else:       # random replay\n            if not hasattr(self, 'memory_counter'):\n                self.memory_counter = 0\n            transition = np.hstack((s, [a, r], s_))\n            index = self.memory_counter % self.memory_size\n            self.memory[index, :] = transition\n            self.memory_counter += 1\n\n    def choose_action(self, observation):\n        observation = observation[np.newaxis, :]\n        if np.random.uniform() < self.epsilon:\n            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})\n            action = np.argmax(actions_value)\n        else:\n            action = np.random.randint(0, self.n_actions)\n        return action\n\n    def learn(self):\n        if self.learn_step_counter % self.replace_target_iter == 0:\n            self.sess.run(self.replace_target_op)\n            print('\\ntarget_params_replaced\\n')\n\n        if self.prioritized:\n            tree_idx, batch_memory, ISWeights = self.memory.sample(self.batch_size)\n        else:\n            sample_index = np.random.choice(self.memory_size, size=self.batch_size)\n            batch_memory = self.memory[sample_index, :]\n\n        q_next, q_eval = self.sess.run(\n                [self.q_next, self.q_eval],\n                feed_dict={self.s_: batch_memory[:, -self.n_features:],\n                           self.s: batch_memory[:, :self.n_features]})\n\n        q_target = q_eval.copy()\n        batch_index = np.arange(self.batch_size, dtype=np.int32)\n        eval_act_index = batch_memory[:, self.n_features].astype(int)\n        reward = batch_memory[:, self.n_features + 1]\n\n        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)\n\n        if self.prioritized:\n            _, abs_errors, self.cost = self.sess.run([self._train_op, self.abs_errors, self.loss],\n                                         feed_dict={self.s: batch_memory[:, :self.n_features],\n                    
                                self.q_target: q_target,\n                                                    self.ISWeights: ISWeights})\n            self.memory.batch_update(tree_idx, abs_errors)     # update priority\n        else:\n            _, self.cost = self.sess.run([self._train_op, self.loss],\n                                         feed_dict={self.s: batch_memory[:, :self.n_features],\n                                                    self.q_target: q_target})\n\n        self.cost_his.append(self.cost)\n\n        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max\n        self.learn_step_counter += 1\n"
  },
  {
    "path": "contents/5.2_Prioritized_Replay_DQN/run_MountainCar.py",
    "content": "\"\"\"\nThe DQN improvement: Prioritized Experience Replay (based on https://arxiv.org/abs/1511.05952)\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\n\nimport gym\nfrom RL_brain import DQNPrioritizedReplay\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\nimport numpy as np\n\nenv = gym.make('MountainCar-v0')\nenv = env.unwrapped\nenv.seed(21)\nMEMORY_SIZE = 10000\n\nsess = tf.Session()\nwith tf.variable_scope('natural_DQN'):\n    RL_natural = DQNPrioritizedReplay(\n        n_actions=3, n_features=2, memory_size=MEMORY_SIZE,\n        e_greedy_increment=0.00005, sess=sess, prioritized=False,\n    )\n\nwith tf.variable_scope('DQN_with_prioritized_replay'):\n    RL_prio = DQNPrioritizedReplay(\n        n_actions=3, n_features=2, memory_size=MEMORY_SIZE,\n        e_greedy_increment=0.00005, sess=sess, prioritized=True, output_graph=True,\n    )\nsess.run(tf.global_variables_initializer())\n\n\ndef train(RL):\n    total_steps = 0\n    steps = []\n    episodes = []\n    for i_episode in range(20):\n        observation = env.reset()\n        while True:\n            # env.render()\n\n            action = RL.choose_action(observation)\n\n            observation_, reward, done, info = env.step(action)\n\n            if done: reward = 10\n\n            RL.store_transition(observation, action, reward, observation_)\n\n            if total_steps > MEMORY_SIZE:\n                RL.learn()\n\n            if done:\n                print('episode ', i_episode, ' finished')\n                steps.append(total_steps)\n                episodes.append(i_episode)\n                break\n\n            observation = observation_\n            total_steps += 1\n    return np.vstack((episodes, steps))\n\nhis_natural = train(RL_natural)\nhis_prio = train(RL_prio)\n\n# compare based on first success\nplt.plot(his_natural[0, :], his_natural[1, :] - his_natural[1, 0], c='b', label='natural 
DQN')\nplt.plot(his_prio[0, :], his_prio[1, :] - his_prio[1, 0], c='r', label='DQN with prioritized replay')\nplt.legend(loc='best')\nplt.ylabel('total training time')\nplt.xlabel('episode')\nplt.grid()\nplt.show()\n\n\n"
  },
  {
    "path": "contents/5.3_Dueling_DQN/RL_brain.py",
    "content": "\"\"\"\nThe Dueling DQN based on this paper: https://arxiv.org/abs/1511.06581\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\n\nnp.random.seed(1)\ntf.set_random_seed(1)\n\n\nclass DuelingDQN:\n    def __init__(\n            self,\n            n_actions,\n            n_features,\n            learning_rate=0.001,\n            reward_decay=0.9,\n            e_greedy=0.9,\n            replace_target_iter=200,\n            memory_size=500,\n            batch_size=32,\n            e_greedy_increment=None,\n            output_graph=False,\n            dueling=True,\n            sess=None,\n    ):\n        self.n_actions = n_actions\n        self.n_features = n_features\n        self.lr = learning_rate\n        self.gamma = reward_decay\n        self.epsilon_max = e_greedy\n        self.replace_target_iter = replace_target_iter\n        self.memory_size = memory_size\n        self.batch_size = batch_size\n        self.epsilon_increment = e_greedy_increment\n        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max\n\n        self.dueling = dueling      # decide to use dueling DQN or not\n\n        self.learn_step_counter = 0\n        self.memory = np.zeros((self.memory_size, n_features*2+2))\n        self._build_net()\n        t_params = tf.get_collection('target_net_params')\n        e_params = tf.get_collection('eval_net_params')\n        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]\n\n        if sess is None:\n            self.sess = tf.Session()\n            self.sess.run(tf.global_variables_initializer())\n        else:\n            self.sess = sess\n        if output_graph:\n            tf.summary.FileWriter(\"logs/\", self.sess.graph)\n        self.cost_his = []\n\n    def _build_net(self):\n        def build_layers(s, c_names, n_l1, w_initializer, b_initializer):\n           
 with tf.variable_scope('l1'):\n                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)\n                l1 = tf.nn.relu(tf.matmul(s, w1) + b1)\n\n            if self.dueling:\n                # Dueling DQN: split into a state-value stream V(s) and an advantage stream A(s,a)\n                with tf.variable_scope('Value'):\n                    w2 = tf.get_variable('w2', [n_l1, 1], initializer=w_initializer, collections=c_names)\n                    b2 = tf.get_variable('b2', [1, 1], initializer=b_initializer, collections=c_names)\n                    self.V = tf.matmul(l1, w2) + b2\n\n                with tf.variable_scope('Advantage'):\n                    w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)\n                    b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)\n                    self.A = tf.matmul(l1, w2) + b2\n\n                with tf.variable_scope('Q'):\n                    out = self.V + (self.A - tf.reduce_mean(self.A, axis=1, keep_dims=True))     # Q(s,a) = V(s) + (A(s,a) - mean of A(s,.) over actions)\n            else:\n                # natural DQN: a single stream outputs Q(s,a) directly\n                with tf.variable_scope('Q'):\n                    w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)\n                    b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)\n                    out = tf.matmul(l1, w2) + b2\n\n            return out\n\n        # ------------------ build evaluate_net ------------------\n        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input\n        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss\n        with tf.variable_scope('eval_net'):\n            c_names, n_l1, w_initializer, b_initializer = \\\n                ['eval_net_params', 
tf.GraphKeys.GLOBAL_VARIABLES], 20, \\\n                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers\n\n            self.q_eval = build_layers(self.s, c_names, n_l1, w_initializer, b_initializer)\n\n        with tf.variable_scope('loss'):\n            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))\n        with tf.variable_scope('train'):\n            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)\n\n        # ------------------ build target_net ------------------\n        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')    # input\n        with tf.variable_scope('target_net'):\n            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]\n\n            self.q_next = build_layers(self.s_, c_names, n_l1, w_initializer, b_initializer)\n\n    def store_transition(self, s, a, r, s_):\n        if not hasattr(self, 'memory_counter'):\n            self.memory_counter = 0\n        transition = np.hstack((s, [a, r], s_))\n        index = self.memory_counter % self.memory_size\n        self.memory[index, :] = transition\n        self.memory_counter += 1\n\n    def choose_action(self, observation):\n        observation = observation[np.newaxis, :]\n        if np.random.uniform() < self.epsilon:  # choosing action\n            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})\n            action = np.argmax(actions_value)\n        else:\n            action = np.random.randint(0, self.n_actions)\n        return action\n\n    def learn(self):\n        if self.learn_step_counter % self.replace_target_iter == 0:\n            self.sess.run(self.replace_target_op)\n            print('\\ntarget_params_replaced\\n')\n\n        sample_index = np.random.choice(self.memory_size, size=self.batch_size)\n        batch_memory = self.memory[sample_index, :]\n\n        q_next = self.sess.run(self.q_next, feed_dict={self.s_: 
batch_memory[:, -self.n_features:]}) # next observation\n        q_eval = self.sess.run(self.q_eval, {self.s: batch_memory[:, :self.n_features]})\n\n        q_target = q_eval.copy()\n\n        batch_index = np.arange(self.batch_size, dtype=np.int32)\n        eval_act_index = batch_memory[:, self.n_features].astype(int)\n        reward = batch_memory[:, self.n_features + 1]\n\n        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)\n\n        _, self.cost = self.sess.run([self._train_op, self.loss],\n                                     feed_dict={self.s: batch_memory[:, :self.n_features],\n                                                self.q_target: q_target})\n        self.cost_his.append(self.cost)\n\n        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max\n        self.learn_step_counter += 1\n\n\n\n\n\n"
  },
  {
    "path": "contents/5.3_Dueling_DQN/run_Pendulum.py",
    "content": "\"\"\"\nDueling DQN & Natural DQN comparison\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\n\nimport gym\nfrom RL_brain import DuelingDQN\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\n\n\nenv = gym.make('Pendulum-v0')\nenv = env.unwrapped\nenv.seed(1)\nMEMORY_SIZE = 3000\nACTION_SPACE = 25\n\nsess = tf.Session()\nwith tf.variable_scope('natural'):\n    natural_DQN = DuelingDQN(\n        n_actions=ACTION_SPACE, n_features=3, memory_size=MEMORY_SIZE,\n        e_greedy_increment=0.001, sess=sess, dueling=False)\n\nwith tf.variable_scope('dueling'):\n    dueling_DQN = DuelingDQN(\n        n_actions=ACTION_SPACE, n_features=3, memory_size=MEMORY_SIZE,\n        e_greedy_increment=0.001, sess=sess, dueling=True, output_graph=True)\n\nsess.run(tf.global_variables_initializer())\n\n\ndef train(RL):\n    acc_r = [0]\n    total_steps = 0\n    observation = env.reset()\n    while True:\n        # if total_steps-MEMORY_SIZE > 9000: env.render()\n\n        action = RL.choose_action(observation)\n\n        f_action = (action-(ACTION_SPACE-1)/2)/((ACTION_SPACE-1)/4)   # [-2 ~ 2] float actions\n        observation_, reward, done, info = env.step(np.array([f_action]))\n\n        reward /= 10      # normalize to a range of (-1, 0)\n        acc_r.append(reward + acc_r[-1])  # accumulated reward\n\n        RL.store_transition(observation, action, reward, observation_)\n\n        if total_steps > MEMORY_SIZE:\n            RL.learn()\n\n        if total_steps-MEMORY_SIZE > 15000:\n            break\n\n        observation = observation_\n        total_steps += 1\n    return RL.cost_his, acc_r\n\nc_natural, r_natural = train(natural_DQN)\nc_dueling, r_dueling = train(dueling_DQN)\n\nplt.figure(1)\nplt.plot(np.array(c_natural), c='r', label='natural')\nplt.plot(np.array(c_dueling), c='b', 
label='dueling')\nplt.legend(loc='best')\nplt.ylabel('cost')\nplt.xlabel('training steps')\nplt.grid()\n\nplt.figure(2)\nplt.plot(np.array(r_natural), c='r', label='natural')\nplt.plot(np.array(r_dueling), c='b', label='dueling')\nplt.legend(loc='best')\nplt.ylabel('accumulated reward')\nplt.xlabel('training steps')\nplt.grid()\n\nplt.show()\n\n"
  },
  {
    "path": "contents/5_Deep_Q_Network/DQN_modified.py",
    "content": "\"\"\"\nThis part of code is the Deep Q Network (DQN) brain.\n\nview the tensorboard picture about this DQN structure on: https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/4-3-DQN3/#modification\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: r1.2\n\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\n\nnp.random.seed(1)\ntf.set_random_seed(1)\n\n\n# Deep Q Network off-policy\nclass DeepQNetwork:\n    def __init__(\n            self,\n            n_actions,\n            n_features,\n            learning_rate=0.01,\n            reward_decay=0.9,\n            e_greedy=0.9,\n            replace_target_iter=300,\n            memory_size=500,\n            batch_size=32,\n            e_greedy_increment=None,\n            output_graph=False,\n    ):\n        self.n_actions = n_actions\n        self.n_features = n_features\n        self.lr = learning_rate\n        self.gamma = reward_decay\n        self.epsilon_max = e_greedy\n        self.replace_target_iter = replace_target_iter\n        self.memory_size = memory_size\n        self.batch_size = batch_size\n        self.epsilon_increment = e_greedy_increment\n        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max\n\n        # total learning step\n        self.learn_step_counter = 0\n\n        # initialize zero memory [s, a, r, s_]\n        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))\n\n        # consist of [target_net, evaluate_net]\n        self._build_net()\n\n        t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='target_net')\n        e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='eval_net')\n\n        with tf.variable_scope('hard_replacement'):\n            self.target_replace_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]\n\n        self.sess = tf.Session()\n\n        if output_graph:\n            # $ tensorboard --logdir=logs\n 
           tf.summary.FileWriter(\"logs/\", self.sess.graph)\n\n        self.sess.run(tf.global_variables_initializer())\n        self.cost_his = []\n\n    def _build_net(self):\n        # ------------------ all inputs ------------------------\n        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input State\n        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')  # input Next State\n        self.r = tf.placeholder(tf.float32, [None, ], name='r')  # input Reward\n        self.a = tf.placeholder(tf.int32, [None, ], name='a')  # input Action\n\n        w_initializer, b_initializer = tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)\n\n        # ------------------ build evaluate_net ------------------\n        with tf.variable_scope('eval_net'):\n            e1 = tf.layers.dense(self.s, 20, tf.nn.relu, kernel_initializer=w_initializer,\n                                 bias_initializer=b_initializer, name='e1')\n            self.q_eval = tf.layers.dense(e1, self.n_actions, kernel_initializer=w_initializer,\n                                          bias_initializer=b_initializer, name='q')\n\n        # ------------------ build target_net ------------------\n        with tf.variable_scope('target_net'):\n            t1 = tf.layers.dense(self.s_, 20, tf.nn.relu, kernel_initializer=w_initializer,\n                                 bias_initializer=b_initializer, name='t1')\n            self.q_next = tf.layers.dense(t1, self.n_actions, kernel_initializer=w_initializer,\n                                          bias_initializer=b_initializer, name='t2')\n\n        with tf.variable_scope('q_target'):\n            q_target = self.r + self.gamma * tf.reduce_max(self.q_next, axis=1, name='Qmax_s_')    # shape=(None, )\n            self.q_target = tf.stop_gradient(q_target)\n        with tf.variable_scope('q_eval'):\n            a_indices = tf.stack([tf.range(tf.shape(self.a)[0], dtype=tf.int32), self.a], 
axis=1)\n            self.q_eval_wrt_a = tf.gather_nd(params=self.q_eval, indices=a_indices)    # shape=(None, )\n        with tf.variable_scope('loss'):\n            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval_wrt_a, name='TD_error'))\n        with tf.variable_scope('train'):\n            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)\n\n    def store_transition(self, s, a, r, s_):\n        if not hasattr(self, 'memory_counter'):\n            self.memory_counter = 0\n        transition = np.hstack((s, [a, r], s_))\n        # replace the old memory with new memory\n        index = self.memory_counter % self.memory_size\n        self.memory[index, :] = transition\n        self.memory_counter += 1\n\n    def choose_action(self, observation):\n        # to have batch dimension when feed into tf placeholder\n        observation = observation[np.newaxis, :]\n\n        if np.random.uniform() < self.epsilon:\n            # forward feed the observation and get q value for every actions\n            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})\n            action = np.argmax(actions_value)\n        else:\n            action = np.random.randint(0, self.n_actions)\n        return action\n\n    def learn(self):\n        # check to replace target parameters\n        if self.learn_step_counter % self.replace_target_iter == 0:\n            self.sess.run(self.target_replace_op)\n            print('\\ntarget_params_replaced\\n')\n\n        # sample batch memory from all memory\n        if self.memory_counter > self.memory_size:\n            sample_index = np.random.choice(self.memory_size, size=self.batch_size)\n        else:\n            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)\n        batch_memory = self.memory[sample_index, :]\n\n        _, cost = self.sess.run(\n            [self._train_op, self.loss],\n            feed_dict={\n                self.s: 
batch_memory[:, :self.n_features],\n                self.a: batch_memory[:, self.n_features],\n                self.r: batch_memory[:, self.n_features + 1],\n                self.s_: batch_memory[:, -self.n_features:],\n            })\n\n        self.cost_his.append(cost)\n\n        # increasing epsilon\n        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max\n        self.learn_step_counter += 1\n\n    def plot_cost(self):\n        import matplotlib.pyplot as plt\n        plt.plot(np.arange(len(self.cost_his)), self.cost_his)\n        plt.ylabel('Cost')\n        plt.xlabel('training steps')\n        plt.show()\n\nif __name__ == '__main__':\n    DQN = DeepQNetwork(3,4, output_graph=True)"
  },
  {
    "path": "contents/5_Deep_Q_Network/RL_brain.py",
    "content": "\"\"\"\nThis part of code is the DQN brain, which is a brain of the agent.\nAll decisions are made in here.\nUsing Tensorflow to build the neural network.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.7.3\n\"\"\"\n\nimport numpy as np\nimport pandas as pd\nimport tensorflow as tf\n\nnp.random.seed(1)\ntf.set_random_seed(1)\n\n\n# Deep Q Network off-policy\nclass DeepQNetwork:\n    def __init__(\n            self,\n            n_actions,\n            n_features,\n            learning_rate=0.01,\n            reward_decay=0.9,\n            e_greedy=0.9,\n            replace_target_iter=300,\n            memory_size=500,\n            batch_size=32,\n            e_greedy_increment=None,\n            output_graph=False,\n    ):\n        self.n_actions = n_actions\n        self.n_features = n_features\n        self.lr = learning_rate\n        self.gamma = reward_decay\n        self.epsilon_max = e_greedy\n        self.replace_target_iter = replace_target_iter\n        self.memory_size = memory_size\n        self.batch_size = batch_size\n        self.epsilon_increment = e_greedy_increment\n        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max\n\n        # total learning step\n        self.learn_step_counter = 0\n\n        # initialize zero memory [s, a, r, s_]\n        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))\n\n        # consist of [target_net, evaluate_net]\n        self._build_net()\n        t_params = tf.get_collection('target_net_params')\n        e_params = tf.get_collection('eval_net_params')\n        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]\n\n        self.sess = tf.Session()\n\n        if output_graph:\n            # $ tensorboard --logdir=logs\n            # tf.train.SummaryWriter soon be deprecated, use following\n            tf.summary.FileWriter(\"logs/\", self.sess.graph)\n\n        
self.sess.run(tf.global_variables_initializer())\n        self.cost_his = []\n\n    def _build_net(self):\n        # ------------------ build evaluate_net ------------------\n        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input\n        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss\n        with tf.variable_scope('eval_net'):\n            # c_names(collections_names) are the collections to store variables\n            c_names, n_l1, w_initializer, b_initializer = \\\n                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \\\n                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers\n\n            # first layer. collections is used later when assign to target net\n            with tf.variable_scope('l1'):\n                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)\n                l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)\n\n            # second layer. 
collections is used later when assign to target net\n            with tf.variable_scope('l2'):\n                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)\n                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)\n                self.q_eval = tf.matmul(l1, w2) + b2\n\n        with tf.variable_scope('loss'):\n            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))\n        with tf.variable_scope('train'):\n            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)\n\n        # ------------------ build target_net ------------------\n        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')    # input\n        with tf.variable_scope('target_net'):\n            # c_names(collections_names) are the collections to store variables\n            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]\n\n            # first layer. collections is used later when assign to target net\n            with tf.variable_scope('l1'):\n                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)\n                l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)\n\n            # second layer. 
collections is used later when assign to target net\n            with tf.variable_scope('l2'):\n                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)\n                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)\n                self.q_next = tf.matmul(l1, w2) + b2\n\n    def store_transition(self, s, a, r, s_):\n        if not hasattr(self, 'memory_counter'):\n            self.memory_counter = 0\n\n        transition = np.hstack((s, [a, r], s_))\n\n        # replace the old memory with new memory\n        index = self.memory_counter % self.memory_size\n        self.memory[index, :] = transition\n\n        self.memory_counter += 1\n\n    def choose_action(self, observation):\n        # to have batch dimension when feed into tf placeholder\n        observation = observation[np.newaxis, :]\n\n        if np.random.uniform() < self.epsilon:\n            # forward feed the observation and get q value for every actions\n            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})\n            action = np.argmax(actions_value)\n        else:\n            action = np.random.randint(0, self.n_actions)\n        return action\n\n    def learn(self):\n        # check to replace target parameters\n        if self.learn_step_counter % self.replace_target_iter == 0:\n            self.sess.run(self.replace_target_op)\n            print('\\ntarget_params_replaced\\n')\n\n        # sample batch memory from all memory\n        if self.memory_counter > self.memory_size:\n            sample_index = np.random.choice(self.memory_size, size=self.batch_size)\n        else:\n            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)\n        batch_memory = self.memory[sample_index, :]\n\n        q_next, q_eval = self.sess.run(\n            [self.q_next, self.q_eval],\n            feed_dict={\n                self.s_: 
batch_memory[:, -self.n_features:],  # next states, fed to the target net (fixed params)\n                self.s: batch_memory[:, :self.n_features],  # current states, fed to the eval net (newest params)\n            })\n\n        # start from q_eval so that only the taken action contributes to the error\n        q_target = q_eval.copy()\n\n        batch_index = np.arange(self.batch_size, dtype=np.int32)\n        eval_act_index = batch_memory[:, self.n_features].astype(int)\n        reward = batch_memory[:, self.n_features + 1]\n\n        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)\n\n        \"\"\"\n        For example, suppose this batch has 2 samples and 3 actions:\n        q_eval =\n        [[1, 2, 3],\n         [4, 5, 6]]\n\n        q_target starts as a copy of q_eval:\n        [[1, 2, 3],\n         [4, 5, 6]]\n\n        Then only the entry of the action actually taken is overwritten with the\n        TD target r + gamma * max(q_next). For example:\n            sample 0 took action 0 and its TD target is -1;\n            sample 1 took action 2 and its TD target is -2:\n        q_target =\n        [[-1, 2, 3],\n         [4, 5, -2]]\n\n        So (q_target - q_eval) becomes:\n        [[(-1)-(1), 0, 0],\n         [0, 0, (-2)-(6)]]\n\n        Only the error of the taken action is backpropagated through the network;\n        the other actions have error 0 because they were not chosen.\n        \"\"\"\n\n        # train eval network\n        _, self.cost = self.sess.run([self._train_op, self.loss],\n                                     feed_dict={self.s: batch_memory[:, :self.n_features],\n                                                self.q_target: q_target})\n        self.cost_his.append(self.cost)\n\n        # increase epsilon (less exploration) until it reaches epsilon_max\n        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max\n        self.learn_step_counter += 1\n\n    def plot_cost(self):\n        import matplotlib.pyplot as plt\n        plt.plot(np.arange(len(self.cost_his)), self.cost_his)\n        plt.ylabel('Cost')\n       
 plt.xlabel('training steps')\n        plt.show()\n\n\n\n"
  },
  {
    "path": "contents/5_Deep_Q_Network/maze_env.py",
    "content": "\"\"\"\nReinforcement learning maze example.\n\nRed rectangle:          explorer.\nBlack rectangles:       hells       [reward = -1].\nYellow bin circle:      paradise    [reward = +1].\nAll other states:       ground      [reward = 0].\n\nThis script is the environment part of this example.\nThe RL is in RL_brain.py.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\"\"\"\nimport numpy as np\nimport time\nimport sys\nif sys.version_info.major == 2:\n    import Tkinter as tk\nelse:\n    import tkinter as tk\n\nUNIT = 40   # pixels\nMAZE_H = 4  # grid height\nMAZE_W = 4  # grid width\n\n\nclass Maze(tk.Tk, object):\n    def __init__(self):\n        super(Maze, self).__init__()\n        self.action_space = ['u', 'd', 'l', 'r']\n        self.n_actions = len(self.action_space)\n        self.n_features = 2\n        self.title('maze')\n        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))\n        self._build_maze()\n\n    def _build_maze(self):\n        self.canvas = tk.Canvas(self, bg='white',\n                           height=MAZE_H * UNIT,\n                           width=MAZE_W * UNIT)\n\n        # create grids\n        for c in range(0, MAZE_W * UNIT, UNIT):\n            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT\n            self.canvas.create_line(x0, y0, x1, y1)\n        for r in range(0, MAZE_H * UNIT, UNIT):\n            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r\n            self.canvas.create_line(x0, y0, x1, y1)\n\n        # create origin\n        origin = np.array([20, 20])\n\n        # hell\n        hell1_center = origin + np.array([UNIT * 2, UNIT])\n        self.hell1 = self.canvas.create_rectangle(\n            hell1_center[0] - 15, hell1_center[1] - 15,\n            hell1_center[0] + 15, hell1_center[1] + 15,\n            fill='black')\n        # hell\n        # hell2_center = origin + np.array([UNIT, UNIT * 2])\n        # self.hell2 = self.canvas.create_rectangle(\n        #     hell2_center[0] - 
15, hell2_center[1] - 15,\n        #     hell2_center[0] + 15, hell2_center[1] + 15,\n        #     fill='black')\n\n        # create oval\n        oval_center = origin + UNIT * 2\n        self.oval = self.canvas.create_oval(\n            oval_center[0] - 15, oval_center[1] - 15,\n            oval_center[0] + 15, oval_center[1] + 15,\n            fill='yellow')\n\n        # create red rect\n        self.rect = self.canvas.create_rectangle(\n            origin[0] - 15, origin[1] - 15,\n            origin[0] + 15, origin[1] + 15,\n            fill='red')\n\n        # pack all\n        self.canvas.pack()\n\n    def reset(self):\n        self.update()\n        time.sleep(0.1)\n        self.canvas.delete(self.rect)\n        origin = np.array([20, 20])\n        self.rect = self.canvas.create_rectangle(\n            origin[0] - 15, origin[1] - 15,\n            origin[0] + 15, origin[1] + 15,\n            fill='red')\n        # return observation\n        return (np.array(self.canvas.coords(self.rect)[:2]) - np.array(self.canvas.coords(self.oval)[:2]))/(MAZE_H*UNIT)\n\n    def step(self, action):\n        s = self.canvas.coords(self.rect)\n        base_action = np.array([0, 0])\n        if action == 0:   # up\n            if s[1] > UNIT:\n                base_action[1] -= UNIT\n        elif action == 1:   # down\n            if s[1] < (MAZE_H - 1) * UNIT:\n                base_action[1] += UNIT\n        elif action == 2:   # right\n            if s[0] < (MAZE_W - 1) * UNIT:\n                base_action[0] += UNIT\n        elif action == 3:   # left\n            if s[0] > UNIT:\n                base_action[0] -= UNIT\n\n        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent\n\n        next_coords = self.canvas.coords(self.rect)  # next state\n\n        # reward function\n        if next_coords == self.canvas.coords(self.oval):\n            reward = 1\n            done = True\n        elif next_coords in [self.canvas.coords(self.hell1)]:\n          
  reward = -1\n            done = True\n        else:\n            reward = 0\n            done = False\n        s_ = (np.array(next_coords[:2]) - np.array(self.canvas.coords(self.oval)[:2]))/(MAZE_H*UNIT)\n        return s_, reward, done\n\n    def render(self):\n        # time.sleep(0.01)\n        self.update()\n\n\n"
  },
  {
    "path": "contents/5_Deep_Q_Network/run_this.py",
    "content": "from maze_env import Maze\nfrom RL_brain import DeepQNetwork\n\n\ndef run_maze():\n    step = 0\n    for episode in range(300):\n        # initial observation\n        observation = env.reset()\n\n        while True:\n            # fresh env\n            env.render()\n\n            # RL choose action based on observation\n            action = RL.choose_action(observation)\n\n            # RL take action and get next observation and reward\n            observation_, reward, done = env.step(action)\n\n            RL.store_transition(observation, action, reward, observation_)\n\n            if (step > 200) and (step % 5 == 0):\n                RL.learn()\n\n            # swap observation\n            observation = observation_\n\n            # break while loop when end of this episode\n            if done:\n                break\n            step += 1\n\n    # end of game\n    print('game over')\n    env.destroy()\n\n\nif __name__ == \"__main__\":\n    # maze game\n    env = Maze()\n    RL = DeepQNetwork(env.n_actions, env.n_features,\n                      learning_rate=0.01,\n                      reward_decay=0.9,\n                      e_greedy=0.9,\n                      replace_target_iter=200,\n                      memory_size=2000,\n                      # output_graph=True\n                      )\n    env.after(100, run_maze)\n    env.mainloop()\n    RL.plot_cost()"
  },
  {
    "path": "contents/6_OpenAI_gym/RL_brain.py",
    "content": "\"\"\"\nThis part of code is the DQN brain, which is a brain of the agent.\nAll decisions are made in here.\nUsing Tensorflow to build the neural network.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\nimport numpy as np\nimport pandas as pd\nimport tensorflow as tf\n\n\n# Deep Q Network off-policy\nclass DeepQNetwork:\n    def __init__(\n            self,\n            n_actions,\n            n_features,\n            learning_rate=0.01,\n            reward_decay=0.9,\n            e_greedy=0.9,\n            replace_target_iter=300,\n            memory_size=500,\n            batch_size=32,\n            e_greedy_increment=None,\n            output_graph=False,\n    ):\n        self.n_actions = n_actions\n        self.n_features = n_features\n        self.lr = learning_rate\n        self.gamma = reward_decay\n        self.epsilon_max = e_greedy\n        self.replace_target_iter = replace_target_iter\n        self.memory_size = memory_size\n        self.batch_size = batch_size\n        self.epsilon_increment = e_greedy_increment\n        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max\n\n        # total learning step\n        self.learn_step_counter = 0\n\n        # initialize zero memory [s, a, r, s_]\n        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))\n\n        # consist of [target_net, evaluate_net]\n        self._build_net()\n        t_params = tf.get_collection('target_net_params')\n        e_params = tf.get_collection('eval_net_params')\n        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]\n\n        self.sess = tf.Session()\n\n        if output_graph:\n            # $ tensorboard --logdir=logs\n            # tf.train.SummaryWriter soon be deprecated, use following\n            tf.summary.FileWriter(\"logs/\", self.sess.graph)\n\n        self.sess.run(tf.global_variables_initializer())\n        
self.cost_his = []\n\n    def _build_net(self):\n        # ------------------ build evaluate_net ------------------\n        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input\n        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss\n        with tf.variable_scope('eval_net'):\n            # c_names(collections_names) are the collections to store variables\n            c_names, n_l1, w_initializer, b_initializer = \\\n                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \\\n                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers\n\n            # first layer. collections is used later when assign to target net\n            with tf.variable_scope('l1'):\n                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)\n                l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)\n\n            # second layer. 
collections is used later when assign to target net\n            with tf.variable_scope('l2'):\n                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)\n                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)\n                self.q_eval = tf.matmul(l1, w2) + b2\n\n        with tf.variable_scope('loss'):\n            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))\n        with tf.variable_scope('train'):\n            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)\n\n        # ------------------ build target_net ------------------\n        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')    # input\n        with tf.variable_scope('target_net'):\n            # c_names(collections_names) are the collections to store variables\n            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]\n\n            # first layer. collections is used later when assign to target net\n            with tf.variable_scope('l1'):\n                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)\n                l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)\n\n            # second layer. 
collections is used later when assign to target net\n            with tf.variable_scope('l2'):\n                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)\n                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)\n                self.q_next = tf.matmul(l1, w2) + b2\n\n    def store_transition(self, s, a, r, s_):\n        if not hasattr(self, 'memory_counter'):\n            self.memory_counter = 0\n\n        transition = np.hstack((s, [a, r], s_))\n\n        # replace the old memory with new memory\n        index = self.memory_counter % self.memory_size\n        self.memory[index, :] = transition\n\n        self.memory_counter += 1\n\n    def choose_action(self, observation):\n        # to have batch dimension when feed into tf placeholder\n        observation = observation[np.newaxis, :]\n\n        if np.random.uniform() < self.epsilon:\n            # forward feed the observation and get q value for every actions\n            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})\n            action = np.argmax(actions_value)\n        else:\n            action = np.random.randint(0, self.n_actions)\n        return action\n\n    def learn(self):\n        # check to replace target parameters\n        if self.learn_step_counter % self.replace_target_iter == 0:\n            self.sess.run(self.replace_target_op)\n            print('\\ntarget_params_replaced\\n')\n\n        # sample batch memory from all memory\n        if self.memory_counter > self.memory_size:\n            sample_index = np.random.choice(self.memory_size, size=self.batch_size)\n        else:\n            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)\n        batch_memory = self.memory[sample_index, :]\n\n        q_next, q_eval = self.sess.run(\n            [self.q_next, self.q_eval],\n            feed_dict={\n                self.s_: 
batch_memory[:, -self.n_features:],  # fixed params\n                self.s: batch_memory[:, :self.n_features],  # newest params\n            })\n\n        # change q_target w.r.t q_eval's action\n        q_target = q_eval.copy()\n\n        batch_index = np.arange(self.batch_size, dtype=np.int32)\n        eval_act_index = batch_memory[:, self.n_features].astype(int)\n        reward = batch_memory[:, self.n_features + 1]\n\n        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)\n\n        \"\"\"\n        For example in this batch I have 2 samples and 3 actions:\n        q_eval =\n        [[1, 2, 3],\n         [4, 5, 6]]\n\n        q_target = q_eval =\n        [[1, 2, 3],\n         [4, 5, 6]]\n\n        Then change q_target with the real q_target value w.r.t the q_eval's action.\n        For example in:\n            sample 0, I took action 0, and the max q_target value is -1;\n            sample 1, I took action 2, and the max q_target value is -2:\n        q_target =\n        [[-1, 2, 3],\n         [4, 5, -2]]\n\n        So the (q_target - q_eval) becomes:\n        [[(-1)-(1), 0, 0],\n         [0, 0, (-2)-(6)]]\n\n        We then backpropagate this error w.r.t the corresponding action to network,\n        leave other action as error=0 cause we didn't choose it.\n        \"\"\"\n\n        # train eval network\n        _, self.cost = self.sess.run([self._train_op, self.loss],\n                                     feed_dict={self.s: batch_memory[:, :self.n_features],\n                                                self.q_target: q_target})\n        self.cost_his.append(self.cost)\n\n        # increasing epsilon\n        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max\n        self.learn_step_counter += 1\n\n    def plot_cost(self):\n        import matplotlib.pyplot as plt\n        plt.plot(np.arange(len(self.cost_his)), self.cost_his)\n        plt.ylabel('Cost')\n       
 plt.xlabel('training steps')\n        plt.show()\n\n\n\n"
  },
  {
    "path": "contents/6_OpenAI_gym/run_CartPole.py",
    "content": "\"\"\"\nDeep Q network,\n\nUsing:\nTensorflow: 1.0\ngym: 0.7.3\n\"\"\"\n\n\nimport gym\nfrom RL_brain import DeepQNetwork\n\nenv = gym.make('CartPole-v0')\nenv = env.unwrapped\n\nprint(env.action_space)\nprint(env.observation_space)\nprint(env.observation_space.high)\nprint(env.observation_space.low)\n\nRL = DeepQNetwork(n_actions=env.action_space.n,\n                  n_features=env.observation_space.shape[0],\n                  learning_rate=0.01, e_greedy=0.9,\n                  replace_target_iter=100, memory_size=2000,\n                  e_greedy_increment=0.001,)\n\ntotal_steps = 0\n\n\nfor i_episode in range(100):\n\n    observation = env.reset()\n    ep_r = 0\n    while True:\n        env.render()\n\n        action = RL.choose_action(observation)\n\n        observation_, reward, done, info = env.step(action)\n\n        # the smaller theta and closer to center the better\n        x, x_dot, theta, theta_dot = observation_\n        r1 = (env.x_threshold - abs(x))/env.x_threshold - 0.8\n        r2 = (env.theta_threshold_radians - abs(theta))/env.theta_threshold_radians - 0.5\n        reward = r1 + r2\n\n        RL.store_transition(observation, action, reward, observation_)\n\n        ep_r += reward\n        if total_steps > 1000:\n            RL.learn()\n\n        if done:\n            print('episode: ', i_episode,\n                  'ep_r: ', round(ep_r, 2),\n                  ' epsilon: ', round(RL.epsilon, 2))\n            break\n\n        observation = observation_\n        total_steps += 1\n\nRL.plot_cost()\n"
  },
  {
    "path": "contents/6_OpenAI_gym/run_MountainCar.py",
    "content": "\"\"\"\nDeep Q network,\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\n\nimport gym\nfrom RL_brain import DeepQNetwork\n\nenv = gym.make('MountainCar-v0')\nenv = env.unwrapped\n\nprint(env.action_space)\nprint(env.observation_space)\nprint(env.observation_space.high)\nprint(env.observation_space.low)\n\nRL = DeepQNetwork(n_actions=3, n_features=2, learning_rate=0.001, e_greedy=0.9,\n                  replace_target_iter=300, memory_size=3000,\n                  e_greedy_increment=0.0002,)\n\ntotal_steps = 0\n\n\nfor i_episode in range(10):\n\n    observation = env.reset()\n    ep_r = 0\n    while True:\n        env.render()\n\n        action = RL.choose_action(observation)\n\n        observation_, reward, done, info = env.step(action)\n\n        position, velocity = observation_\n\n        # the higher the better\n        reward = abs(position - (-0.5))     # r in [0, 1]\n\n        RL.store_transition(observation, action, reward, observation_)\n\n        if total_steps > 1000:\n            RL.learn()\n\n        ep_r += reward\n        if done:\n            get = '| Get' if observation_[0] >= env.unwrapped.goal_position else '| ----'\n            print('Epi: ', i_episode,\n                  get,\n                  '| Ep_r: ', round(ep_r, 4),\n                  '| Epsilon: ', round(RL.epsilon, 2))\n            break\n\n        observation = observation_\n        total_steps += 1\n\nRL.plot_cost()\n"
  },
  {
    "path": "contents/7_Policy_gradient_softmax/RL_brain.py",
    "content": "\"\"\"\nThis part of code is the reinforcement learning brain, which is a brain of the agent.\nAll decisions are made in here.\n\nPolicy Gradient, Reinforcement Learning.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\n\n# reproducible\nnp.random.seed(1)\ntf.set_random_seed(1)\n\n\nclass PolicyGradient:\n    def __init__(\n            self,\n            n_actions,\n            n_features,\n            learning_rate=0.01,\n            reward_decay=0.95,\n            output_graph=False,\n    ):\n        self.n_actions = n_actions\n        self.n_features = n_features\n        self.lr = learning_rate\n        self.gamma = reward_decay\n\n        self.ep_obs, self.ep_as, self.ep_rs = [], [], []\n\n        self._build_net()\n\n        self.sess = tf.Session()\n\n        if output_graph:\n            # $ tensorboard --logdir=logs\n            # http://0.0.0.0:6006/\n            # tf.train.SummaryWriter soon be deprecated, use following\n            tf.summary.FileWriter(\"logs/\", self.sess.graph)\n\n        self.sess.run(tf.global_variables_initializer())\n\n    def _build_net(self):\n        with tf.name_scope('inputs'):\n            self.tf_obs = tf.placeholder(tf.float32, [None, self.n_features], name=\"observations\")\n            self.tf_acts = tf.placeholder(tf.int32, [None, ], name=\"actions_num\")\n            self.tf_vt = tf.placeholder(tf.float32, [None, ], name=\"actions_value\")\n        # fc1\n        layer = tf.layers.dense(\n            inputs=self.tf_obs,\n            units=10,\n            activation=tf.nn.tanh,  # tanh activation\n            kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),\n            bias_initializer=tf.constant_initializer(0.1),\n            name='fc1'\n        )\n        # fc2\n        all_act = tf.layers.dense(\n            inputs=layer,\n            units=self.n_actions,\n   
         activation=None,\n            kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),\n            bias_initializer=tf.constant_initializer(0.1),\n            name='fc2'\n        )\n\n        self.all_act_prob = tf.nn.softmax(all_act, name='act_prob')  # use softmax to convert to probability\n\n        with tf.name_scope('loss'):\n            # to maximize total reward (log_p * R) is to minimize -(log_p * R), and the tf only have minimize(loss)\n            neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=all_act, labels=self.tf_acts)   # this is negative log of chosen action\n            # or in this way:\n            # neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1)\n            loss = tf.reduce_mean(neg_log_prob * self.tf_vt)  # reward guided loss\n\n        with tf.name_scope('train'):\n            self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)\n\n    def choose_action(self, observation):\n        prob_weights = self.sess.run(self.all_act_prob, feed_dict={self.tf_obs: observation[np.newaxis, :]})\n        action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel())  # select action w.r.t the actions prob\n        return action\n\n    def store_transition(self, s, a, r):\n        self.ep_obs.append(s)\n        self.ep_as.append(a)\n        self.ep_rs.append(r)\n\n    def learn(self):\n        # discount and normalize episode reward\n        discounted_ep_rs_norm = self._discount_and_norm_rewards()\n\n        # train on episode\n        self.sess.run(self.train_op, feed_dict={\n             self.tf_obs: np.vstack(self.ep_obs),  # shape=[None, n_obs]\n             self.tf_acts: np.array(self.ep_as),  # shape=[None, ]\n             self.tf_vt: discounted_ep_rs_norm,  # shape=[None, ]\n        })\n\n        self.ep_obs, self.ep_as, self.ep_rs = [], [], []    # empty episode data\n        return discounted_ep_rs_norm\n\n    def 
_discount_and_norm_rewards(self):\n        # discount episode rewards\n        # float dtype so the in-place normalization below also works for integer rewards\n        discounted_ep_rs = np.zeros_like(self.ep_rs, dtype=np.float64)\n        running_add = 0\n        for t in reversed(range(0, len(self.ep_rs))):\n            running_add = running_add * self.gamma + self.ep_rs[t]\n            discounted_ep_rs[t] = running_add\n\n        # normalize episode rewards\n        discounted_ep_rs -= np.mean(discounted_ep_rs)\n        discounted_ep_rs /= np.std(discounted_ep_rs)\n        return discounted_ep_rs\n\n\n\n"
  },
  {
    "path": "contents/7_Policy_gradient_softmax/run_CartPole.py",
    "content": "\"\"\"\nPolicy Gradient, Reinforcement Learning.\n\nThe cart pole example\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\nimport gym\nfrom RL_brain import PolicyGradient\nimport matplotlib.pyplot as plt\n\nDISPLAY_REWARD_THRESHOLD = 400  # renders environment if total episode reward is greater then this threshold\nRENDER = False  # rendering wastes time\n\nenv = gym.make('CartPole-v0')\nenv.seed(1)     # reproducible, general Policy gradient has high variance\nenv = env.unwrapped\n\nprint(env.action_space)\nprint(env.observation_space)\nprint(env.observation_space.high)\nprint(env.observation_space.low)\n\nRL = PolicyGradient(\n    n_actions=env.action_space.n,\n    n_features=env.observation_space.shape[0],\n    learning_rate=0.02,\n    reward_decay=0.99,\n    # output_graph=True,\n)\n\nfor i_episode in range(3000):\n\n    observation = env.reset()\n\n    while True:\n        if RENDER: env.render()\n\n        action = RL.choose_action(observation)\n\n        observation_, reward, done, info = env.step(action)\n\n        RL.store_transition(observation, action, reward)\n\n        if done:\n            ep_rs_sum = sum(RL.ep_rs)\n\n            if 'running_reward' not in globals():\n                running_reward = ep_rs_sum\n            else:\n                running_reward = running_reward * 0.99 + ep_rs_sum * 0.01\n            if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True     # rendering\n            print(\"episode:\", i_episode, \"  reward:\", int(running_reward))\n\n            vt = RL.learn()\n\n            if i_episode == 0:\n                plt.plot(vt)    # plot the episode vt\n                plt.xlabel('episode steps')\n                plt.ylabel('normalized state-action value')\n                plt.show()\n            break\n\n        observation = observation_\n"
  },
  {
    "path": "contents/7_Policy_gradient_softmax/run_MountainCar.py",
    "content": "\"\"\"\nPolicy Gradient, Reinforcement Learning.\n\nThe cart pole example\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\nimport gym\nfrom RL_brain import PolicyGradient\nimport matplotlib.pyplot as plt\n\nDISPLAY_REWARD_THRESHOLD = -2000  # renders environment if total episode reward is greater then this threshold\n# episode: 154   reward: -10667\n# episode: 387   reward: -2009\n# episode: 489   reward: -1006\n# episode: 628   reward: -502\n\nRENDER = False  # rendering wastes time\n\nenv = gym.make('MountainCar-v0')\nenv.seed(1)     # reproducible, general Policy gradient has high variance\nenv = env.unwrapped\n\nprint(env.action_space)\nprint(env.observation_space)\nprint(env.observation_space.high)\nprint(env.observation_space.low)\n\nRL = PolicyGradient(\n    n_actions=env.action_space.n,\n    n_features=env.observation_space.shape[0],\n    learning_rate=0.02,\n    reward_decay=0.995,\n    # output_graph=True,\n)\n\nfor i_episode in range(1000):\n\n    observation = env.reset()\n\n    while True:\n        if RENDER: env.render()\n\n        action = RL.choose_action(observation)\n\n        observation_, reward, done, info = env.step(action)     # reward = -1 in all cases\n\n        RL.store_transition(observation, action, reward)\n\n        if done:\n            # calculate running reward\n            ep_rs_sum = sum(RL.ep_rs)\n            if 'running_reward' not in globals():\n                running_reward = ep_rs_sum\n            else:\n                running_reward = running_reward * 0.99 + ep_rs_sum * 0.01\n            if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True     # rendering\n\n            print(\"episode:\", i_episode, \"  reward:\", int(running_reward))\n\n            vt = RL.learn()  # train\n\n            if i_episode == 30:\n                plt.plot(vt)  # plot the episode vt\n                plt.xlabel('episode steps')\n                
plt.ylabel('normalized state-action value')\n                plt.show()\n\n            break\n\n        observation = observation_\n"
  },
  {
    "path": "contents/8_Actor_Critic_Advantage/AC_CartPole.py",
    "content": "\"\"\"\nActor-Critic using TD-error as the Advantage, Reinforcement Learning.\n\nThe cart pole example. Policy is oscillated.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow 1.0\ngym 0.8.0\n\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\nimport gym\n\nnp.random.seed(2)\ntf.set_random_seed(2)  # reproducible\n\n# Superparameters\nOUTPUT_GRAPH = False\nMAX_EPISODE = 3000\nDISPLAY_REWARD_THRESHOLD = 200  # renders environment if total episode reward is greater then this threshold\nMAX_EP_STEPS = 1000   # maximum time step in one episode\nRENDER = False  # rendering wastes time\nGAMMA = 0.9     # reward discount in TD error\nLR_A = 0.001    # learning rate for actor\nLR_C = 0.01     # learning rate for critic\n\nenv = gym.make('CartPole-v0')\nenv.seed(1)  # reproducible\nenv = env.unwrapped\n\nN_F = env.observation_space.shape[0]\nN_A = env.action_space.n\n\n\nclass Actor(object):\n    def __init__(self, sess, n_features, n_actions, lr=0.001):\n        self.sess = sess\n\n        self.s = tf.placeholder(tf.float32, [1, n_features], \"state\")\n        self.a = tf.placeholder(tf.int32, None, \"act\")\n        self.td_error = tf.placeholder(tf.float32, None, \"td_error\")  # TD_error\n\n        with tf.variable_scope('Actor'):\n            l1 = tf.layers.dense(\n                inputs=self.s,\n                units=20,    # number of hidden units\n                activation=tf.nn.relu,\n                kernel_initializer=tf.random_normal_initializer(0., .1),    # weights\n                bias_initializer=tf.constant_initializer(0.1),  # biases\n                name='l1'\n            )\n\n            self.acts_prob = tf.layers.dense(\n                inputs=l1,\n                units=n_actions,    # output units\n                activation=tf.nn.softmax,   # get action probabilities\n                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights\n                
bias_initializer=tf.constant_initializer(0.1),  # biases\n                name='acts_prob'\n            )\n\n        with tf.variable_scope('exp_v'):\n            log_prob = tf.log(self.acts_prob[0, self.a])\n            self.exp_v = tf.reduce_mean(log_prob * self.td_error)  # advantage (TD_error) guided loss\n\n        with tf.variable_scope('train'):\n            self.train_op = tf.train.AdamOptimizer(lr).minimize(-self.exp_v)  # minimize(-exp_v) = maximize(exp_v)\n\n    def learn(self, s, a, td):\n        s = s[np.newaxis, :]\n        feed_dict = {self.s: s, self.a: a, self.td_error: td}\n        _, exp_v = self.sess.run([self.train_op, self.exp_v], feed_dict)\n        return exp_v\n\n    def choose_action(self, s):\n        s = s[np.newaxis, :]\n        probs = self.sess.run(self.acts_prob, {self.s: s})   # get probabilities for all actions\n        return np.random.choice(np.arange(probs.shape[1]), p=probs.ravel())   # return a int\n\n\nclass Critic(object):\n    def __init__(self, sess, n_features, lr=0.01):\n        self.sess = sess\n\n        self.s = tf.placeholder(tf.float32, [1, n_features], \"state\")\n        self.v_ = tf.placeholder(tf.float32, [1, 1], \"v_next\")\n        self.r = tf.placeholder(tf.float32, None, 'r')\n\n        with tf.variable_scope('Critic'):\n            l1 = tf.layers.dense(\n                inputs=self.s,\n                units=20,  # number of hidden units\n                activation=tf.nn.relu,  # None\n                # have to be linear to make sure the convergence of actor.\n                # But linear approximator seems hardly learns the correct Q.\n                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights\n                bias_initializer=tf.constant_initializer(0.1),  # biases\n                name='l1'\n            )\n\n            self.v = tf.layers.dense(\n                inputs=l1,\n                units=1,  # output units\n                activation=None,\n                
kernel_initializer=tf.random_normal_initializer(0., .1),  # weights\n                bias_initializer=tf.constant_initializer(0.1),  # biases\n                name='V'\n            )\n\n        with tf.variable_scope('squared_TD_error'):\n            self.td_error = self.r + GAMMA * self.v_ - self.v\n            self.loss = tf.square(self.td_error)    # TD_error = (r+gamma*V_next) - V_eval\n        with tf.variable_scope('train'):\n            self.train_op = tf.train.AdamOptimizer(lr).minimize(self.loss)\n\n    def learn(self, s, r, s_):\n        s, s_ = s[np.newaxis, :], s_[np.newaxis, :]\n\n        v_ = self.sess.run(self.v, {self.s: s_})\n        td_error, _ = self.sess.run([self.td_error, self.train_op],\n                                          {self.s: s, self.v_: v_, self.r: r})\n        return td_error\n\n\nsess = tf.Session()\n\nactor = Actor(sess, n_features=N_F, n_actions=N_A, lr=LR_A)\ncritic = Critic(sess, n_features=N_F, lr=LR_C)     # we need a good teacher, so the teacher should learn faster than the actor\n\nsess.run(tf.global_variables_initializer())\n\nif OUTPUT_GRAPH:\n    tf.summary.FileWriter(\"logs/\", sess.graph)\n\nfor i_episode in range(MAX_EPISODE):\n    s = env.reset()\n    t = 0\n    track_r = []\n    while True:\n        if RENDER: env.render()\n\n        a = actor.choose_action(s)\n\n        s_, r, done, info = env.step(a)\n\n        if done: r = -20\n\n        track_r.append(r)\n\n        td_error = critic.learn(s, r, s_)  # gradient = grad[r + gamma * V(s_) - V(s)]\n        actor.learn(s, a, td_error)     # true_gradient = grad[logPi(s,a) * td_error]\n\n        s = s_\n        t += 1\n\n        if done or t >= MAX_EP_STEPS:\n            ep_rs_sum = sum(track_r)\n\n            if 'running_reward' not in globals():\n                running_reward = ep_rs_sum\n            else:\n                running_reward = running_reward * 0.95 + ep_rs_sum * 0.05\n            if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True  # 
rendering\n            print(\"episode:\", i_episode, \"  reward:\", int(running_reward))\n            break\n\n"
  },
  {
    "path": "contents/8_Actor_Critic_Advantage/AC_continue_Pendulum.py",
    "content": "\"\"\"\nActor-Critic with continuous action using TD-error as the Advantage, Reinforcement Learning.\n\nThe Pendulum example (based on https://github.com/dennybritz/reinforcement-learning/blob/master/PolicyGradient/Continuous%20MountainCar%20Actor%20Critic%20Solution.ipynb)\n\nCannot converge!!! oscillate!!!\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow r1.3\ngym 0.8.0\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\nimport gym\n\nnp.random.seed(2)\ntf.set_random_seed(2)  # reproducible\n\n\nclass Actor(object):\n    def __init__(self, sess, n_features, action_bound, lr=0.0001):\n        self.sess = sess\n\n        self.s = tf.placeholder(tf.float32, [1, n_features], \"state\")\n        self.a = tf.placeholder(tf.float32, None, name=\"act\")\n        self.td_error = tf.placeholder(tf.float32, None, name=\"td_error\")  # TD_error\n\n        l1 = tf.layers.dense(\n            inputs=self.s,\n            units=30,  # number of hidden units\n            activation=tf.nn.relu,\n            kernel_initializer=tf.random_normal_initializer(0., .1),  # weights\n            bias_initializer=tf.constant_initializer(0.1),  # biases\n            name='l1'\n        )\n\n        mu = tf.layers.dense(\n            inputs=l1,\n            units=1,  # number of hidden units\n            activation=tf.nn.tanh,\n            kernel_initializer=tf.random_normal_initializer(0., .1),  # weights\n            bias_initializer=tf.constant_initializer(0.1),  # biases\n            name='mu'\n        )\n\n        sigma = tf.layers.dense(\n            inputs=l1,\n            units=1,  # output units\n            activation=tf.nn.softplus,  # get action probabilities\n            kernel_initializer=tf.random_normal_initializer(0., .1),  # weights\n            bias_initializer=tf.constant_initializer(1.),  # biases\n            name='sigma'\n        )\n        global_step = tf.Variable(0, trainable=False)\n        # self.e = 
epsilon = tf.train.exponential_decay(2., global_step, 1000, 0.9)\n        self.mu, self.sigma = tf.squeeze(mu*2), tf.squeeze(sigma+0.1)\n        self.normal_dist = tf.distributions.Normal(self.mu, self.sigma)\n\n        self.action = tf.clip_by_value(self.normal_dist.sample(1), action_bound[0], action_bound[1])\n\n        with tf.name_scope('exp_v'):\n            log_prob = self.normal_dist.log_prob(self.a)  # loss without advantage\n            self.exp_v = log_prob * self.td_error  # advantage (TD_error) guided loss\n            # Add cross entropy cost to encourage exploration\n            self.exp_v += 0.01*self.normal_dist.entropy()\n\n        with tf.name_scope('train'):\n            self.train_op = tf.train.AdamOptimizer(lr).minimize(-self.exp_v, global_step)    # min(v) = max(-v)\n\n    def learn(self, s, a, td):\n        s = s[np.newaxis, :]\n        feed_dict = {self.s: s, self.a: a, self.td_error: td}\n        _, exp_v = self.sess.run([self.train_op, self.exp_v], feed_dict)\n        return exp_v\n\n    def choose_action(self, s):\n        s = s[np.newaxis, :]\n        return self.sess.run(self.action, {self.s: s})  # get probabilities for all actions\n\n\nclass Critic(object):\n    def __init__(self, sess, n_features, lr=0.01):\n        self.sess = sess\n        with tf.name_scope('inputs'):\n            self.s = tf.placeholder(tf.float32, [1, n_features], \"state\")\n            self.v_ = tf.placeholder(tf.float32, [1, 1], name=\"v_next\")\n            self.r = tf.placeholder(tf.float32, name='r')\n\n        with tf.variable_scope('Critic'):\n            l1 = tf.layers.dense(\n                inputs=self.s,\n                units=30,  # number of hidden units\n                activation=tf.nn.relu,\n                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights\n                bias_initializer=tf.constant_initializer(0.1),  # biases\n                name='l1'\n            )\n\n            self.v = tf.layers.dense(\n                
inputs=l1,\n                units=1,  # output units\n                activation=None,\n                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights\n                bias_initializer=tf.constant_initializer(0.1),  # biases\n                name='V'\n            )\n\n        with tf.variable_scope('squared_TD_error'):\n            self.td_error = tf.reduce_mean(self.r + GAMMA * self.v_ - self.v)\n            self.loss = tf.square(self.td_error)    # TD_error = (r+gamma*V_next) - V_eval\n        with tf.variable_scope('train'):\n            self.train_op = tf.train.AdamOptimizer(lr).minimize(self.loss)\n\n    def learn(self, s, r, s_):\n        s, s_ = s[np.newaxis, :], s_[np.newaxis, :]\n\n        v_ = self.sess.run(self.v, {self.s: s_})\n        td_error, _ = self.sess.run([self.td_error, self.train_op],\n                                          {self.s: s, self.v_: v_, self.r: r})\n        return td_error\n\n\nOUTPUT_GRAPH = False\nMAX_EPISODE = 1000\nMAX_EP_STEPS = 200\nDISPLAY_REWARD_THRESHOLD = -100  # renders environment if total episode reward is greater then this threshold\nRENDER = False  # rendering wastes time\nGAMMA = 0.9\nLR_A = 0.001    # learning rate for actor\nLR_C = 0.01     # learning rate for critic\n\nenv = gym.make('Pendulum-v0')\nenv.seed(1)  # reproducible\nenv = env.unwrapped\n\nN_S = env.observation_space.shape[0]\nA_BOUND = env.action_space.high\n\nsess = tf.Session()\n\nactor = Actor(sess, n_features=N_S, lr=LR_A, action_bound=[-A_BOUND, A_BOUND])\ncritic = Critic(sess, n_features=N_S, lr=LR_C)\n\nsess.run(tf.global_variables_initializer())\n\nif OUTPUT_GRAPH:\n    tf.summary.FileWriter(\"logs/\", sess.graph)\n\nfor i_episode in range(MAX_EPISODE):\n    s = env.reset()\n    t = 0\n    ep_rs = []\n    while True:\n        # if RENDER:\n        env.render()\n        a = actor.choose_action(s)\n\n        s_, r, done, info = env.step(a)\n        r /= 10\n\n        td_error = critic.learn(s, r, s_)  # gradient = grad[r 
+ gamma * V(s_) - V(s)]\n        actor.learn(s, a, td_error)  # true_gradient = grad[logPi(s,a) * td_error]\n\n        s = s_\n        t += 1\n        ep_rs.append(r)\n        if t > MAX_EP_STEPS:\n            ep_rs_sum = sum(ep_rs)\n            if 'running_reward' not in globals():\n                running_reward = ep_rs_sum\n            else:\n                running_reward = running_reward * 0.9 + ep_rs_sum * 0.1\n            if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True  # rendering\n            print(\"episode:\", i_episode, \"  reward:\", int(running_reward))\n            break\n\n"
  },
  {
    "path": "contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG.py",
    "content": "\"\"\"\nDeep Deterministic Policy Gradient (DDPG), Reinforcement Learning.\nDDPG is Actor Critic based algorithm.\nPendulum example.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow 1.0\ngym 0.8.0\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\nimport gym\nimport time\n\n\nnp.random.seed(1)\ntf.set_random_seed(1)\n\n#####################  hyper parameters  ####################\n\nMAX_EPISODES = 200\nMAX_EP_STEPS = 200\nLR_A = 0.001    # learning rate for actor\nLR_C = 0.001    # learning rate for critic\nGAMMA = 0.9     # reward discount\nREPLACEMENT = [\n    dict(name='soft', tau=0.01),\n    dict(name='hard', rep_iter_a=600, rep_iter_c=500)\n][0]            # you can try different target replacement strategies\nMEMORY_CAPACITY = 10000\nBATCH_SIZE = 32\n\nRENDER = False\nOUTPUT_GRAPH = True\nENV_NAME = 'Pendulum-v0'\n\n###############################  Actor  ####################################\n\n\nclass Actor(object):\n    def __init__(self, sess, action_dim, action_bound, learning_rate, replacement):\n        self.sess = sess\n        self.a_dim = action_dim\n        self.action_bound = action_bound\n        self.lr = learning_rate\n        self.replacement = replacement\n        self.t_replace_counter = 0\n\n        with tf.variable_scope('Actor'):\n            # input s, output a\n            self.a = self._build_net(S, scope='eval_net', trainable=True)\n\n            # input s_, output a, get a_ for critic\n            self.a_ = self._build_net(S_, scope='target_net', trainable=False)\n\n        self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval_net')\n        self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target_net')\n\n        if self.replacement['name'] == 'hard':\n            self.t_replace_counter = 0\n            self.hard_replace = [tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)]\n        else:\n          
  self.soft_replace = [tf.assign(t, (1 - self.replacement['tau']) * t + self.replacement['tau'] * e)\n                                 for t, e in zip(self.t_params, self.e_params)]\n\n    def _build_net(self, s, scope, trainable):\n        with tf.variable_scope(scope):\n            init_w = tf.random_normal_initializer(0., 0.3)\n            init_b = tf.constant_initializer(0.1)\n            net = tf.layers.dense(s, 30, activation=tf.nn.relu,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l1',\n                                  trainable=trainable)\n            with tf.variable_scope('a'):\n                actions = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, kernel_initializer=init_w,\n                                          bias_initializer=init_b, name='a', trainable=trainable)\n                scaled_a = tf.multiply(actions, self.action_bound, name='scaled_a')  # Scale output to -action_bound to action_bound\n        return scaled_a\n\n    def learn(self, s):   # batch update\n        self.sess.run(self.train_op, feed_dict={S: s})\n\n        if self.replacement['name'] == 'soft':\n            self.sess.run(self.soft_replace)\n        else:\n            if self.t_replace_counter % self.replacement['rep_iter_a'] == 0:\n                self.sess.run(self.hard_replace)\n            self.t_replace_counter += 1\n\n    def choose_action(self, s):\n        s = s[np.newaxis, :]    # single state\n        return self.sess.run(self.a, feed_dict={S: s})[0]  # single action\n\n    def add_grad_to_graph(self, a_grads):\n        with tf.variable_scope('policy_grads'):\n            # ys = policy;\n            # xs = policy's parameters;\n            # a_grads = the gradients of the policy to get more Q\n            # tf.gradients will calculate dys/dxs with a initial gradients for ys, so this is dq/da * da/dparams\n            self.policy_grads = tf.gradients(ys=self.a, xs=self.e_params, grad_ys=a_grads)\n\n        
with tf.variable_scope('A_train'):\n            opt = tf.train.AdamOptimizer(-self.lr)  # (- learning rate) for ascent policy\n            self.train_op = opt.apply_gradients(zip(self.policy_grads, self.e_params))\n\n\n###############################  Critic  ####################################\n\nclass Critic(object):\n    def __init__(self, sess, state_dim, action_dim, learning_rate, gamma, replacement, a, a_):\n        self.sess = sess\n        self.s_dim = state_dim\n        self.a_dim = action_dim\n        self.lr = learning_rate\n        self.gamma = gamma\n        self.replacement = replacement\n\n        with tf.variable_scope('Critic'):\n            # Input (s, a), output q\n            self.a = tf.stop_gradient(a)    # stop critic update flows to actor\n            self.q = self._build_net(S, self.a, 'eval_net', trainable=True)\n\n            # Input (s_, a_), output q_ for q_target\n            self.q_ = self._build_net(S_, a_, 'target_net', trainable=False)    # target_q is based on a_ from Actor's target_net\n\n            self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval_net')\n            self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target_net')\n\n        with tf.variable_scope('target_q'):\n            self.target_q = R + self.gamma * self.q_\n\n        with tf.variable_scope('TD_error'):\n            self.loss = tf.reduce_mean(tf.squared_difference(self.target_q, self.q))\n\n        with tf.variable_scope('C_train'):\n            self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)\n\n        with tf.variable_scope('a_grad'):\n            self.a_grads = tf.gradients(self.q, self.a)[0]   # tensor of gradients of each sample (None, a_dim)\n\n        if self.replacement['name'] == 'hard':\n            self.t_replace_counter = 0\n            self.hard_replacement = [tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)]\n        else:\n            
self.soft_replacement = [tf.assign(t, (1 - self.replacement['tau']) * t + self.replacement['tau'] * e)\n                                     for t, e in zip(self.t_params, self.e_params)]\n\n    def _build_net(self, s, a, scope, trainable):\n        with tf.variable_scope(scope):\n            init_w = tf.random_normal_initializer(0., 0.1)\n            init_b = tf.constant_initializer(0.1)\n\n            with tf.variable_scope('l1'):\n                n_l1 = 30\n                w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], initializer=init_w, trainable=trainable)\n                w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], initializer=init_w, trainable=trainable)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=init_b, trainable=trainable)\n                net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)\n\n            with tf.variable_scope('q'):\n                q = tf.layers.dense(net, 1, kernel_initializer=init_w, bias_initializer=init_b, trainable=trainable)   # Q(s,a)\n        return q\n\n    def learn(self, s, a, r, s_):\n        self.sess.run(self.train_op, feed_dict={S: s, self.a: a, R: r, S_: s_})\n        if self.replacement['name'] == 'soft':\n            self.sess.run(self.soft_replacement)\n        else:\n            if self.t_replace_counter % self.replacement['rep_iter_c'] == 0:\n                self.sess.run(self.hard_replacement)\n            self.t_replace_counter += 1\n\n\n#####################  Memory  ####################\n\nclass Memory(object):\n    def __init__(self, capacity, dims):\n        self.capacity = capacity\n        self.data = np.zeros((capacity, dims))\n        self.pointer = 0\n\n    def store_transition(self, s, a, r, s_):\n        transition = np.hstack((s, a, [r], s_))\n        index = self.pointer % self.capacity  # replace the old memory with new memory\n        self.data[index, :] = transition\n        self.pointer += 1\n\n    def sample(self, n):\n        assert self.pointer >= 
self.capacity, 'Memory has not been fulfilled'\n        indices = np.random.choice(self.capacity, size=n)\n        return self.data[indices, :]\n\n\nenv = gym.make(ENV_NAME)\nenv = env.unwrapped\nenv.seed(1)\n\nstate_dim = env.observation_space.shape[0]\naction_dim = env.action_space.shape[0]\naction_bound = env.action_space.high\n\n# all placeholder for tf\nwith tf.name_scope('S'):\n    S = tf.placeholder(tf.float32, shape=[None, state_dim], name='s')\nwith tf.name_scope('R'):\n    R = tf.placeholder(tf.float32, [None, 1], name='r')\nwith tf.name_scope('S_'):\n    S_ = tf.placeholder(tf.float32, shape=[None, state_dim], name='s_')\n\n\nsess = tf.Session()\n\n# Create actor and critic.\n# They are actually connected to each other, details can be seen in tensorboard or in this picture:\nactor = Actor(sess, action_dim, action_bound, LR_A, REPLACEMENT)\ncritic = Critic(sess, state_dim, action_dim, LR_C, GAMMA, REPLACEMENT, actor.a, actor.a_)\nactor.add_grad_to_graph(critic.a_grads)\n\nsess.run(tf.global_variables_initializer())\n\nM = Memory(MEMORY_CAPACITY, dims=2 * state_dim + action_dim + 1)\n\nif OUTPUT_GRAPH:\n    tf.summary.FileWriter(\"logs/\", sess.graph)\n\nvar = 3  # control exploration\n\nt1 = time.time()\nfor i in range(MAX_EPISODES):\n    s = env.reset()\n    ep_reward = 0\n\n    for j in range(MAX_EP_STEPS):\n\n        if RENDER:\n            env.render()\n\n        # Add exploration noise\n        a = actor.choose_action(s)\n        a = np.clip(np.random.normal(a, var), -2, 2)    # add randomness to action selection for exploration\n        s_, r, done, info = env.step(a)\n\n        M.store_transition(s, a, r / 10, s_)\n\n        if M.pointer > MEMORY_CAPACITY:\n            var *= .9995    # decay the action randomness\n            b_M = M.sample(BATCH_SIZE)\n            b_s = b_M[:, :state_dim]\n            b_a = b_M[:, state_dim: state_dim + action_dim]\n            b_r = b_M[:, -state_dim - 1: -state_dim]\n            b_s_ = b_M[:, -state_dim:]\n\n   
         critic.learn(b_s, b_a, b_r, b_s_)\n            actor.learn(b_s)\n\n        s = s_\n        ep_reward += r\n\n        if j == MAX_EP_STEPS-1:\n            print('Episode:', i, ' Reward: %i' % int(ep_reward), 'Explore: %.2f' % var, )\n            if ep_reward > -300:\n                RENDER = True\n            break\n\nprint('Running time: ', time.time()-t1)"
  },
  {
    "path": "contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG_update.py",
    "content": "\"\"\"\nDeep Deterministic Policy Gradient (DDPG), Reinforcement Learning.\nDDPG is Actor Critic based algorithm.\nPendulum example.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow 1.0\ngym 0.8.0\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\nimport gym\nimport time\n\n\n#####################  hyper parameters  ####################\n\nMAX_EPISODES = 200\nMAX_EP_STEPS = 200\nLR_A = 0.001    # learning rate for actor\nLR_C = 0.002    # learning rate for critic\nGAMMA = 0.9     # reward discount\nTAU = 0.01      # soft replacement\nMEMORY_CAPACITY = 10000\nBATCH_SIZE = 32\n\nRENDER = False\nENV_NAME = 'Pendulum-v0'\n\n###############################  DDPG  ####################################\n\nclass DDPG(object):\n    def __init__(self, a_dim, s_dim, a_bound,):\n        self.memory = np.zeros((MEMORY_CAPACITY, s_dim * 2 + a_dim + 1), dtype=np.float32)\n        self.pointer = 0\n        self.sess = tf.Session()\n\n        self.a_dim, self.s_dim, self.a_bound = a_dim, s_dim, a_bound,\n        self.S = tf.placeholder(tf.float32, [None, s_dim], 's')\n        self.S_ = tf.placeholder(tf.float32, [None, s_dim], 's_')\n        self.R = tf.placeholder(tf.float32, [None, 1], 'r')\n\n        with tf.variable_scope('Actor'):\n            self.a = self._build_a(self.S, scope='eval', trainable=True)\n            a_ = self._build_a(self.S_, scope='target', trainable=False)\n        with tf.variable_scope('Critic'):\n            # assign self.a = a in memory when calculating q for td_error,\n            # otherwise the self.a is from Actor when updating Actor\n            q = self._build_c(self.S, self.a, scope='eval', trainable=True)\n            q_ = self._build_c(self.S_, a_, scope='target', trainable=False)\n\n        # networks parameters\n        self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval')\n        self.at_params = 
tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')\n        self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval')\n        self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')\n\n        # target net replacement\n        self.soft_replace = [tf.assign(t, (1 - TAU) * t + TAU * e)\n                             for t, e in zip(self.at_params + self.ct_params, self.ae_params + self.ce_params)]\n\n        q_target = self.R + GAMMA * q_\n        # when running the critic update, the feed_dict replaces self.a with the actions stored in memory\n        td_error = tf.losses.mean_squared_error(labels=q_target, predictions=q)\n        self.ctrain = tf.train.AdamOptimizer(LR_C).minimize(td_error, var_list=self.ce_params)\n\n        a_loss = - tf.reduce_mean(q)    # maximize q\n        self.atrain = tf.train.AdamOptimizer(LR_A).minimize(a_loss, var_list=self.ae_params)\n\n        self.sess.run(tf.global_variables_initializer())\n\n    def choose_action(self, s):\n        return self.sess.run(self.a, {self.S: s[np.newaxis, :]})[0]\n\n    def learn(self):\n        # soft target replacement\n        self.sess.run(self.soft_replace)\n\n        indices = np.random.choice(MEMORY_CAPACITY, size=BATCH_SIZE)\n        bt = self.memory[indices, :]\n        bs = bt[:, :self.s_dim]\n        ba = bt[:, self.s_dim: self.s_dim + self.a_dim]\n        br = bt[:, -self.s_dim - 1: -self.s_dim]\n        bs_ = bt[:, -self.s_dim:]\n\n        self.sess.run(self.atrain, {self.S: bs})\n        self.sess.run(self.ctrain, {self.S: bs, self.a: ba, self.R: br, self.S_: bs_})\n\n    def store_transition(self, s, a, r, s_):\n        transition = np.hstack((s, a, [r], s_))\n        index = self.pointer % MEMORY_CAPACITY  # overwrite the oldest memory with the new one\n        self.memory[index, :] = transition\n        self.pointer += 1\n\n    def _build_a(self, s, scope, trainable):\n        with tf.variable_scope(scope):\n            net 
= tf.layers.dense(s, 30, activation=tf.nn.relu, name='l1', trainable=trainable)\n            a = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, name='a', trainable=trainable)\n            return tf.multiply(a, self.a_bound, name='scaled_a')\n\n    def _build_c(self, s, a, scope, trainable):\n        with tf.variable_scope(scope):\n            n_l1 = 30\n            w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], trainable=trainable)\n            w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], trainable=trainable)\n            b1 = tf.get_variable('b1', [1, n_l1], trainable=trainable)\n            net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)\n            return tf.layers.dense(net, 1, trainable=trainable)  # Q(s,a)\n\n###############################  training  ####################################\n\nenv = gym.make(ENV_NAME)\nenv = env.unwrapped\nenv.seed(1)\n\ns_dim = env.observation_space.shape[0]\na_dim = env.action_space.shape[0]\na_bound = env.action_space.high\n\nddpg = DDPG(a_dim, s_dim, a_bound)\n\nvar = 3  # control exploration\nt1 = time.time()\nfor i in range(MAX_EPISODES):\n    s = env.reset()\n    ep_reward = 0\n    for j in range(MAX_EP_STEPS):\n        if RENDER:\n            env.render()\n\n        # Add exploration noise\n        a = ddpg.choose_action(s)\n        a = np.clip(np.random.normal(a, var), -2, 2)    # add randomness to action selection for exploration\n        s_, r, done, info = env.step(a)\n\n        ddpg.store_transition(s, a, r / 10, s_)\n\n        if ddpg.pointer > MEMORY_CAPACITY:\n            var *= .9995    # decay the action randomness\n            ddpg.learn()\n\n        s = s_\n        ep_reward += r\n        if j == MAX_EP_STEPS-1:\n            print('Episode:', i, ' Reward: %i' % int(ep_reward), 'Explore: %.2f' % var, )\n            # if ep_reward > -300:RENDER = True\n            break\nprint('Running time: ', time.time() - t1)"
  },
  {
    "path": "contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG_update2.py",
    "content": "\"\"\"\nNote: This is a updated version from my previous code,\nfor the target network, I use moving average to soft replace target parameters instead using assign function.\nBy doing this, it has 20% speed up on my machine (CPU).\n\nDeep Deterministic Policy Gradient (DDPG), Reinforcement Learning.\nDDPG is Actor Critic based algorithm.\nPendulum example.\n\nView more on my tutorial page: https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow 1.0\ngym 0.8.0\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\nimport gym\nimport time\n\n\n#####################  hyper parameters  ####################\n\nMAX_EPISODES = 200\nMAX_EP_STEPS = 200\nLR_A = 0.001    # learning rate for actor\nLR_C = 0.002    # learning rate for critic\nGAMMA = 0.9     # reward discount\nTAU = 0.01      # soft replacement\nMEMORY_CAPACITY = 10000\nBATCH_SIZE = 32\n\nRENDER = False\nENV_NAME = 'Pendulum-v0'\n\n\n###############################  DDPG  ####################################\n\n\nclass DDPG(object):\n    def __init__(self, a_dim, s_dim, a_bound,):\n        self.memory = np.zeros((MEMORY_CAPACITY, s_dim * 2 + a_dim + 1), dtype=np.float32)\n        self.pointer = 0\n        self.sess = tf.Session()\n\n        self.a_dim, self.s_dim, self.a_bound = a_dim, s_dim, a_bound,\n        self.S = tf.placeholder(tf.float32, [None, s_dim], 's')\n        self.S_ = tf.placeholder(tf.float32, [None, s_dim], 's_')\n        self.R = tf.placeholder(tf.float32, [None, 1], 'r')\n\n        self.a = self._build_a(self.S,)\n        q = self._build_c(self.S, self.a, )\n        a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='Actor')\n        c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='Critic')\n        ema = tf.train.ExponentialMovingAverage(decay=1 - TAU)          # soft replacement\n\n        def ema_getter(getter, name, *args, **kwargs):\n            return ema.average(getter(name, *args, **kwargs))\n\n        target_update = 
[ema.apply(a_params), ema.apply(c_params)]      # soft update operation\n        a_ = self._build_a(self.S_, reuse=True, custom_getter=ema_getter)   # replaced target parameters\n        q_ = self._build_c(self.S_, a_, reuse=True, custom_getter=ema_getter)\n\n        a_loss = - tf.reduce_mean(q)  # maximize the q\n        self.atrain = tf.train.AdamOptimizer(LR_A).minimize(a_loss, var_list=a_params)\n\n        with tf.control_dependencies(target_update):    # soft replacement happened at here\n            q_target = self.R + GAMMA * q_\n            td_error = tf.losses.mean_squared_error(labels=q_target, predictions=q)\n            self.ctrain = tf.train.AdamOptimizer(LR_C).minimize(td_error, var_list=c_params)\n\n        self.sess.run(tf.global_variables_initializer())\n\n    def choose_action(self, s):\n        return self.sess.run(self.a, {self.S: s[np.newaxis, :]})[0]\n\n    def learn(self):\n        indices = np.random.choice(MEMORY_CAPACITY, size=BATCH_SIZE)\n        bt = self.memory[indices, :]\n        bs = bt[:, :self.s_dim]\n        ba = bt[:, self.s_dim: self.s_dim + self.a_dim]\n        br = bt[:, -self.s_dim - 1: -self.s_dim]\n        bs_ = bt[:, -self.s_dim:]\n\n        self.sess.run(self.atrain, {self.S: bs})\n        self.sess.run(self.ctrain, {self.S: bs, self.a: ba, self.R: br, self.S_: bs_})\n\n    def store_transition(self, s, a, r, s_):\n        transition = np.hstack((s, a, [r], s_))\n        index = self.pointer % MEMORY_CAPACITY  # replace the old memory with new memory\n        self.memory[index, :] = transition\n        self.pointer += 1\n\n    def _build_a(self, s, reuse=None, custom_getter=None):\n        trainable = True if reuse is None else False\n        with tf.variable_scope('Actor', reuse=reuse, custom_getter=custom_getter):\n            net = tf.layers.dense(s, 30, activation=tf.nn.relu, name='l1', trainable=trainable)\n            a = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, name='a', trainable=trainable)\n        
    return tf.multiply(a, self.a_bound, name='scaled_a')\n\n    def _build_c(self, s, a, reuse=None, custom_getter=None):\n        trainable = True if reuse is None else False\n        with tf.variable_scope('Critic', reuse=reuse, custom_getter=custom_getter):\n            n_l1 = 30\n            w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], trainable=trainable)\n            w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], trainable=trainable)\n            b1 = tf.get_variable('b1', [1, n_l1], trainable=trainable)\n            net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)\n            return tf.layers.dense(net, 1, trainable=trainable)  # Q(s,a)\n\n\n###############################  training  ####################################\n\nenv = gym.make(ENV_NAME)\nenv = env.unwrapped\nenv.seed(1)\n\ns_dim = env.observation_space.shape[0]\na_dim = env.action_space.shape[0]\na_bound = env.action_space.high\n\nddpg = DDPG(a_dim, s_dim, a_bound)\n\nvar = 3  # control exploration\nt1 = time.time()\nfor i in range(MAX_EPISODES):\n    s = env.reset()\n    ep_reward = 0\n    for j in range(MAX_EP_STEPS):\n        if RENDER:\n            env.render()\n\n        # Add exploration noise\n        a = ddpg.choose_action(s)\n        a = np.clip(np.random.normal(a, var), -2, 2)    # add randomness to action selection for exploration\n        s_, r, done, info = env.step(a)\n\n        ddpg.store_transition(s, a, r / 10, s_)\n\n        if ddpg.pointer > MEMORY_CAPACITY:\n            var *= .9995    # decay the action randomness\n            ddpg.learn()\n\n        s = s_\n        ep_reward += r\n        if j == MAX_EP_STEPS-1:\n            print('Episode:', i, ' Reward: %i' % int(ep_reward), 'Explore: %.2f' % var, )\n            # if ep_reward > -300:RENDER = True\n            break\n\nprint('Running time: ', time.time() - t1)"
  },
  {
    "path": "contents/Curiosity_Model/Curiosity.py",
    "content": "\"\"\"This is a simple implementation of [Large-Scale Study of Curiosity-Driven Learning](https://arxiv.org/abs/1808.04355)\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\nimport gym\nimport matplotlib.pyplot as plt\n\n\nclass CuriosityNet:\n    def __init__(\n            self,\n            n_a,\n            n_s,\n            lr=0.01,\n            gamma=0.98,\n            epsilon=0.95,\n            replace_target_iter=300,\n            memory_size=10000,\n            batch_size=128,\n            output_graph=False,\n    ):\n        self.n_a = n_a\n        self.n_s = n_s\n        self.lr = lr\n        self.gamma = gamma\n        self.epsilon = epsilon\n        self.replace_target_iter = replace_target_iter\n        self.memory_size = memory_size\n        self.batch_size = batch_size\n\n        # total learning step\n        self.learn_step_counter = 0\n        self.memory_counter = 0\n\n        # initialize zero memory [s, a, r, s_]\n        self.memory = np.zeros((self.memory_size, n_s * 2 + 2))\n        self.tfs, self.tfa, self.tfr, self.tfs_, self.dyn_train, self.dqn_train, self.q, self.int_r = \\\n            self._build_nets()\n\n        t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='target_net')\n        e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='eval_net')\n\n        with tf.variable_scope('hard_replacement'):\n            self.target_replace_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]\n\n        self.sess = tf.Session()\n\n        if output_graph:\n            tf.summary.FileWriter(\"logs/\", self.sess.graph)\n\n        self.sess.run(tf.global_variables_initializer())\n\n    def _build_nets(self):\n        tfs = tf.placeholder(tf.float32, [None, self.n_s], name=\"s\")    # input State\n        tfa = tf.placeholder(tf.int32, [None, ], name=\"a\")              # input Action\n        tfr = tf.placeholder(tf.float32, [None, ], name=\"ext_r\")        # extrinsic reward\n        tfs_ 
= tf.placeholder(tf.float32, [None, self.n_s], name=\"s_\")  # input Next State\n\n        # dynamics net\n        dyn_s_, curiosity, dyn_train = self._build_dynamics_net(tfs, tfa, tfs_)\n\n        # normal RL model\n        total_reward = tf.add(curiosity, tfr, name=\"total_r\")\n        q, dqn_loss, dqn_train = self._build_dqn(tfs, tfa, total_reward, tfs_)\n        return tfs, tfa, tfr, tfs_, dyn_train, dqn_train, q, curiosity\n\n    def _build_dynamics_net(self, s, a, s_):\n        with tf.variable_scope(\"dyn_net\"):\n            float_a = tf.expand_dims(tf.cast(a, dtype=tf.float32, name=\"float_a\"), axis=1, name=\"2d_a\")\n            sa = tf.concat((s, float_a), axis=1, name=\"sa\")\n            encoded_s_ = s_                # here we use s_ as the encoded s_\n\n            dyn_l = tf.layers.dense(sa, 32, activation=tf.nn.relu)\n            dyn_s_ = tf.layers.dense(dyn_l, self.n_s)  # predicted s_\n        with tf.name_scope(\"int_r\"):\n            squared_diff = tf.reduce_sum(tf.square(encoded_s_ - dyn_s_), axis=1)  # intrinsic reward\n\n        # It is better to reduce the learning rate in order to stay curious\n        train_op = tf.train.RMSPropOptimizer(self.lr, name=\"dyn_opt\").minimize(tf.reduce_mean(squared_diff))\n        return dyn_s_, squared_diff, train_op\n\n    def _build_dqn(self, s, a, r, s_):\n        with tf.variable_scope('eval_net'):\n            e1 = tf.layers.dense(s, 128, tf.nn.relu)\n            q = tf.layers.dense(e1, self.n_a, name=\"q\")\n        with tf.variable_scope('target_net'):\n            t1 = tf.layers.dense(s_, 128, tf.nn.relu)\n            q_ = tf.layers.dense(t1, self.n_a, name=\"q_\")\n\n        with tf.variable_scope('q_target'):\n            q_target = r + self.gamma * tf.reduce_max(q_, axis=1, name=\"Qmax_s_\")\n\n        with tf.variable_scope('q_wrt_a'):\n            a_indices = tf.stack([tf.range(tf.shape(a)[0], dtype=tf.int32), a], axis=1)\n            q_wrt_a = tf.gather_nd(params=q, indices=a_indices)\n\n   
     loss = tf.losses.mean_squared_error(labels=q_target, predictions=q_wrt_a)   # TD error\n        train_op = tf.train.RMSPropOptimizer(self.lr, name=\"dqn_opt\").minimize(\n            loss, var_list=tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, \"eval_net\"))\n        return q, loss, train_op\n\n    def store_transition(self, s, a, r, s_):            \n        transition = np.hstack((s, [a, r], s_))\n        # replace the old memory with new memory\n        index = self.memory_counter % self.memory_size\n        self.memory[index, :] = transition\n        self.memory_counter += 1\n\n    def choose_action(self, observation):\n        # to have batch dimension when feed into tf placeholder\n        s = observation[np.newaxis, :]\n\n        if np.random.uniform() < self.epsilon:\n            # forward feed the observation and get q value for every actions\n            actions_value = self.sess.run(self.q, feed_dict={self.tfs: s})\n            action = np.argmax(actions_value)\n        else:\n            action = np.random.randint(0, self.n_a)\n        return action\n\n    def learn(self):\n        # check to replace target parameters\n        if self.learn_step_counter % self.replace_target_iter == 0:\n            self.sess.run(self.target_replace_op)\n\n        # sample batch memory from all memory\n        top = self.memory_size if self.memory_counter > self.memory_size else self.memory_counter\n        sample_index = np.random.choice(top, size=self.batch_size)\n        batch_memory = self.memory[sample_index, :]\n\n        bs, ba, br, bs_ = batch_memory[:, :self.n_s], batch_memory[:, self.n_s], \\\n            batch_memory[:, self.n_s + 1], batch_memory[:, -self.n_s:]\n        self.sess.run(self.dqn_train, feed_dict={self.tfs: bs, self.tfa: ba, self.tfr: br, self.tfs_: bs_})\n        if self.learn_step_counter % 1000 == 0:     # delay training in order to stay curious\n            self.sess.run(self.dyn_train, feed_dict={self.tfs: bs, self.tfa: ba, 
self.tfs_: bs_})\n        self.learn_step_counter += 1\n\n\nenv = gym.make('MountainCar-v0')\nenv = env.unwrapped\n\ndqn = CuriosityNet(n_a=3, n_s=2, lr=0.01, output_graph=False)\nep_steps = []\nfor epi in range(200):\n    s = env.reset()\n    steps = 0\n    while True:\n        env.render()\n        a = dqn.choose_action(s)\n        s_, r, done, info = env.step(a)\n        dqn.store_transition(s, a, r, s_)\n        dqn.learn()\n        if done:\n            print('Epi: ', epi, \"| steps: \", steps)\n            ep_steps.append(steps)\n            break\n        s = s_\n        steps += 1\n\nplt.plot(ep_steps)\nplt.ylabel(\"steps\")\nplt.xlabel(\"episode\")\nplt.show()"
  },
  {
    "path": "contents/Curiosity_Model/Random_Network_Distillation.py",
    "content": "\"\"\"This is a simple implementation of [Exploration by Random Network Distillation](https://arxiv.org/abs/1810.12894)\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\nimport gym\nimport matplotlib.pyplot as plt\n\n\nclass CuriosityNet:\n    def __init__(\n            self,\n            n_a,\n            n_s,\n            lr=0.01,\n            gamma=0.95,\n            epsilon=1.,\n            replace_target_iter=300,\n            memory_size=10000,\n            batch_size=128,\n            output_graph=False,\n    ):\n        self.n_a = n_a\n        self.n_s = n_s\n        self.lr = lr\n        self.gamma = gamma\n        self.epsilon = epsilon\n        self.replace_target_iter = replace_target_iter\n        self.memory_size = memory_size\n        self.batch_size = batch_size\n        self.s_encode_size = 1000       # give a hard job for predictor to learn\n\n        # total learning step\n        self.learn_step_counter = 0\n        self.memory_counter = 0\n\n        # initialize zero memory [s, a, r, s_]\n        self.memory = np.zeros((self.memory_size, n_s * 2 + 2))\n        self.tfs, self.tfa, self.tfr, self.tfs_, self.pred_train, self.dqn_train, self.q = \\\n            self._build_nets()\n\n        t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='target_net')\n        e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='eval_net')\n\n        with tf.variable_scope('hard_replacement'):\n            self.target_replace_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]\n\n        self.sess = tf.Session()\n\n        if output_graph:\n            tf.summary.FileWriter(\"logs/\", self.sess.graph)\n\n        self.sess.run(tf.global_variables_initializer())\n\n    def _build_nets(self):\n        tfs = tf.placeholder(tf.float32, [None, self.n_s], name=\"s\")    # input State\n        tfa = tf.placeholder(tf.int32, [None, ], name=\"a\")              # input Action\n        tfr = tf.placeholder(tf.float32, 
[None, ], name=\"ext_r\")        # extrinsic reward\n        tfs_ = tf.placeholder(tf.float32, [None, self.n_s], name=\"s_\")  # input Next State\n\n        # fixed random net\n        with tf.variable_scope(\"random_net\"):\n            rand_encode_s_ = tf.layers.dense(tfs_, self.s_encode_size)\n\n        # predictor\n        ri, pred_train = self._build_predictor(tfs_, rand_encode_s_)\n\n        # normal RL model\n        q, dqn_loss, dqn_train = self._build_dqn(tfs, tfa, ri, tfr, tfs_)\n        return tfs, tfa, tfr, tfs_, pred_train, dqn_train, q\n\n    def _build_predictor(self, s_, rand_encode_s_):\n        with tf.variable_scope(\"predictor\"):\n            net = tf.layers.dense(s_, 128, tf.nn.relu)\n            out = tf.layers.dense(net, self.s_encode_size)\n\n        with tf.name_scope(\"int_r\"):\n            ri = tf.reduce_sum(tf.square(rand_encode_s_ - out), axis=1)  # intrinsic reward\n        train_op = tf.train.RMSPropOptimizer(self.lr, name=\"predictor_opt\").minimize(\n            tf.reduce_mean(ri), var_list=tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, \"predictor\"))\n\n        return ri, train_op\n\n    def _build_dqn(self, s, a, ri, re, s_):\n        with tf.variable_scope('eval_net'):\n            e1 = tf.layers.dense(s, 128, tf.nn.relu)\n            q = tf.layers.dense(e1, self.n_a, name=\"q\")\n        with tf.variable_scope('target_net'):\n            t1 = tf.layers.dense(s_, 128, tf.nn.relu)\n            q_ = tf.layers.dense(t1, self.n_a, name=\"q_\")\n\n        with tf.variable_scope('q_target'):\n            q_target = re + ri + self.gamma * tf.reduce_max(q_, axis=1, name=\"Qmax_s_\")\n\n        with tf.variable_scope('q_wrt_a'):\n            a_indices = tf.stack([tf.range(tf.shape(a)[0], dtype=tf.int32), a], axis=1)\n            q_wrt_a = tf.gather_nd(params=q, indices=a_indices)\n\n        loss = tf.losses.mean_squared_error(labels=q_target, predictions=q_wrt_a)   # TD error\n        train_op = tf.train.RMSPropOptimizer(self.lr, 
name=\"dqn_opt\").minimize(\n            loss, var_list=tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, \"eval_net\"))\n        return q, loss, train_op\n\n    def store_transition(self, s, a, r, s_):            \n        transition = np.hstack((s, [a, r], s_))\n        # replace the old memory with new memory\n        index = self.memory_counter % self.memory_size\n        self.memory[index, :] = transition\n        self.memory_counter += 1\n\n    def choose_action(self, observation):\n        # to have batch dimension when feed into tf placeholder\n        s = observation[np.newaxis, :]\n\n        if np.random.uniform() < self.epsilon:\n            # forward feed the observation and get q value for every actions\n            actions_value = self.sess.run(self.q, feed_dict={self.tfs: s})\n            action = np.argmax(actions_value)\n        else:\n            action = np.random.randint(0, self.n_a)\n        return action\n\n    def learn(self):\n        # check to replace target parameters\n        if self.learn_step_counter % self.replace_target_iter == 0:\n            self.sess.run(self.target_replace_op)\n\n        # sample batch memory from all memory\n        top = self.memory_size if self.memory_counter > self.memory_size else self.memory_counter\n        sample_index = np.random.choice(top, size=self.batch_size)\n        batch_memory = self.memory[sample_index, :]\n\n        bs, ba, br, bs_ = batch_memory[:, :self.n_s], batch_memory[:, self.n_s], \\\n            batch_memory[:, self.n_s + 1], batch_memory[:, -self.n_s:]\n        self.sess.run(self.dqn_train, feed_dict={self.tfs: bs, self.tfa: ba, self.tfr: br, self.tfs_: bs_})\n        if self.learn_step_counter % 100 == 0:     # delay training in order to stay curious\n            self.sess.run(self.pred_train, feed_dict={self.tfs_: bs_})\n        self.learn_step_counter += 1\n\n\nenv = gym.make('MountainCar-v0')\nenv = env.unwrapped\n\ndqn = CuriosityNet(n_a=3, n_s=2, lr=0.01, 
output_graph=False)\nep_steps = []\nfor epi in range(200):\n    s = env.reset()\n    steps = 0\n    while True:\n        # env.render()\n        a = dqn.choose_action(s)\n        s_, r, done, info = env.step(a)\n        dqn.store_transition(s, a, r, s_)\n        dqn.learn()\n        if done:\n            print('Epi: ', epi, \"| steps: \", steps)\n            ep_steps.append(steps)\n            break\n        s = s_\n        steps += 1\n\nplt.plot(ep_steps)\nplt.ylabel(\"steps\")\nplt.xlabel(\"episode\")\nplt.show()"
  },
  {
    "path": "experiments/2D_car/DDPG.py",
    "content": "\"\"\"\nEnvironment is a 2D car.\nCar has 5 sensors to obtain distance information.\n\nCar collision => reward = -1, otherwise => reward = 0.\n \nYou can train this RL by using LOAD = False, after training, this model will be store in the a local folder.\nUsing LOAD = True to reload the trained model for playing.\n\nYou can customize this script in a way you want.\n\nView more on [莫烦Python] : https://morvanzhou.github.io/tutorials/\n\nRequirement:\npyglet >= 1.2.4\nnumpy >= 1.12.1\ntensorflow >= 1.0.1\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\nimport os\nimport shutil\nfrom car_env import CarEnv\n\n\nnp.random.seed(1)\ntf.set_random_seed(1)\n\nMAX_EPISODES = 500\nMAX_EP_STEPS = 600\nLR_A = 1e-4  # learning rate for actor\nLR_C = 1e-4  # learning rate for critic\nGAMMA = 0.9  # reward discount\nREPLACE_ITER_A = 800\nREPLACE_ITER_C = 700\nMEMORY_CAPACITY = 2000\nBATCH_SIZE = 16\nVAR_MIN = 0.1\nRENDER = True\nLOAD = False\nDISCRETE_ACTION = False\n\nenv = CarEnv(discrete_action=DISCRETE_ACTION)\nSTATE_DIM = env.state_dim\nACTION_DIM = env.action_dim\nACTION_BOUND = env.action_bound\n\n# all placeholder for tf\nwith tf.name_scope('S'):\n    S = tf.placeholder(tf.float32, shape=[None, STATE_DIM], name='s')\nwith tf.name_scope('R'):\n    R = tf.placeholder(tf.float32, [None, 1], name='r')\nwith tf.name_scope('S_'):\n    S_ = tf.placeholder(tf.float32, shape=[None, STATE_DIM], name='s_')\n\n\nclass Actor(object):\n    def __init__(self, sess, action_dim, action_bound, learning_rate, t_replace_iter):\n        self.sess = sess\n        self.a_dim = action_dim\n        self.action_bound = action_bound\n        self.lr = learning_rate\n        self.t_replace_iter = t_replace_iter\n        self.t_replace_counter = 0\n\n        with tf.variable_scope('Actor'):\n            # input s, output a\n            self.a = self._build_net(S, scope='eval_net', trainable=True)\n\n            # input s_, output a, get a_ for critic\n            self.a_ = 
self._build_net(S_, scope='target_net', trainable=False)\n\n        self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval_net')\n        self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target_net')\n\n    def _build_net(self, s, scope, trainable):\n        with tf.variable_scope(scope):\n            init_w = tf.contrib.layers.xavier_initializer()\n            init_b = tf.constant_initializer(0.001)\n            net = tf.layers.dense(s, 100, activation=tf.nn.relu,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l1',\n                                  trainable=trainable)\n            net = tf.layers.dense(net, 20, activation=tf.nn.relu,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l2',\n                                  trainable=trainable)\n            with tf.variable_scope('a'):\n                actions = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, kernel_initializer=init_w,\n                                          name='a', trainable=trainable)\n                scaled_a = tf.multiply(actions, self.action_bound, name='scaled_a')  # Scale output to -action_bound to action_bound\n        return scaled_a\n\n    def learn(self, s):   # batch update\n        self.sess.run(self.train_op, feed_dict={S: s})\n        if self.t_replace_counter % self.t_replace_iter == 0:\n            self.sess.run([tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)])\n        self.t_replace_counter += 1\n\n    def choose_action(self, s):\n        s = s[np.newaxis, :]    # single state\n        return self.sess.run(self.a, feed_dict={S: s})[0]  # single action\n\n    def add_grad_to_graph(self, a_grads):\n        with tf.variable_scope('policy_grads'):\n            self.policy_grads = tf.gradients(ys=self.a, xs=self.e_params, grad_ys=a_grads)\n\n        with tf.variable_scope('A_train'):\n            opt = 
tf.train.RMSPropOptimizer(-self.lr)  # (- learning rate) for ascent policy\n            self.train_op = opt.apply_gradients(zip(self.policy_grads, self.e_params))\n\n\nclass Critic(object):\n    def __init__(self, sess, state_dim, action_dim, learning_rate, gamma, t_replace_iter, a, a_):\n        self.sess = sess\n        self.s_dim = state_dim\n        self.a_dim = action_dim\n        self.lr = learning_rate\n        self.gamma = gamma\n        self.t_replace_iter = t_replace_iter\n        self.t_replace_counter = 0\n\n        with tf.variable_scope('Critic'):\n            # Input (s, a), output q\n            self.a = a\n            self.q = self._build_net(S, self.a, 'eval_net', trainable=True)\n\n            # Input (s_, a_), output q_ for q_target\n            self.q_ = self._build_net(S_, a_, 'target_net', trainable=False)    # target_q is based on a_ from Actor's target_net\n\n            self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval_net')\n            self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target_net')\n\n        with tf.variable_scope('target_q'):\n            self.target_q = R + self.gamma * self.q_\n\n        with tf.variable_scope('TD_error'):\n            self.loss = tf.reduce_mean(tf.squared_difference(self.target_q, self.q))\n\n        with tf.variable_scope('C_train'):\n            self.train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)\n\n        with tf.variable_scope('a_grad'):\n            self.a_grads = tf.gradients(self.q, a)[0]   # tensor of gradients of each sample (None, a_dim)\n\n    def _build_net(self, s, a, scope, trainable):\n        with tf.variable_scope(scope):\n            init_w = tf.contrib.layers.xavier_initializer()\n            init_b = tf.constant_initializer(0.01)\n\n            with tf.variable_scope('l1'):\n                n_l1 = 100\n                w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], initializer=init_w, 
trainable=trainable)\n                w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], initializer=init_w, trainable=trainable)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=init_b, trainable=trainable)\n                net = tf.nn.relu6(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)\n            net = tf.layers.dense(net, 20, activation=tf.nn.relu,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l2',\n                                  trainable=trainable)\n            with tf.variable_scope('q'):\n                q = tf.layers.dense(net, 1, kernel_initializer=init_w, bias_initializer=init_b, trainable=trainable)   # Q(s,a)\n        return q\n\n    def learn(self, s, a, r, s_):\n        self.sess.run(self.train_op, feed_dict={S: s, self.a: a, R: r, S_: s_})\n        if self.t_replace_counter % self.t_replace_iter == 0:\n            self.sess.run([tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)])\n        self.t_replace_counter += 1\n\n\nclass Memory(object):\n    def __init__(self, capacity, dims):\n        self.capacity = capacity\n        self.data = np.zeros((capacity, dims))\n        self.pointer = 0\n\n    def store_transition(self, s, a, r, s_):\n        transition = np.hstack((s, a, [r], s_))\n        index = self.pointer % self.capacity  # replace the old memory with new memory\n        self.data[index, :] = transition\n        self.pointer += 1\n\n    def sample(self, n):\n        assert self.pointer >= self.capacity, 'Memory has not been fulfilled'\n        indices = np.random.choice(self.capacity, size=n)\n        return self.data[indices, :]\n\n\nsess = tf.Session()\n\n# Create actor and critic.\nactor = Actor(sess, ACTION_DIM, ACTION_BOUND[1], LR_A, REPLACE_ITER_A)\ncritic = Critic(sess, STATE_DIM, ACTION_DIM, LR_C, GAMMA, REPLACE_ITER_C, actor.a, actor.a_)\nactor.add_grad_to_graph(critic.a_grads)\n\nM = Memory(MEMORY_CAPACITY, dims=2 * STATE_DIM + ACTION_DIM + 1)\n\nsaver 
= tf.train.Saver()\npath = './discrete' if DISCRETE_ACTION else './continuous'\n\nif LOAD:\n    saver.restore(sess, tf.train.latest_checkpoint(path))\nelse:\n    sess.run(tf.global_variables_initializer())\n\n\ndef train():\n    var = 2.  # control exploration\n    for ep in range(MAX_EPISODES):\n        s = env.reset()\n        ep_step = 0\n\n        for t in range(MAX_EP_STEPS):\n        # while True:\n            if RENDER:\n                env.render()\n\n            # Added exploration noise\n            a = actor.choose_action(s)\n            a = np.clip(np.random.normal(a, var), *ACTION_BOUND)    # add randomness to action selection for exploration\n            s_, r, done = env.step(a)\n            M.store_transition(s, a, r, s_)\n\n            if M.pointer > MEMORY_CAPACITY:\n                var = max([var*.9995, VAR_MIN])    # decay the action randomness\n                b_M = M.sample(BATCH_SIZE)\n                b_s = b_M[:, :STATE_DIM]\n                b_a = b_M[:, STATE_DIM: STATE_DIM + ACTION_DIM]\n                b_r = b_M[:, -STATE_DIM - 1: -STATE_DIM]\n                b_s_ = b_M[:, -STATE_DIM:]\n\n                critic.learn(b_s, b_a, b_r, b_s_)\n                actor.learn(b_s)\n\n            s = s_\n            ep_step += 1\n\n            if done or t == MAX_EP_STEPS - 1:\n            # if done:\n                print('Ep:', ep,\n                      '| Steps: %i' % int(ep_step),\n                      '| Explore: %.2f' % var,\n                      )\n                break\n\n    if os.path.isdir(path): shutil.rmtree(path)\n    os.mkdir(path)\n    ckpt_path = os.path.join(path, 'DDPG.ckpt')\n    save_path = saver.save(sess, ckpt_path, write_meta_graph=False)\n    print(\"\\nSave Model %s\\n\" % save_path)\n\n\ndef eval():\n    env.set_fps(30)\n    while True:\n        s = env.reset()\n        while True:\n            env.render()\n            a = actor.choose_action(s)\n            s_, r, done = env.step(a)\n            s = s_\n            if 
done:\n                break\n\nif __name__ == '__main__':\n    if LOAD:\n        eval()\n    else:\n        train()"
  },
  {
    "path": "experiments/2D_car/car_env.py",
    "content": "\"\"\"\nEnvironment for 2D car driving.\nYou can customize this script in a way you want.\n\nView more on [莫烦Python] : https://morvanzhou.github.io/tutorials/\n\n\nRequirement:\npyglet >= 1.2.4\nnumpy >= 1.12.1\n\"\"\"\nimport numpy as np\nimport pyglet\n\n\npyglet.clock.set_fps_limit(10000)\n\n\nclass CarEnv(object):\n    n_sensor = 5\n    action_dim = 1\n    state_dim = n_sensor\n    viewer = None\n    viewer_xy = (500, 500)\n    sensor_max = 150.\n    start_point = [450, 300]\n    speed = 50.\n    dt = 0.1\n\n    def __init__(self, discrete_action=False):\n        self.is_discrete_action = discrete_action\n        if discrete_action:\n            self.actions = [-1, 0, 1]\n        else:\n            self.action_bound = [-1, 1]\n\n        self.terminal = False\n        # node1 (x, y, r, w, l),\n        self.car_info = np.array([0, 0, 0, 20, 40], dtype=np.float64)   # car coordination\n        self.obstacle_coords = np.array([\n            [120, 120],\n            [380, 120],\n            [380, 380],\n            [120, 380],\n        ])\n        self.sensor_info = self.sensor_max + np.zeros((self.n_sensor, 3))  # n sensors, (distance, end_x, end_y)\n\n    def step(self, action):\n        if self.is_discrete_action:\n            action = self.actions[action]\n        else:\n            action = np.clip(action, *self.action_bound)[0]\n        self.car_info[2] += action * np.pi/30  # max r = 6 degree\n        self.car_info[:2] = self.car_info[:2] + \\\n                            self.speed * self.dt * np.array([np.cos(self.car_info[2]), np.sin(self.car_info[2])])\n\n        self._update_sensor()\n        s = self._get_state()\n        r = -1 if self.terminal else 0\n        return s, r, self.terminal\n\n    def reset(self):\n        self.terminal = False\n        self.car_info[:3] = np.array([*self.start_point, -np.pi/2])\n        self._update_sensor()\n        return self._get_state()\n\n    def render(self):\n        if self.viewer is None:\n       
     self.viewer = Viewer(*self.viewer_xy, self.car_info, self.sensor_info, self.obstacle_coords)\n        self.viewer.render()\n\n    def sample_action(self):\n        if self.is_discrete_action:\n            a = np.random.choice(list(range(3)))\n        else:\n            a = np.random.uniform(*self.action_bound, size=self.action_dim)\n        return a\n\n    def set_fps(self, fps=30):\n        pyglet.clock.set_fps_limit(fps)\n\n    def _get_state(self):\n        s = self.sensor_info[:, 0].flatten()/self.sensor_max\n        return s\n\n    def _update_sensor(self):\n        cx, cy, rotation = self.car_info[:3]\n\n        n_sensors = len(self.sensor_info)\n        sensor_theta = np.linspace(-np.pi / 2, np.pi / 2, n_sensors)\n        xs = cx + (np.zeros((n_sensors, ))+self.sensor_max) * np.cos(sensor_theta)\n        ys = cy + (np.zeros((n_sensors, ))+self.sensor_max) * np.sin(sensor_theta)\n        xys = np.array([[x, y] for x, y in zip(xs, ys)])    # shape (5 sensors, 2)\n\n        # sensors\n        tmp_x = xys[:, 0] - cx\n        tmp_y = xys[:, 1] - cy\n        # apply rotation\n        rotated_x = tmp_x * np.cos(rotation) - tmp_y * np.sin(rotation)\n        rotated_y = tmp_x * np.sin(rotation) + tmp_y * np.cos(rotation)\n        # rotated x y\n        self.sensor_info[:, -2:] = np.vstack([rotated_x+cx, rotated_y+cy]).T\n\n        q = np.array([cx, cy])\n        for si in range(len(self.sensor_info)):\n            s = self.sensor_info[si, -2:] - q\n            possible_sensor_distance = [self.sensor_max]\n            possible_intersections = [self.sensor_info[si, -2:]]\n\n            # obstacle collision\n            for oi in range(len(self.obstacle_coords)):\n                p = self.obstacle_coords[oi]\n                r = self.obstacle_coords[(oi + 1) % len(self.obstacle_coords)] - self.obstacle_coords[oi]\n                if np.cross(r, s) != 0:  # may collision\n                    t = np.cross((q - p), s) / np.cross(r, s)\n                    u = 
np.cross((q - p), r) / np.cross(r, s)\n                    if 0 <= t <= 1 and 0 <= u <= 1:\n                        intersection = q + u * s\n                        possible_intersections.append(intersection)\n                        possible_sensor_distance.append(np.linalg.norm(u*s))\n\n            # window collision\n            win_coord = np.array([\n                [0, 0],\n                [self.viewer_xy[0], 0],\n                [*self.viewer_xy],\n                [0, self.viewer_xy[1]],\n                [0, 0],\n            ])\n            for oi in range(4):\n                p = win_coord[oi]\n                r = win_coord[(oi + 1) % len(win_coord)] - win_coord[oi]\n                if np.cross(r, s) != 0:  # may collision\n                    t = np.cross((q - p), s) / np.cross(r, s)\n                    u = np.cross((q - p), r) / np.cross(r, s)\n                    if 0 <= t <= 1 and 0 <= u <= 1:\n                        intersection = p + t * r\n                        possible_intersections.append(intersection)\n                        possible_sensor_distance.append(np.linalg.norm(intersection - q))\n\n            distance = np.min(possible_sensor_distance)\n            distance_index = np.argmin(possible_sensor_distance)\n            self.sensor_info[si, 0] = distance\n            self.sensor_info[si, -2:] = possible_intersections[distance_index]\n            if distance < self.car_info[-1]/2:\n                self.terminal = True\n\n\nclass Viewer(pyglet.window.Window):\n    color = {\n        'background': [1]*3 + [1]\n    }\n    fps_display = pyglet.clock.ClockDisplay()\n    bar_thc = 5\n\n    def __init__(self, width, height, car_info, sensor_info, obstacle_coords):\n        super(Viewer, self).__init__(width, height, resizable=False, caption='2D car', vsync=False)  # vsync=False to not use the monitor FPS\n        self.set_location(x=80, y=10)\n        pyglet.gl.glClearColor(*self.color['background'])\n\n        self.car_info = car_info\n        
self.sensor_info = sensor_info\n\n        self.batch = pyglet.graphics.Batch()\n        background = pyglet.graphics.OrderedGroup(0)\n        foreground = pyglet.graphics.OrderedGroup(1)\n\n        self.sensors = []\n        line_coord = [0, 0] * 2\n        c = (73, 73, 73) * 2\n        for i in range(len(self.sensor_info)):\n            self.sensors.append(self.batch.add(2, pyglet.gl.GL_LINES, foreground, ('v2f', line_coord), ('c3B', c)))\n\n        car_box = [0, 0] * 4\n        c = (249, 86, 86) * 4\n        self.car = self.batch.add(4, pyglet.gl.GL_QUADS, foreground, ('v2f', car_box), ('c3B', c))\n\n        c = (134, 181, 244) * 4\n        self.obstacle = self.batch.add(4, pyglet.gl.GL_QUADS, background, ('v2f', obstacle_coords.flatten()), ('c3B', c))\n\n    def render(self):\n        pyglet.clock.tick()\n        self._update()\n        self.switch_to()\n        self.dispatch_events()\n        self.dispatch_event('on_draw')\n        self.flip()\n\n    def on_draw(self):\n        self.clear()\n        self.batch.draw()\n        # self.fps_display.draw()\n\n    def _update(self):\n        cx, cy, r, w, l = self.car_info\n\n        # sensors\n        for i, sensor in enumerate(self.sensors):\n            sensor.vertices = [cx, cy, *self.sensor_info[i, -2:]]\n\n        # car\n        xys = [\n            [cx + l / 2, cy + w / 2],\n            [cx - l / 2, cy + w / 2],\n            [cx - l / 2, cy - w / 2],\n            [cx + l / 2, cy - w / 2],\n        ]\n        r_xys = []\n        for x, y in xys:\n            tempX = x - cx\n            tempY = y - cy\n            # apply rotation\n            rotatedX = tempX * np.cos(r) - tempY * np.sin(r)\n            rotatedY = tempX * np.sin(r) + tempY * np.cos(r)\n            # rotated x y\n            x = rotatedX + cx\n            y = rotatedY + cy\n            r_xys += [x, y]\n        self.car.vertices = r_xys\n\n\nif __name__ == '__main__':\n    np.random.seed(1)\n    env = CarEnv()\n    env.set_fps(30)\n    for ep in 
range(20):\n        s = env.reset()\n        # for t in range(100):\n        while True:\n            env.render()\n            s, r, done = env.step(env.sample_action())\n            if done:\n                break"
  },
  {
    "path": "experiments/2D_car/collision.py",
    "content": "import numpy as np\n\ndef intersection():\n    p = np.array([0, 0])\n    r = np.array([1, 1])\n    q = np.array([0.1, 0.1])\n    s = np.array([.1, .1])\n\n    if np.cross(r, s) == 0 and np.cross((q-p), r) == 0:    # collinear\n        # t0 = (q − p) · r / (r · r)\n        # t1 = (q + s − p) · r / (r · r) = t0 + s · r / (r · r)\n        t0 = np.dot(q-p, r)/np.dot(r, r)\n        t1 = t0 + np.dot(s, r)/np.dot(r, r)\n        print(t1, t0)\n        if ((np.dot(s, r) > 0) and (0 <= t1 - t0 <= 1)) or ((np.dot(s, r) <= 0) and (0 <= t0 - t1 <= 1)):\n            print('collinear and overlapping, q_s in p_r')\n        else:\n            print('collinear and disjoint')\n    elif np.cross(r, s) == 0 and np.cross((q-p), r) != 0:  # parallel r × s = 0 and (q − p) × r ≠ 0,\n        print('parallel')\n    else:\n        t = np.cross((q - p), s) / np.cross(r, s)\n        u = np.cross((q - p), r) / np.cross(r, s)\n        if 0 <= t <= 1 and 0 <= u <= 1:\n            # If r × s ≠ 0 and 0 ≤ t ≤ 1 and 0 ≤ u ≤ 1, the two line segments meet at the point p + t r = q + u s\n            print('intersection: ', p + t*r)\n        else:\n            print('not parallel and not intersect')\n\n\ndef point2segment():\n    p = np.array([-1, 1])    # coordination of point\n    a = np.array([0, 1])    # coordination of line segment end 1\n    b = np.array([1, 0])    # coordination of line segment end 2\n    ab = b-a    # line ab\n    ap = p-a\n    distance = np.abs(np.cross(ab, ap)/np.linalg.norm(ab))  # d = (AB x AC)/|AB|\n    print(distance)\n\n    # angle  Cos(θ) = A dot B /(|A||B|)\n    bp = p-b\n    cosTheta1 = np.dot(ap, ab) / (np.linalg.norm(ap) * np.linalg.norm(ab))\n    theta1 = np.arccos(cosTheta1)\n    cosTheta2 = np.dot(bp, ab) / (np.linalg.norm(bp) * np.linalg.norm(ab))\n    theta2 = np.arccos(cosTheta2)\n    if np.pi/2 <= (theta1 % (np.pi*2)) <= 3/2 * np.pi:\n        print('out of a')\n    elif -np.pi/2 <= (theta2 % (np.pi*2)) <= np.pi/2:\n        print('out of b')\n    
else:\n        print('between a and b')\n\n\n\nif __name__ == '__main__':\n    point2segment()\n    # intersection()\n"
  },
  {
    "path": "experiments/Robot_arm/A3C.py",
    "content": "\"\"\"\nEnvironment is a Robot Arm. The arm tries to get to the blue point.\nThe environment will return a geographic (distance) information for the arm to learn.\n\nThe far away from blue point the less reward; touch blue r+=1; stop at blue for a while then get r=+10.\n \nYou can train this RL by using LOAD = False, after training, this model will be store in the a local folder.\nUsing LOAD = True to reload the trained model for playing.\n\nYou can customize this script in a way you want.\n\nView more on [莫烦Python] : https://morvanzhou.github.io/tutorials/\n\n\nRequirement:\npyglet >= 1.2.4\nnumpy >= 1.12.1\ntensorflow >= 1.0.1\n\"\"\"\n\nimport multiprocessing\nimport threading\nimport tensorflow as tf\nimport numpy as np\nfrom arm_env import ArmEnv\n\n\n# np.random.seed(1)\n# tf.set_random_seed(1)\n\nMAX_GLOBAL_EP = 2000\nMAX_EP_STEP = 300\nUPDATE_GLOBAL_ITER = 5\nN_WORKERS = multiprocessing.cpu_count()\nLR_A = 1e-4  # learning rate for actor\nLR_C = 2e-4  # learning rate for critic\nGAMMA = 0.9  # reward discount\nMODE = ['easy', 'hard']\nn_model = 1\nGLOBAL_NET_SCOPE = 'Global_Net'\nENTROPY_BETA = 0.01\nGLOBAL_RUNNING_R = []\nGLOBAL_EP = 0\n\n\nenv = ArmEnv(mode=MODE[n_model])\nN_S = env.state_dim\nN_A = env.action_dim\nA_BOUND = env.action_bound\ndel env\n\n\nclass ACNet(object):\n    def __init__(self, scope, globalAC=None):\n\n        if scope == GLOBAL_NET_SCOPE:   # get global network\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self._build_net()\n                self.a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n                self.c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n        else:   # local net, calculate losses\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_his = 
tf.placeholder(tf.float32, [None, N_A], 'A')\n                self.v_target = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\n\n                mu, sigma, self.v = self._build_net()\n\n                td = tf.subtract(self.v_target, self.v, name='TD_error')\n                with tf.name_scope('c_loss'):\n                    self.c_loss = tf.reduce_mean(tf.square(td))\n\n                with tf.name_scope('wrap_a_out'):\n                    self.test = sigma[0]\n                    mu, sigma = mu * A_BOUND[1], sigma + 1e-5\n\n                normal_dist = tf.contrib.distributions.Normal(mu, sigma)\n\n                with tf.name_scope('a_loss'):\n                    log_prob = normal_dist.log_prob(self.a_his)\n                    exp_v = log_prob * td\n                    entropy = normal_dist.entropy()  # encourage exploration\n                    self.exp_v = ENTROPY_BETA * entropy + exp_v\n                    self.a_loss = tf.reduce_mean(-self.exp_v)\n\n                with tf.name_scope('choose_a'):  # use local params to choose action\n                    self.A = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=0), *A_BOUND)\n                with tf.name_scope('local_grad'):\n                    self.a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n                    self.c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n                    self.a_grads = tf.gradients(self.a_loss, self.a_params)\n                    self.c_grads = tf.gradients(self.c_loss, self.c_params)\n\n            with tf.name_scope('sync'):\n                with tf.name_scope('pull'):\n                    self.pull_a_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.a_params, globalAC.a_params)]\n                    self.pull_c_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.c_params, globalAC.c_params)]\n                with tf.name_scope('push'):\n                    self.update_a_op = 
OPT_A.apply_gradients(zip(self.a_grads, globalAC.a_params))\n                    self.update_c_op = OPT_C.apply_gradients(zip(self.c_grads, globalAC.c_params))\n\n    def _build_net(self):\n        w_init = tf.contrib.layers.xavier_initializer()\n        with tf.variable_scope('actor'):\n            l_a = tf.layers.dense(self.s, 400, tf.nn.relu6, kernel_initializer=w_init, name='la')\n            l_a = tf.layers.dense(l_a, 300, tf.nn.relu6, kernel_initializer=w_init, name='la2')\n            mu = tf.layers.dense(l_a, N_A, tf.nn.tanh, kernel_initializer=w_init, name='mu')\n            sigma = tf.layers.dense(l_a, N_A, tf.nn.softplus, kernel_initializer=w_init, name='sigma')\n        with tf.variable_scope('critic'):\n            l_c = tf.layers.dense(self.s, 400, tf.nn.relu6, kernel_initializer=w_init, name='lc')\n            l_c = tf.layers.dense(l_c, 200, tf.nn.relu6, kernel_initializer=w_init, name='lc2')\n            v = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='v')  # state value\n        return mu, sigma, v\n\n    def update_global(self, feed_dict):  # run by a local\n        _, _, t = SESS.run([self.update_a_op, self.update_c_op, self.test], feed_dict)  # local grads applies to global net\n        return t\n\n    def pull_global(self):  # run by a local\n        SESS.run([self.pull_a_params_op, self.pull_c_params_op])\n\n    def choose_action(self, s):  # run by a local\n        s = s[np.newaxis, :]\n        return SESS.run(self.A, {self.s: s})[0]\n\n\nclass Worker(object):\n    def __init__(self, name, globalAC):\n        self.env = ArmEnv(mode=MODE[n_model])\n        self.name = name\n        self.AC = ACNet(name, globalAC)\n\n    def work(self):\n        global GLOBAL_RUNNING_R, GLOBAL_EP\n        total_step = 1\n        buffer_s, buffer_a, buffer_r = [], [], []\n        while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP:\n            s = self.env.reset()\n            ep_r = 0\n            for ep_t in range(MAX_EP_STEP):\n           
     if self.name == 'W_0':\n                    self.env.render()\n                a = self.AC.choose_action(s)\n                s_, r, done = self.env.step(a)\n                if ep_t == MAX_EP_STEP - 1: done = True\n                ep_r += r\n                buffer_s.append(s)\n                buffer_a.append(a)\n                buffer_r.append(r)\n\n                if total_step % UPDATE_GLOBAL_ITER == 0 or done:   # update global and assign to local net\n                    if done:\n                        v_s_ = 0   # terminal\n                    else:\n                        v_s_ = SESS.run(self.AC.v, {self.AC.s: s_[np.newaxis, :]})[0, 0]\n                    buffer_v_target = []\n                    for r in buffer_r[::-1]:    # reverse buffer r\n                        v_s_ = r + GAMMA * v_s_\n                        buffer_v_target.append(v_s_)\n                    buffer_v_target.reverse()\n\n                    buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.vstack(buffer_a), np.vstack(buffer_v_target)\n                    feed_dict = {\n                        self.AC.s: buffer_s,\n                        self.AC.a_his: buffer_a,\n                        self.AC.v_target: buffer_v_target,\n                    }\n                    test = self.AC.update_global(feed_dict)\n                    buffer_s, buffer_a, buffer_r = [], [], []\n                    self.AC.pull_global()\n\n                s = s_\n                total_step += 1\n                if done:\n                    if len(GLOBAL_RUNNING_R) == 0:  # record running episode reward\n                        GLOBAL_RUNNING_R.append(ep_r)\n                    else:\n                        GLOBAL_RUNNING_R.append(0.9 * GLOBAL_RUNNING_R[-1] + 0.1 * ep_r)\n                    print(\n                        self.name,\n                        \"Ep:\", GLOBAL_EP,\n                        \"| Ep_r: %i\" % GLOBAL_RUNNING_R[-1],\n                        '| Var:', test,\n\n             
             )\n                    GLOBAL_EP += 1\n                    break\n\nif __name__ == \"__main__\":\n    SESS = tf.Session()\n\n    with tf.device(\"/cpu:0\"):\n        OPT_A = tf.train.RMSPropOptimizer(LR_A, name='RMSPropA')\n        OPT_C = tf.train.RMSPropOptimizer(LR_C, name='RMSPropC')\n        GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE)  # we only need its params\n        workers = []\n        # Create worker\n        for i in range(N_WORKERS):\n            i_name = 'W_%i' % i   # worker name\n            workers.append(Worker(i_name, GLOBAL_AC))\n\n    COORD = tf.train.Coordinator()\n    SESS.run(tf.global_variables_initializer())\n\n    worker_threads = []\n    for worker in workers:\n        # pass the bound method directly; a lambda would look up `worker` late and could run the wrong worker\n        t = threading.Thread(target=worker.work)\n        t.start()\n        worker_threads.append(t)\n    COORD.join(worker_threads)\n\n\n"
  },
  {
    "path": "experiments/Robot_arm/DDPG.py",
    "content": "\"\"\"\nEnvironment is a Robot Arm. The arm tries to get to the blue point.\nThe environment will return a geographic (distance) information for the arm to learn.\n\nThe far away from blue point the less reward; touch blue r+=1; stop at blue for a while then get r=+10.\n \nYou can train this RL by using LOAD = False, after training, this model will be store in the a local folder.\nUsing LOAD = True to reload the trained model for playing.\n\nYou can customize this script in a way you want.\n\nView more on [莫烦Python] : https://morvanzhou.github.io/tutorials/\n\nRequirement:\npyglet >= 1.2.4\nnumpy >= 1.12.1\ntensorflow >= 1.0.1\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\nimport os\nimport shutil\nfrom arm_env import ArmEnv\n\n\nnp.random.seed(1)\ntf.set_random_seed(1)\n\nMAX_EPISODES = 600\nMAX_EP_STEPS = 200\nLR_A = 1e-4  # learning rate for actor\nLR_C = 1e-4  # learning rate for critic\nGAMMA = 0.9  # reward discount\nREPLACE_ITER_A = 1100\nREPLACE_ITER_C = 1000\nMEMORY_CAPACITY = 5000\nBATCH_SIZE = 16\nVAR_MIN = 0.1\nRENDER = True\nLOAD = False\nMODE = ['easy', 'hard']\nn_model = 1\n\nenv = ArmEnv(mode=MODE[n_model])\nSTATE_DIM = env.state_dim\nACTION_DIM = env.action_dim\nACTION_BOUND = env.action_bound\n\n# all placeholder for tf\nwith tf.name_scope('S'):\n    S = tf.placeholder(tf.float32, shape=[None, STATE_DIM], name='s')\nwith tf.name_scope('R'):\n    R = tf.placeholder(tf.float32, [None, 1], name='r')\nwith tf.name_scope('S_'):\n    S_ = tf.placeholder(tf.float32, shape=[None, STATE_DIM], name='s_')\n\n\nclass Actor(object):\n    def __init__(self, sess, action_dim, action_bound, learning_rate, t_replace_iter):\n        self.sess = sess\n        self.a_dim = action_dim\n        self.action_bound = action_bound\n        self.lr = learning_rate\n        self.t_replace_iter = t_replace_iter\n        self.t_replace_counter = 0\n\n        with tf.variable_scope('Actor'):\n            # input s, output a\n            self.a = 
self._build_net(S, scope='eval_net', trainable=True)\n\n            # input s_, output a, get a_ for critic\n            self.a_ = self._build_net(S_, scope='target_net', trainable=False)\n\n        self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval_net')\n        self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target_net')\n        self.replace = [tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)]\n\n    def _build_net(self, s, scope, trainable):\n        with tf.variable_scope(scope):\n            init_w = tf.contrib.layers.xavier_initializer()\n            init_b = tf.constant_initializer(0.001)\n            net = tf.layers.dense(s, 200, activation=tf.nn.relu6,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l1',\n                                  trainable=trainable)\n            net = tf.layers.dense(net, 200, activation=tf.nn.relu6,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l2',\n                                  trainable=trainable)\n            net = tf.layers.dense(net, 10, activation=tf.nn.relu,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l3',\n                                  trainable=trainable)\n            with tf.variable_scope('a'):\n                actions = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, kernel_initializer=init_w,\n                                          name='a', trainable=trainable)\n                scaled_a = tf.multiply(actions, self.action_bound, name='scaled_a')  # Scale output to -action_bound to action_bound\n        return scaled_a\n\n    def learn(self, s):   # batch update\n        self.sess.run(self.train_op, feed_dict={S: s})\n        if self.t_replace_counter % self.t_replace_iter == 0:\n            self.sess.run(self.replace)\n        self.t_replace_counter += 1\n\n    def 
choose_action(self, s):\n        s = s[np.newaxis, :]    # single state\n        return self.sess.run(self.a, feed_dict={S: s})[0]  # single action\n\n    def add_grad_to_graph(self, a_grads):\n        with tf.variable_scope('policy_grads'):\n            self.policy_grads = tf.gradients(ys=self.a, xs=self.e_params, grad_ys=a_grads)\n\n        with tf.variable_scope('A_train'):\n            opt = tf.train.RMSPropOptimizer(-self.lr)  # (- learning rate) for ascent policy\n            self.train_op = opt.apply_gradients(zip(self.policy_grads, self.e_params))\n\n\nclass Critic(object):\n    def __init__(self, sess, state_dim, action_dim, learning_rate, gamma, t_replace_iter, a, a_):\n        self.sess = sess\n        self.s_dim = state_dim\n        self.a_dim = action_dim\n        self.lr = learning_rate\n        self.gamma = gamma\n        self.t_replace_iter = t_replace_iter\n        self.t_replace_counter = 0\n\n        with tf.variable_scope('Critic'):\n            # Input (s, a), output q\n            self.a = a\n            self.q = self._build_net(S, self.a, 'eval_net', trainable=True)\n\n            # Input (s_, a_), output q_ for q_target\n            self.q_ = self._build_net(S_, a_, 'target_net', trainable=False)    # target_q is based on a_ from Actor's target_net\n\n            self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval_net')\n            self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target_net')\n\n        with tf.variable_scope('target_q'):\n            self.target_q = R + self.gamma * self.q_\n\n        with tf.variable_scope('TD_error'):\n            self.loss = tf.reduce_mean(tf.squared_difference(self.target_q, self.q))\n\n        with tf.variable_scope('C_train'):\n            self.train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)\n\n        with tf.variable_scope('a_grad'):\n            self.a_grads = tf.gradients(self.q, a)[0]   # tensor of gradients of each 
sample (None, a_dim)\n        self.replace = [tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)]\n\n    def _build_net(self, s, a, scope, trainable):\n        with tf.variable_scope(scope):\n            init_w = tf.contrib.layers.xavier_initializer()\n            init_b = tf.constant_initializer(0.01)\n\n            with tf.variable_scope('l1'):\n                n_l1 = 200\n                w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], initializer=init_w, trainable=trainable)\n                w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], initializer=init_w, trainable=trainable)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=init_b, trainable=trainable)\n                net = tf.nn.relu6(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)\n            net = tf.layers.dense(net, 200, activation=tf.nn.relu6,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l2',\n                                  trainable=trainable)\n            net = tf.layers.dense(net, 10, activation=tf.nn.relu,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l3',\n                                  trainable=trainable)\n            with tf.variable_scope('q'):\n                q = tf.layers.dense(net, 1, kernel_initializer=init_w, bias_initializer=init_b, trainable=trainable)   # Q(s,a)\n        return q\n\n    def learn(self, s, a, r, s_):\n        self.sess.run(self.train_op, feed_dict={S: s, self.a: a, R: r, S_: s_})\n        if self.t_replace_counter % self.t_replace_iter == 0:\n            self.sess.run(self.replace)\n        self.t_replace_counter += 1\n\n\nclass Memory(object):\n    def __init__(self, capacity, dims):\n        self.capacity = capacity\n        self.data = np.zeros((capacity, dims))\n        self.pointer = 0\n\n    def store_transition(self, s, a, r, s_):\n        transition = np.hstack((s, a, [r], s_))\n        index = self.pointer % self.capacity  # 
replace the old memory with new memory\n        self.data[index, :] = transition\n        self.pointer += 1\n\n    def sample(self, n):\n        assert self.pointer >= self.capacity, 'Memory has not been fulfilled'\n        indices = np.random.choice(self.capacity, size=n)\n        return self.data[indices, :]\n\n\nsess = tf.Session()\n\n# Create actor and critic.\nactor = Actor(sess, ACTION_DIM, ACTION_BOUND[1], LR_A, REPLACE_ITER_A)\ncritic = Critic(sess, STATE_DIM, ACTION_DIM, LR_C, GAMMA, REPLACE_ITER_C, actor.a, actor.a_)\nactor.add_grad_to_graph(critic.a_grads)\n\nM = Memory(MEMORY_CAPACITY, dims=2 * STATE_DIM + ACTION_DIM + 1)\n\nsaver = tf.train.Saver()\npath = './'+MODE[n_model]\n\nif LOAD:\n    saver.restore(sess, tf.train.latest_checkpoint(path))\nelse:\n    sess.run(tf.global_variables_initializer())\n\n\ndef train():\n    var = 2.  # control exploration\n\n    for ep in range(MAX_EPISODES):\n        s = env.reset()\n        ep_reward = 0\n\n        for t in range(MAX_EP_STEPS):\n        # while True:\n            if RENDER:\n                env.render()\n\n            # Added exploration noise\n            a = actor.choose_action(s)\n            a = np.clip(np.random.normal(a, var), *ACTION_BOUND)    # add randomness to action selection for exploration\n            s_, r, done = env.step(a)\n            M.store_transition(s, a, r, s_)\n\n            if M.pointer > MEMORY_CAPACITY:\n                var = max([var*.9999, VAR_MIN])    # decay the action randomness\n                b_M = M.sample(BATCH_SIZE)\n                b_s = b_M[:, :STATE_DIM]\n                b_a = b_M[:, STATE_DIM: STATE_DIM + ACTION_DIM]\n                b_r = b_M[:, -STATE_DIM - 1: -STATE_DIM]\n                b_s_ = b_M[:, -STATE_DIM:]\n\n                critic.learn(b_s, b_a, b_r, b_s_)\n                actor.learn(b_s)\n\n            s = s_\n            ep_reward += r\n\n            if t == MAX_EP_STEPS-1 or done:\n            # if done:\n                result = '| done' if 
done else '| ----'\n                print('Ep:', ep,\n                      result,\n                      '| R: %i' % int(ep_reward),\n                      '| Explore: %.2f' % var,\n                      )\n                break\n\n    if os.path.isdir(path): shutil.rmtree(path)\n    os.mkdir(path)\n    ckpt_path = os.path.join('./'+MODE[n_model], 'DDPG.ckpt')\n    save_path = saver.save(sess, ckpt_path, write_meta_graph=False)\n    print(\"\\nSave Model %s\\n\" % save_path)\n\n\ndef eval():\n    env.set_fps(30)\n    s = env.reset()\n    while True:\n        if RENDER:\n            env.render()\n        a = actor.choose_action(s)\n        s_, r, done = env.step(a)\n        s = s_\n\nif __name__ == '__main__':\n    if LOAD:\n        eval()\n    else:\n        train()\n"
  },
  {
    "path": "experiments/Robot_arm/DPPO.py",
    "content": "\"\"\"\nA simple version of OpenAI's Proximal Policy Optimization (PPO). [http://adsabs.harvard.edu/abs/2017arXiv170706347S]\n\nDistributing workers in parallel to collect data, then stop worker's roll-out and train PPO on collected data.\nRestart workers once PPO is updated.\n\nThe global PPO updating rule is adopted from DeepMind's paper (DPPO):\nEmergence of Locomotion Behaviours in Rich Environments (Google Deepmind): [http://adsabs.harvard.edu/abs/2017arXiv170702286H]\n\nView more on my tutorial website: https://morvanzhou.github.io/tutorials\n\nDependencies:\ntensorflow r1.2\ngym 0.9.2\n\"\"\"\n\nimport tensorflow as tf\nfrom tensorflow.contrib.distributions import Normal\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport threading, queue\nfrom arm_env import ArmEnv\n\n\nEP_MAX = 2000\nEP_LEN = 300\nN_WORKER = 4                # parallel workers\nGAMMA = 0.9                 # reward discount factor\nA_LR = 0.0001               # learning rate for actor\nC_LR = 0.0005                # learning rate for critic\nMIN_BATCH_SIZE = 64         # minimum batch size for updating PPO\nUPDATE_STEP = 5             # loop update operation n-steps\nEPSILON = 0.2               # Clipped surrogate objective\nMODE = ['easy', 'hard']\nn_model = 1\n\nenv = ArmEnv(mode=MODE[n_model])\nS_DIM = env.state_dim\nA_DIM = env.action_dim\nA_BOUND = env.action_bound[1]\n\n\nclass PPO(object):\n    def __init__(self):\n        self.sess = tf.Session()\n\n        self.tfs = tf.placeholder(tf.float32, [None, S_DIM], 'state')\n\n        # critic\n        l1 = tf.layers.dense(self.tfs, 100, tf.nn.relu)\n        self.v = tf.layers.dense(l1, 1)\n        self.tfdc_r = tf.placeholder(tf.float32, [None, 1], 'discounted_r')\n        self.advantage = self.tfdc_r - self.v\n        self.closs = tf.reduce_mean(tf.square(self.advantage))\n        self.ctrain_op = tf.train.AdamOptimizer(C_LR).minimize(self.closs)\n\n        # actor\n        pi, pi_params = self._build_anet('pi', 
trainable=True)\n        oldpi, oldpi_params = self._build_anet('oldpi', trainable=False)\n        self.sample_op = tf.squeeze(pi.sample(1), axis=0)  # sample an action\n        self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]\n\n        self.tfa = tf.placeholder(tf.float32, [None, A_DIM], 'action')\n        self.tfadv = tf.placeholder(tf.float32, [None, 1], 'advantage')\n        # ratio = tf.exp(pi.log_prob(self.tfa) - oldpi.log_prob(self.tfa))\n        ratio = pi.prob(self.tfa) / (oldpi.prob(self.tfa) + 1e-5)\n        surr = ratio * self.tfadv   # surrogate objective\n\n        self.aloss = -tf.reduce_mean(tf.minimum(\n            surr,\n            tf.clip_by_value(ratio, 1. - EPSILON, 1. + EPSILON) * self.tfadv))\n\n        self.atrain_op = tf.train.AdamOptimizer(A_LR).minimize(self.aloss)\n        self.sess.run(tf.global_variables_initializer())\n\n    def update(self):\n        global GLOBAL_UPDATE_COUNTER\n        while not COORD.should_stop():\n            if GLOBAL_EP < EP_MAX:\n                UPDATE_EVENT.wait()         # wait until a batch of data is ready\n                self.sess.run(self.update_oldpi_op)   # copy pi to old pi\n                data = [QUEUE.get() for _ in range(QUEUE.qsize())]\n                data = np.vstack(data)\n                s, a, r = data[:, :S_DIM], data[:, S_DIM: S_DIM + A_DIM], data[:, -1:]\n                adv = self.sess.run(self.advantage, {self.tfs: s, self.tfdc_r: r})\n                [self.sess.run(self.atrain_op, {self.tfs: s, self.tfa: a, self.tfadv: adv}) for _ in range(UPDATE_STEP)]\n                [self.sess.run(self.ctrain_op, {self.tfs: s, self.tfdc_r: r}) for _ in range(UPDATE_STEP)]\n                UPDATE_EVENT.clear()        # updating finished\n                GLOBAL_UPDATE_COUNTER = 0   # reset counter\n                ROLLING_EVENT.set()         # allow roll-outs to resume\n\n    def _build_anet(self, name, trainable):\n        with tf.variable_scope(name):\n            l1 = 
tf.layers.dense(self.tfs, 200, tf.nn.relu, trainable=trainable)\n            mu = A_BOUND * tf.layers.dense(l1, A_DIM, tf.nn.tanh, trainable=trainable)\n            sigma = tf.layers.dense(l1, A_DIM, tf.nn.softplus, trainable=trainable)\n            norm_dist = Normal(loc=mu, scale=sigma)\n        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)\n        return norm_dist, params\n\n    def choose_action(self, s):\n        s = s[np.newaxis, :]\n        a = self.sess.run(self.sample_op, {self.tfs: s})[0]\n        return np.clip(a, -A_BOUND, A_BOUND)    # clip to the environment's action bound\n\n    def get_v(self, s):\n        if s.ndim < 2: s = s[np.newaxis, :]\n        return self.sess.run(self.v, {self.tfs: s})[0, 0]\n\n\nclass Worker(object):\n    def __init__(self, wid):\n        self.wid = wid\n        self.env = ArmEnv(mode=MODE[n_model])\n        self.ppo = GLOBAL_PPO\n\n    def work(self):\n        global GLOBAL_EP, GLOBAL_RUNNING_R, GLOBAL_UPDATE_COUNTER\n        while not COORD.should_stop():\n            s = self.env.reset()\n            ep_r = 0\n            buffer_s, buffer_a, buffer_r = [], [], []\n            for t in range(EP_LEN):\n                if not ROLLING_EVENT.is_set():                  # while global PPO is updating\n                    ROLLING_EVENT.wait()                        # wait until PPO is updated\n                    buffer_s, buffer_a, buffer_r = [], [], []   # clear history buffer\n                a = self.ppo.choose_action(s)\n                s_, r, done = self.env.step(a)\n                buffer_s.append(s)\n                buffer_a.append(a)\n                buffer_r.append(r)                    # store the raw reward (no normalization is applied here)\n                s = s_\n                ep_r += r\n\n                GLOBAL_UPDATE_COUNTER += 1                      # count to minimum batch size\n                if t == EP_LEN - 1 or GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE:\n                    v_s_ = self.ppo.get_v(s_)\n                    discounted_r = []                   
        # compute discounted reward\n                    for r in buffer_r[::-1]:\n                        v_s_ = r + GAMMA * v_s_\n                        discounted_r.append(v_s_)\n                    discounted_r.reverse()\n\n                    bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis]\n                    buffer_s, buffer_a, buffer_r = [], [], []\n                    QUEUE.put(np.hstack((bs, ba, br)))\n                    if GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE:\n                        ROLLING_EVENT.clear()       # stop collecting data\n                        UPDATE_EVENT.set()          # globalPPO update\n\n                    if GLOBAL_EP >= EP_MAX:         # stop training\n                        COORD.request_stop()\n                        break\n\n            # record reward changes, plot later\n            if len(GLOBAL_RUNNING_R) == 0: GLOBAL_RUNNING_R.append(ep_r)\n            else: GLOBAL_RUNNING_R.append(GLOBAL_RUNNING_R[-1]*0.9+ep_r*0.1)\n            GLOBAL_EP += 1\n            print('{0:.1f}%'.format(GLOBAL_EP/EP_MAX*100), '|W%i' % self.wid,  '|Ep_r: %.2f' % ep_r,)\n\n\nif __name__ == '__main__':\n    GLOBAL_PPO = PPO()\n    UPDATE_EVENT, ROLLING_EVENT = threading.Event(), threading.Event()\n    UPDATE_EVENT.clear()    # no update now\n    ROLLING_EVENT.set()     # start to roll out\n    workers = [Worker(wid=i) for i in range(N_WORKER)]\n    \n    GLOBAL_UPDATE_COUNTER, GLOBAL_EP = 0, 0\n    GLOBAL_RUNNING_R = []\n    COORD = tf.train.Coordinator()\n    QUEUE = queue.Queue()\n    threads = []\n    for worker in workers:  # worker threads\n        t = threading.Thread(target=worker.work, args=())\n        t.start()\n        threads.append(t)\n    # add a PPO updating thread\n    threads.append(threading.Thread(target=GLOBAL_PPO.update,))\n    threads[-1].start()\n    COORD.join(threads)\n\n    # plot reward change and testing\n    plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R)\n 
   plt.xlabel('Episode'); plt.ylabel('Moving reward'); plt.ion(); plt.show()\n    env.set_fps(30)\n    while True:\n        s = env.reset()\n        for t in range(400):\n            env.render()\n            s = env.step(GLOBAL_PPO.choose_action(s))[0]"
  },
  {
    "path": "experiments/Robot_arm/arm_env.py",
    "content": "\"\"\"\nEnvironment for Robot Arm.\nYou can customize this script in a way you want.\n\nView more on [莫烦Python] : https://morvanzhou.github.io/tutorials/\n\n\nRequirement:\npyglet >= 1.2.4\nnumpy >= 1.12.1\n\"\"\"\nimport numpy as np\nimport pyglet\n\n\npyglet.clock.set_fps_limit(10000)\n\n\nclass ArmEnv(object):\n    action_bound = [-1, 1]\n    action_dim = 2\n    state_dim = 7\n    dt = .1  # refresh rate\n    arm1l = 100\n    arm2l = 100\n    viewer = None\n    viewer_xy = (400, 400)\n    get_point = False\n    mouse_in = np.array([False])\n    point_l = 15\n    grab_counter = 0\n\n    def __init__(self, mode='easy'):\n        # node1 (l, d_rad, x, y),\n        # node2 (l, d_rad, x, y)\n        self.mode = mode\n        self.arm_info = np.zeros((2, 4))\n        self.arm_info[0, 0] = self.arm1l\n        self.arm_info[1, 0] = self.arm2l\n        self.point_info = np.array([250, 303])\n        self.point_info_init = self.point_info.copy()\n        self.center_coord = np.array(self.viewer_xy)/2\n\n    def step(self, action):\n        # action = (node1 angular v, node2 angular v)\n        action = np.clip(action, *self.action_bound)\n        self.arm_info[:, 1] += action * self.dt\n        self.arm_info[:, 1] %= np.pi * 2\n\n        arm1rad = self.arm_info[0, 1]\n        arm2rad = self.arm_info[1, 1]\n        arm1dx_dy = np.array([self.arm_info[0, 0] * np.cos(arm1rad), self.arm_info[0, 0] * np.sin(arm1rad)])\n        arm2dx_dy = np.array([self.arm_info[1, 0] * np.cos(arm2rad), self.arm_info[1, 0] * np.sin(arm2rad)])\n        self.arm_info[0, 2:4] = self.center_coord + arm1dx_dy  # (x1, y1)\n        self.arm_info[1, 2:4] = self.arm_info[0, 2:4] + arm2dx_dy  # (x2, y2)\n\n        s, arm2_distance = self._get_state()\n        r = self._r_func(arm2_distance)\n\n        return s, r, self.get_point\n\n    def reset(self):\n        self.get_point = False\n        self.grab_counter = 0\n\n        if self.mode == 'hard':\n            pxy = 
np.clip(np.random.rand(2) * self.viewer_xy[0], 100, 300)\n            self.point_info[:] = pxy\n        else:\n            arm1rad, arm2rad = np.random.rand(2) * np.pi * 2\n            self.arm_info[0, 1] = arm1rad\n            self.arm_info[1, 1] = arm2rad\n            arm1dx_dy = np.array([self.arm_info[0, 0] * np.cos(arm1rad), self.arm_info[0, 0] * np.sin(arm1rad)])\n            arm2dx_dy = np.array([self.arm_info[1, 0] * np.cos(arm2rad), self.arm_info[1, 0] * np.sin(arm2rad)])\n            self.arm_info[0, 2:4] = self.center_coord + arm1dx_dy  # (x1, y1)\n            self.arm_info[1, 2:4] = self.arm_info[0, 2:4] + arm2dx_dy  # (x2, y2)\n\n            self.point_info[:] = self.point_info_init\n        return self._get_state()[0]\n\n    def render(self):\n        if self.viewer is None:\n            self.viewer = Viewer(*self.viewer_xy, self.arm_info, self.point_info, self.point_l, self.mouse_in)\n        self.viewer.render()\n\n    def sample_action(self):\n        return np.random.uniform(*self.action_bound, size=self.action_dim)\n\n    def set_fps(self, fps=30):\n        pyglet.clock.set_fps_limit(fps)\n\n    def _get_state(self):\n        # return the distance (dx, dy) between arm finger point with blue point\n        arm_end = self.arm_info[:, 2:4]\n        t_arms = np.ravel(arm_end - self.point_info)\n        center_dis = (self.center_coord - self.point_info)/200\n        in_point = 1 if self.grab_counter > 0 else 0\n        return np.hstack([in_point, t_arms/200, center_dis,\n                          # arm1_distance_p, arm1_distance_b,\n                          ]), t_arms[-2:]\n\n    def _r_func(self, distance):\n        t = 50\n        abs_distance = np.sqrt(np.sum(np.square(distance)))\n        r = -abs_distance/200\n        if abs_distance < self.point_l and (not self.get_point):\n            r += 1.\n            self.grab_counter += 1\n            if self.grab_counter > t:\n                r += 10.\n                self.get_point = True\n        elif 
abs_distance > self.point_l:\n            self.grab_counter = 0\n            self.get_point = False\n        return r\n\n\nclass Viewer(pyglet.window.Window):\n    color = {\n        'background': [1]*3 + [1]\n    }\n    fps_display = pyglet.clock.ClockDisplay()\n    bar_thc = 5\n\n    def __init__(self, width, height, arm_info, point_info, point_l, mouse_in):\n        super(Viewer, self).__init__(width, height, resizable=False, caption='Arm', vsync=False)  # vsync=False to not use the monitor FPS\n        self.set_location(x=80, y=10)\n        pyglet.gl.glClearColor(*self.color['background'])\n\n        self.arm_info = arm_info\n        self.point_info = point_info\n        self.mouse_in = mouse_in\n        self.point_l = point_l\n\n        self.center_coord = np.array((min(width, height)/2, ) * 2)\n        self.batch = pyglet.graphics.Batch()\n\n        arm1_box, arm2_box, point_box = [0]*8, [0]*8, [0]*8\n        c1, c2, c3 = (249, 86, 86)*4, (86, 109, 249)*4, (249, 39, 65)*4\n        self.point = self.batch.add(4, pyglet.gl.GL_QUADS, None, ('v2f', point_box), ('c3B', c2))\n        self.arm1 = self.batch.add(4, pyglet.gl.GL_QUADS, None, ('v2f', arm1_box), ('c3B', c1))\n        self.arm2 = self.batch.add(4, pyglet.gl.GL_QUADS, None, ('v2f', arm2_box), ('c3B', c1))\n\n    def render(self):\n        pyglet.clock.tick()\n        self._update_arm()\n        self.switch_to()\n        self.dispatch_events()\n        self.dispatch_event('on_draw')\n        self.flip()\n\n    def on_draw(self):\n        self.clear()\n        self.batch.draw()\n        # self.fps_display.draw()\n\n    def _update_arm(self):\n        point_l = self.point_l\n        point_box = (self.point_info[0] - point_l, self.point_info[1] - point_l,\n                     self.point_info[0] + point_l, self.point_info[1] - point_l,\n                     self.point_info[0] + point_l, self.point_info[1] + point_l,\n                     self.point_info[0] - point_l, self.point_info[1] + point_l)\n        
self.point.vertices = point_box\n\n        arm1_coord = (*self.center_coord, *(self.arm_info[0, 2:4]))  # (x0, y0, x1, y1)\n        arm2_coord = (*(self.arm_info[0, 2:4]), *(self.arm_info[1, 2:4]))  # (x1, y1, x2, y2)\n        arm1_thick_rad = np.pi / 2 - self.arm_info[0, 1]\n        x01, y01 = arm1_coord[0] - np.cos(arm1_thick_rad) * self.bar_thc, arm1_coord[1] + np.sin(\n            arm1_thick_rad) * self.bar_thc\n        x02, y02 = arm1_coord[0] + np.cos(arm1_thick_rad) * self.bar_thc, arm1_coord[1] - np.sin(\n            arm1_thick_rad) * self.bar_thc\n        x11, y11 = arm1_coord[2] + np.cos(arm1_thick_rad) * self.bar_thc, arm1_coord[3] - np.sin(\n            arm1_thick_rad) * self.bar_thc\n        x12, y12 = arm1_coord[2] - np.cos(arm1_thick_rad) * self.bar_thc, arm1_coord[3] + np.sin(\n            arm1_thick_rad) * self.bar_thc\n        arm1_box = (x01, y01, x02, y02, x11, y11, x12, y12)\n        arm2_thick_rad = np.pi / 2 - self.arm_info[1, 1]\n        x11_, y11_ = arm2_coord[0] + np.cos(arm2_thick_rad) * self.bar_thc, arm2_coord[1] - np.sin(\n            arm2_thick_rad) * self.bar_thc\n        x12_, y12_ = arm2_coord[0] - np.cos(arm2_thick_rad) * self.bar_thc, arm2_coord[1] + np.sin(\n            arm2_thick_rad) * self.bar_thc\n        x21, y21 = arm2_coord[2] - np.cos(arm2_thick_rad) * self.bar_thc, arm2_coord[3] + np.sin(\n            arm2_thick_rad) * self.bar_thc\n        x22, y22 = arm2_coord[2] + np.cos(arm2_thick_rad) * self.bar_thc, arm2_coord[3] - np.sin(\n            arm2_thick_rad) * self.bar_thc\n        arm2_box = (x11_, y11_, x12_, y12_, x21, y21, x22, y22)\n        self.arm1.vertices = arm1_box\n        self.arm2.vertices = arm2_box\n\n    def on_key_press(self, symbol, modifiers):\n        if symbol == pyglet.window.key.UP:\n            self.arm_info[0, 1] += .1\n            print(self.arm_info[:, 2:4] - self.point_info)\n        elif symbol == pyglet.window.key.DOWN:\n            self.arm_info[0, 1] -= .1\n            
print(self.arm_info[:, 2:4] - self.point_info)\n        elif symbol == pyglet.window.key.LEFT:\n            self.arm_info[1, 1] += .1\n            print(self.arm_info[:, 2:4] - self.point_info)\n        elif symbol == pyglet.window.key.RIGHT:\n            self.arm_info[1, 1] -= .1\n            print(self.arm_info[:, 2:4] - self.point_info)\n        elif symbol == pyglet.window.key.Q:\n            pyglet.clock.set_fps_limit(1000)\n        elif symbol == pyglet.window.key.A:\n            pyglet.clock.set_fps_limit(30)\n\n    def on_mouse_motion(self, x, y, dx, dy):\n        self.point_info[:] = [x, y]\n\n    def on_mouse_enter(self, x, y):\n        self.mouse_in[0] = True\n\n    def on_mouse_leave(self, x, y):\n        self.mouse_in[0] = False\n\n\n\n"
  },
  {
    "path": "experiments/Solve_BipedalWalker/A3C.py",
    "content": "\"\"\"\nAsynchronous Advantage Actor Critic (A3C), Reinforcement Learning.\n\nThe BipedalWalker example.\n\nView more on [莫烦Python] : https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow 1.8.0\ngym 0.10.5\n\"\"\"\n\nimport multiprocessing\nimport threading\nimport tensorflow as tf\nimport numpy as np\nimport gym\nimport os\nimport shutil\n\n\nGAME = 'BipedalWalker-v2'\nOUTPUT_GRAPH = False\nLOG_DIR = './log'\nN_WORKERS = multiprocessing.cpu_count()\nMAX_GLOBAL_EP = 8000\nGLOBAL_NET_SCOPE = 'Global_Net'\nUPDATE_GLOBAL_ITER = 10\nGAMMA = 0.99\nENTROPY_BETA = 0.005\nLR_A = 0.00005    # learning rate for actor\nLR_C = 0.0001    # learning rate for critic\nGLOBAL_RUNNING_R = []\nGLOBAL_EP = 0\n\nenv = gym.make(GAME)\n\nN_S = env.observation_space.shape[0]\nN_A = env.action_space.shape[0]\nA_BOUND = [env.action_space.low, env.action_space.high]\ndel env\n\n\nclass ACNet(object):\n    def __init__(self, scope, globalAC=None):\n\n        if scope == GLOBAL_NET_SCOPE:   # get global network\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self._build_net()\n                self.a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n                self.c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n        else:   # local net, calculate losses\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_his = tf.placeholder(tf.float32, [None, N_A], 'A')\n                self.v_target = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\n\n                mu, sigma, self.v = self._build_net()\n\n                td = tf.subtract(self.v_target, self.v, name='TD_error')\n                with tf.name_scope('c_loss'):\n                    self.c_loss = tf.reduce_mean(tf.square(td))\n\n                with 
tf.name_scope('wrap_a_out'):\n                    self.test = sigma[0]\n                    mu, sigma = mu * A_BOUND[1], sigma + 1e-5\n\n                normal_dist = tf.contrib.distributions.Normal(mu, sigma)\n\n                with tf.name_scope('a_loss'):\n                    log_prob = normal_dist.log_prob(self.a_his)\n                    exp_v = log_prob * td\n                    entropy = normal_dist.entropy()  # encourage exploration\n                    self.exp_v = ENTROPY_BETA * entropy + exp_v\n                    self.a_loss = tf.reduce_mean(-self.exp_v)\n\n                with tf.name_scope('choose_a'):  # use local params to choose action\n                    self.A = tf.clip_by_value(tf.squeeze(normal_dist.sample(1)), *A_BOUND)\n                with tf.name_scope('local_grad'):\n                    self.a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n                    self.c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n                    self.a_grads = tf.gradients(self.a_loss, self.a_params)\n                    self.c_grads = tf.gradients(self.c_loss, self.c_params)\n\n            with tf.name_scope('sync'):\n                with tf.name_scope('pull'):\n                    self.pull_a_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.a_params, globalAC.a_params)]\n                    self.pull_c_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.c_params, globalAC.c_params)]\n                with tf.name_scope('push'):\n                    self.update_a_op = OPT_A.apply_gradients(zip(self.a_grads, globalAC.a_params))\n                    self.update_c_op = OPT_C.apply_gradients(zip(self.c_grads, globalAC.c_params))\n\n    def _build_net(self):\n        w_init = tf.contrib.layers.xavier_initializer()\n        with tf.variable_scope('actor'):\n            l_a = tf.layers.dense(self.s, 500, tf.nn.relu6, kernel_initializer=w_init, name='la')\n            l_a = 
tf.layers.dense(l_a, 300, tf.nn.relu6, kernel_initializer=w_init, name='la2')\n            mu = tf.layers.dense(l_a, N_A, tf.nn.tanh, kernel_initializer=w_init, name='mu')\n            sigma = tf.layers.dense(l_a, N_A, tf.nn.softplus, kernel_initializer=w_init, name='sigma')\n        with tf.variable_scope('critic'):\n            l_c = tf.layers.dense(self.s, 500, tf.nn.relu6, kernel_initializer=w_init, name='lc')\n            l_c = tf.layers.dense(l_c, 300, tf.nn.relu6, kernel_initializer=w_init, name='lc2')\n            v = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='v')  # state value\n        return mu, sigma, v\n\n    def update_global(self, feed_dict):  # run by a local\n        _, _, t = SESS.run([self.update_a_op, self.update_c_op, self.test], feed_dict)  # local grads applies to global net\n        return t\n\n    def pull_global(self):  # run by a local\n        SESS.run([self.pull_a_params_op, self.pull_c_params_op])\n\n    def choose_action(self, s):  # run by a local\n        s = s[np.newaxis, :]\n        return SESS.run(self.A, {self.s: s})\n\n\nclass Worker(object):\n    def __init__(self, name, globalAC):\n        self.env = gym.make(GAME)\n        self.name = name\n        self.AC = ACNet(name, globalAC)\n\n    def work(self):\n        global GLOBAL_RUNNING_R, GLOBAL_EP\n        total_step = 1\n        buffer_s, buffer_a, buffer_r = [], [], []\n        while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP:\n            s = self.env.reset()\n            ep_r = 0\n            while True:\n                if self.name == 'W_0' and total_step % 30 == 0:\n                    self.env.render()\n                a = self.AC.choose_action(s)\n                s_, r, done, info = self.env.step(a)\n                if r == -100: r = -2\n\n                ep_r += r\n                buffer_s.append(s)\n                buffer_a.append(a)\n                buffer_r.append(r)\n\n                if total_step % UPDATE_GLOBAL_ITER == 0 or done:   # 
update global and assign to local net\n                    if done:\n                        v_s_ = 0   # terminal\n                    else:\n                        v_s_ = SESS.run(self.AC.v, {self.AC.s: s_[np.newaxis, :]})[0, 0]\n                    buffer_v_target = []\n                    for r in buffer_r[::-1]:    # reverse buffer r\n                        v_s_ = r + GAMMA * v_s_\n                        buffer_v_target.append(v_s_)\n                    buffer_v_target.reverse()\n\n                    buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.vstack(buffer_a), np.vstack(buffer_v_target)\n                    feed_dict = {\n                        self.AC.s: buffer_s,\n                        self.AC.a_his: buffer_a,\n                        self.AC.v_target: buffer_v_target,\n                    }\n                    test = self.AC.update_global(feed_dict)\n                    buffer_s, buffer_a, buffer_r = [], [], []\n                    self.AC.pull_global()\n\n                s = s_\n                total_step += 1\n                if done:\n                    achieve = '| Achieve' if self.env.unwrapped.hull.position[0] >= 88 else '| -------'\n                    if len(GLOBAL_RUNNING_R) == 0:  # record running episode reward\n                        GLOBAL_RUNNING_R.append(ep_r)\n                    else:\n                        GLOBAL_RUNNING_R.append(0.95 * GLOBAL_RUNNING_R[-1] + 0.05 * ep_r)\n                    print(\n                        self.name,\n                        \"Ep:\", GLOBAL_EP,\n                        achieve,\n                        \"| Pos: %i\" % self.env.unwrapped.hull.position[0],\n                        \"| RR: %.1f\" % GLOBAL_RUNNING_R[-1],\n                        '| EpR: %.1f' % ep_r,\n                        '| var:', test,\n                    )\n                    GLOBAL_EP += 1\n                    break\n\nif __name__ == \"__main__\":\n    SESS = tf.Session()\n\n    with 
tf.device(\"/cpu:0\"):\n        OPT_A = tf.train.RMSPropOptimizer(LR_A, name='RMSPropA')\n        OPT_C = tf.train.RMSPropOptimizer(LR_C, name='RMSPropC')\n        GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE)  # we only need its params\n        workers = []\n        # Create workers\n        for i in range(N_WORKERS):\n            i_name = 'W_%i' % i   # worker name\n            workers.append(Worker(i_name, GLOBAL_AC))\n\n    COORD = tf.train.Coordinator()\n    SESS.run(tf.global_variables_initializer())\n\n    worker_threads = []\n    for worker in workers:\n        # pass the bound method directly; a lambda here would late-bind `worker`\n        t = threading.Thread(target=worker.work)\n        t.start()\n        worker_threads.append(t)\n    COORD.join(worker_threads)\n    import matplotlib.pyplot as plt\n    plt.plot(GLOBAL_RUNNING_R)\n    plt.xlabel('episode')\n    plt.ylabel('global running reward')\n    plt.show()\n\n"
  },
  {
    "path": "experiments/Solve_BipedalWalker/A3C_rnn.py",
    "content": "\"\"\"\nAsynchronous Advantage Actor Critic (A3C), Reinforcement Learning.\n\nThe BipedalWalker example.\n\nView more on [莫烦Python] : https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow 1.8.0\ngym 0.10.5\n\"\"\"\n\nimport multiprocessing\nimport threading\nimport tensorflow as tf\nimport numpy as np\nimport gym\nimport os\nimport shutil\n\n\nGAME = 'BipedalWalker-v2'\nOUTPUT_GRAPH = False\nLOG_DIR = './log'\nN_WORKERS = multiprocessing.cpu_count()\nMAX_GLOBAL_EP = 8000\nGLOBAL_NET_SCOPE = 'Global_Net'\nUPDATE_GLOBAL_ITER = 10\nGAMMA = 0.9\nENTROPY_BETA = 0.001\nLR_A = 0.00002    # learning rate for actor\nLR_C = 0.0001    # learning rate for critic\nGLOBAL_RUNNING_R = []\nGLOBAL_EP = 0\n\nenv = gym.make(GAME)\n\nN_S = env.observation_space.shape[0]\nN_A = env.action_space.shape[0]\nA_BOUND = [env.action_space.low, env.action_space.high]\ndel env\n\n\nclass ACNet(object):\n    def __init__(self, scope, globalAC=None):\n\n        if scope == GLOBAL_NET_SCOPE:   # get global network\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self._build_net()\n                self.a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n                self.c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n        else:   # local net, calculate losses\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_his = tf.placeholder(tf.float32, [None, N_A], 'A')\n                self.v_target = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\n\n                mu, sigma, self.v = self._build_net()\n\n                td = tf.subtract(self.v_target, self.v, name='TD_error')\n                with tf.name_scope('c_loss'):\n                    self.c_loss = tf.reduce_mean(tf.square(td))\n\n                with 
tf.name_scope('wrap_a_out'):\n                    self.test = sigma[0]\n                    mu, sigma = mu * A_BOUND[1], sigma + 1e-5\n\n                normal_dist = tf.contrib.distributions.Normal(mu, sigma)\n\n                with tf.name_scope('a_loss'):\n                    log_prob = normal_dist.log_prob(self.a_his)\n                    exp_v = log_prob * td\n                    entropy = normal_dist.entropy()  # encourage exploration\n                    self.exp_v = ENTROPY_BETA * entropy + exp_v\n                    self.a_loss = tf.reduce_mean(-self.exp_v)\n\n                with tf.name_scope('choose_a'):  # use local params to choose action\n                    self.A = tf.clip_by_value(tf.squeeze(normal_dist.sample(1)), A_BOUND[0], A_BOUND[1])\n\n                with tf.name_scope('local_grad'):\n                    self.a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n                    self.c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n                    self.a_grads = tf.gradients(self.a_loss, self.a_params)\n                    self.c_grads = tf.gradients(self.c_loss, self.c_params)\n\n            with tf.name_scope('sync'):\n                with tf.name_scope('pull'):\n                    self.pull_a_params_op = [l_p.assign(g_p) for l_p, g_p in\n                                             zip(self.a_params, globalAC.a_params)]\n                    self.pull_c_params_op = [l_p.assign(g_p) for l_p, g_p in\n                                             zip(self.c_params, globalAC.c_params)]\n                with tf.name_scope('push'):\n                    self.update_a_op = OPT_A.apply_gradients(zip(self.a_grads, globalAC.a_params))\n                    self.update_c_op = OPT_C.apply_gradients(zip(self.c_grads, globalAC.c_params))\n\n    def _build_net(self):\n        w_init = tf.random_normal_initializer(0., .1)\n        with tf.variable_scope('critic'):  # only 
critic controls the rnn update\n            cell_size = 126\n            s = tf.expand_dims(self.s, axis=1,\n                               name='timely_input')  # [time_step, feature] => [time_step, batch, feature]\n            rnn_cell = tf.contrib.rnn.BasicRNNCell(cell_size)\n            self.init_state = rnn_cell.zero_state(batch_size=1, dtype=tf.float32)\n            outputs, self.final_state = tf.nn.dynamic_rnn(\n                cell=rnn_cell, inputs=s, initial_state=self.init_state, time_major=True)\n            cell_out = tf.reshape(outputs, [-1, cell_size], name='flatten_rnn_outputs')  # joined state representation\n            l_c = tf.layers.dense(cell_out, 512, tf.nn.relu6, kernel_initializer=w_init, name='lc')\n            v = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='v')  # state value\n\n        with tf.variable_scope('actor'):  # state representation is based on critic\n            cell_out = tf.stop_gradient(cell_out, name='c_cell_out')  # from what critic think it is\n            l_a = tf.layers.dense(cell_out, 512, tf.nn.relu6, kernel_initializer=w_init, name='la')\n            mu = tf.layers.dense(l_a, N_A, tf.nn.tanh, kernel_initializer=w_init, name='mu')\n            sigma = tf.layers.dense(l_a, N_A, tf.nn.softplus, kernel_initializer=w_init, name='sigma') # restrict variance\n        return mu, sigma, v\n\n    def update_global(self, feed_dict):  # run by a local\n        _, _, t = SESS.run([self.update_a_op, self.update_c_op, self.test], feed_dict)  # local grads applies to global net\n        return t\n\n    def pull_global(self):  # run by a local\n        SESS.run([self.pull_a_params_op, self.pull_c_params_op])\n\n    def choose_action(self, s, cell_state):  # run by a local\n        s = s[np.newaxis, :]\n        a, cell_state = SESS.run([self.A, self.final_state], {self.s: s, self.init_state: cell_state})\n        return a, cell_state\n\n\nclass Worker(object):\n    def __init__(self, name, globalAC):\n        self.env = 
gym.make(GAME)\n        self.name = name\n        self.AC = ACNet(name, globalAC)\n\n    def work(self):\n        global GLOBAL_RUNNING_R, GLOBAL_EP\n        total_step = 1\n        buffer_s, buffer_a, buffer_r = [], [], []\n        while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP:\n            s = self.env.reset()\n            ep_r = 0\n            rnn_state = SESS.run(self.AC.init_state)  # zero rnn state at beginning\n            keep_state = rnn_state.copy()  # keep rnn state for updating global net\n            while True:\n                if self.name == 'W_0' and total_step % 30 == 0:\n                    self.env.render()\n\n                a, rnn_state_ = self.AC.choose_action(s, rnn_state)  # get the action and next rnn state\n                s_, r, done, info = self.env.step(a)\n                if r == -100: r = -2\n\n                ep_r += r\n                buffer_s.append(s)\n                buffer_a.append(a)\n                buffer_r.append(r)\n\n                if total_step % UPDATE_GLOBAL_ITER == 0 or done:  # update global and assign to local net\n                    if done:\n                        v_s_ = 0  # terminal\n                    else:\n                        v_s_ = SESS.run(self.AC.v, {self.AC.s: s_[np.newaxis, :], self.AC.init_state: rnn_state_})[\n                            0, 0]\n                    buffer_v_target = []\n                    for r in buffer_r[::-1]:  # reverse buffer r\n                        v_s_ = r + GAMMA * v_s_\n                        buffer_v_target.append(v_s_)\n                    buffer_v_target.reverse()\n\n                    buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.vstack(buffer_a), np.vstack(\n                        buffer_v_target)\n\n                    feed_dict = {\n                        self.AC.s: buffer_s,\n                        self.AC.a_his: buffer_a,\n                        self.AC.v_target: buffer_v_target,\n                        
self.AC.init_state: keep_state,\n                    }\n\n                    test = self.AC.update_global(feed_dict)\n                    buffer_s, buffer_a, buffer_r = [], [], []\n                    self.AC.pull_global()\n                    keep_state = rnn_state_.copy()  # replace the keep_state as the new initial rnn state_\n\n                s = s_\n                rnn_state = rnn_state_  # renew rnn state\n                total_step += 1\n\n                if done:\n                    achieve = '| Achieve' if self.env.unwrapped.hull.position[0] >= 88 else '| -------'\n                    if len(GLOBAL_RUNNING_R) == 0:  # record running episode reward\n                        GLOBAL_RUNNING_R.append(ep_r)\n                    else:\n                        GLOBAL_RUNNING_R.append(0.95 * GLOBAL_RUNNING_R[-1] + 0.05 * ep_r)\n                    print(\n                        self.name,\n                        \"Ep:\", GLOBAL_EP,\n                        achieve,\n                        \"| Pos: %i\" % self.env.unwrapped.hull.position[0],\n                        \"| RR: %.1f\" % GLOBAL_RUNNING_R[-1],\n                        '| EpR: %.1f' % ep_r,\n                        '| var:', test,\n                    )\n                    GLOBAL_EP += 1\n                    break\n\n\nif __name__ == \"__main__\":\n    SESS = tf.Session()\n\n    with tf.device(\"/cpu:0\"):\n        OPT_A = tf.train.RMSPropOptimizer(LR_A, name='RMSPropA', decay=0.95)\n        OPT_C = tf.train.RMSPropOptimizer(LR_C, name='RMSPropC', decay=0.95)\n        GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE)  # we only need its params\n        workers = []\n        # Create worker\n        for i in range(N_WORKERS):\n            i_name = 'W_%i' % i   # worker name\n            workers.append(Worker(i_name, GLOBAL_AC))\n\n    COORD = tf.train.Coordinator()\n    SESS.run(tf.global_variables_initializer())\n\n    if OUTPUT_GRAPH:\n        if os.path.exists(LOG_DIR):\n            shutil.rmtree(LOG_DIR)\n     
   tf.summary.FileWriter(LOG_DIR, SESS.graph)\n\n    worker_threads = []\n    for worker in workers:\n        t = threading.Thread(target=worker.work)\n        t.start()\n        worker_threads.append(t)\n    COORD.join(worker_threads)\n    import matplotlib.pyplot as plt\n    plt.plot(GLOBAL_RUNNING_R)\n    plt.xlabel('episode')\n    plt.ylabel('global running reward')\n    plt.show()"
  },
  {
    "path": "experiments/Solve_BipedalWalker/DDPG.py",
    "content": "import tensorflow as tf\nimport numpy as np\nimport gym\nimport os\nimport shutil\n\nnp.random.seed(1)\ntf.set_random_seed(1)\n\nMAX_EPISODES = 2000\nLR_A = 0.0005  # learning rate for actor\nLR_C = 0.0005  # learning rate for critic\nGAMMA = 0.999  # reward discount\nREPLACE_ITER_A = 1700\nREPLACE_ITER_C = 1500\nMEMORY_CAPACITY = 200000\nBATCH_SIZE = 32\nDISPLAY_THRESHOLD = 100  # display until the running reward > 100\nDATA_PATH = './data'\nLOAD_MODEL = False\nSAVE_MODEL_ITER = 100000\nRENDER = False\nOUTPUT_GRAPH = False\nENV_NAME = 'BipedalWalker-v2'\n\nGLOBAL_STEP = tf.Variable(0, trainable=False)\nINCREASE_GS = GLOBAL_STEP.assign(tf.add(GLOBAL_STEP, 1))\nLR_A = tf.train.exponential_decay(LR_A, GLOBAL_STEP, 10000, .97, staircase=True)\nLR_C = tf.train.exponential_decay(LR_C, GLOBAL_STEP, 10000, .97, staircase=True)\nEND_POINT = (200 - 10) * (14/30)    # from game\n\nenv = gym.make(ENV_NAME)\nenv.seed(1)\n\nSTATE_DIM = env.observation_space.shape[0]  # 24\nACTION_DIM = env.action_space.shape[0]  # 4\nACTION_BOUND = env.action_space.high    # [1, 1, 1, 1]\n\n# all placeholder for tf\nwith tf.name_scope('S'):\n    S = tf.placeholder(tf.float32, shape=[None, STATE_DIM], name='s')\nwith tf.name_scope('R'):\n    R = tf.placeholder(tf.float32, [None, 1], name='r')\nwith tf.name_scope('S_'):\n    S_ = tf.placeholder(tf.float32, shape=[None, STATE_DIM], name='s_')\n\n###############################  Actor  ####################################\n\nclass Actor(object):\n    def __init__(self, sess, action_dim, action_bound, learning_rate, t_replace_iter):\n        self.sess = sess\n        self.a_dim = action_dim\n        self.action_bound = action_bound\n        self.lr = learning_rate\n        self.t_replace_iter = t_replace_iter\n        self.t_replace_counter = 0\n\n        with tf.variable_scope('Actor'):\n            # input s, output a\n            self.a = self._build_net(S, scope='eval_net', trainable=True)\n\n            # input s_, output a, get 
a_ for critic\n            self.a_ = self._build_net(S_, scope='target_net', trainable=False)\n\n        self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval_net')\n        self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target_net')\n\n    def _build_net(self, s, scope, trainable):\n        with tf.variable_scope(scope):\n            init_w = tf.random_normal_initializer(0., 0.01)\n            init_b = tf.constant_initializer(0.01)\n            net = tf.layers.dense(s, 500, activation=tf.nn.relu,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l1', trainable=trainable)\n            net = tf.layers.dense(net, 200, activation=tf.nn.relu,\n                                  kernel_initializer=init_w, bias_initializer=init_b, name='l2', trainable=trainable)\n\n            with tf.variable_scope('a'):\n                actions = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, kernel_initializer=init_w,\n                                          bias_initializer=init_b, name='a', trainable=trainable)\n                scaled_a = tf.multiply(actions, self.action_bound, name='scaled_a')  # Scale output to -action_bound to action_bound\n        return scaled_a\n\n    def learn(self, s):  # batch update\n        self.sess.run(self.train_op, feed_dict={S: s})\n        if self.t_replace_counter % self.t_replace_iter == 0:\n            self.sess.run([tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)])\n        self.t_replace_counter += 1\n\n    def choose_action(self, s):\n        s = s[np.newaxis, :]    # single state\n        return self.sess.run(self.a, feed_dict={S: s})[0]  # single action\n\n    def add_grad_to_graph(self, a_grads):\n        with tf.variable_scope('policy_grads'):\n            # ys = policy;\n            # xs = policy's parameters;\n            # self.a_grads = the gradients of the policy to get more Q\n            # tf.gradients will 
calculate dys/dxs with a initial gradients for ys, so this is dq/da * da/dparams\n            self.policy_grads_and_vars = tf.gradients(ys=self.a, xs=self.e_params, grad_ys=a_grads)\n\n        with tf.variable_scope('A_train'):\n            opt = tf.train.RMSPropOptimizer(-self.lr)  # (- learning rate) for ascent policy\n            self.train_op = opt.apply_gradients(zip(self.policy_grads_and_vars, self.e_params), global_step=GLOBAL_STEP)\n\n\n###############################  Critic  ####################################\n\nclass Critic(object):\n    def __init__(self, sess, state_dim, action_dim, learning_rate, gamma, t_replace_iter, a, a_):\n        self.sess = sess\n        self.s_dim = state_dim\n        self.a_dim = action_dim\n        self.lr = learning_rate\n        self.gamma = gamma\n        self.t_replace_iter = t_replace_iter\n        self.t_replace_counter = 0\n\n        with tf.variable_scope('Critic'):\n            # Input (s, a), output q\n            self.a = a\n            self.q = self._build_net(S, self.a, 'eval_net', trainable=True)\n\n            # Input (s_, a_), output q_ for q_target\n            self.q_ = self._build_net(S_, a_, 'target_net', trainable=False)    # target_q is based on a_ from Actor's target_net\n\n            self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval_net')\n            self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target_net')\n\n        with tf.variable_scope('target_q'):\n            self.target_q = R + self.gamma * self.q_\n\n        with tf.variable_scope('abs_TD'):\n            self.abs_td = tf.abs(self.target_q - self.q)\n        self.ISWeights = tf.placeholder(tf.float32, [None, 1], name='IS_weights')\n        with tf.variable_scope('TD_error'):\n            self.loss = tf.reduce_mean(self.ISWeights * tf.squared_difference(self.target_q, self.q))\n\n        with tf.variable_scope('C_train'):\n            self.train_op = 
tf.train.AdamOptimizer(self.lr).minimize(self.loss, global_step=GLOBAL_STEP)\n\n        with tf.variable_scope('a_grad'):\n            self.a_grads = tf.gradients(self.q, a)[0]   # tensor of gradients of each sample (None, a_dim)\n\n    def _build_net(self, s, a, scope, trainable):\n        with tf.variable_scope(scope):\n            init_w = tf.random_normal_initializer(0., 0.01)\n            init_b = tf.constant_initializer(0.01)\n\n            with tf.variable_scope('l1'):\n                n_l1 = 700\n                # combine the action and states together in this way\n                w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], initializer=init_w, trainable=trainable)\n                w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], initializer=init_w, trainable=trainable)\n                b1 = tf.get_variable('b1', [1, n_l1], initializer=init_b, trainable=trainable)\n                net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)\n            with tf.variable_scope('l2'):\n                net = tf.layers.dense(net, 20, activation=tf.nn.relu, kernel_initializer=init_w,\n                                      bias_initializer=init_b, name='l2', trainable=trainable)\n            with tf.variable_scope('q'):\n                q = tf.layers.dense(net, 1, kernel_initializer=init_w, bias_initializer=init_b, trainable=trainable)   # Q(s,a)\n        return q\n\n    def learn(self, s, a, r, s_, ISW):\n        _, abs_td = self.sess.run([self.train_op, self.abs_td], feed_dict={S: s, self.a: a, R: r, S_: s_, self.ISWeights: ISW})\n        if self.t_replace_counter % self.t_replace_iter == 0:\n            self.sess.run([tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)])\n        self.t_replace_counter += 1\n        return abs_td\n\n\nclass SumTree(object):\n    \"\"\"\n    This SumTree code is modified version and the original code is from:\n    https://github.com/jaara/AI-blog/blob/master/SumTree.py\n\n    Story the data with it priority 
in tree and data frameworks.\n    \"\"\"\n    data_pointer = 0\n\n    def __init__(self, capacity):\n        self.capacity = capacity  # for all priority values\n        self.tree = np.zeros(2 * capacity - 1)+1e-5\n        # [--------------Parent nodes-------------][-------leaves to recode priority-------]\n        #             size: capacity - 1                       size: capacity\n        self.data = np.zeros(capacity, dtype=object)  # for all transitions\n        # [--------------data frame-------------]\n        #             size: capacity\n\n    def add_new_priority(self, p, data):\n        leaf_idx = self.data_pointer + self.capacity - 1\n\n        self.data[self.data_pointer] = data  # update data_frame\n        self.update(leaf_idx, p)  # update tree_frame\n        self.data_pointer += 1\n        if self.data_pointer >= self.capacity:  # replace when exceed the capacity\n            self.data_pointer = 0\n\n    def update(self, tree_idx, p):\n        change = p - self.tree[tree_idx]\n\n        self.tree[tree_idx] = p\n        self._propagate_change(tree_idx, change)\n\n    def _propagate_change(self, tree_idx, change):\n        \"\"\"change the sum of priority value in all parent nodes\"\"\"\n        parent_idx = (tree_idx - 1) // 2\n        self.tree[parent_idx] += change\n        if parent_idx != 0:\n            self._propagate_change(parent_idx, change)\n\n    def get_leaf(self, lower_bound):\n        leaf_idx = self._retrieve(lower_bound)  # search the max leaf priority based on the lower_bound\n        data_idx = leaf_idx - self.capacity + 1\n        return [leaf_idx, self.tree[leaf_idx], self.data[data_idx]]\n\n    def _retrieve(self, lower_bound, parent_idx=0):\n        \"\"\"\n        Tree structure and array storage:\n\n        Tree index:\n             0         -> storing priority sum\n            / \\\n          1     2\n         / \\   / \\\n        3   4 5   6    -> storing priority for transitions\n\n        Array type for storing:\n       
 [0,1,2,3,4,5,6]\n        \"\"\"\n        left_child_idx = 2 * parent_idx + 1\n        right_child_idx = left_child_idx + 1\n\n        if left_child_idx >= len(self.tree):  # end search when no more child\n            return parent_idx\n\n        if self.tree[left_child_idx] == self.tree[right_child_idx]:\n            return self._retrieve(lower_bound, np.random.choice([left_child_idx, right_child_idx]))\n        if lower_bound <= self.tree[left_child_idx]:  # downward search, always search for a higher priority node\n            return self._retrieve(lower_bound, left_child_idx)\n        else:\n            return self._retrieve(lower_bound - self.tree[left_child_idx], right_child_idx)\n\n    @property\n    def root_priority(self):\n        return self.tree[0]  # the root\n\n\nclass Memory(object):  # stored as ( s, a, r, s_ ) in SumTree\n    \"\"\"\n    This SumTree code is modified version and the original code is from:\n    https://github.com/jaara/AI-blog/blob/master/Seaquest-DDQN-PER.py\n    \"\"\"\n    epsilon = 0.001  # small amount to avoid zero priority\n    alpha = 0.6  # [0~1] convert the importance of TD error to priority\n    beta = 0.4  # importance-sampling, from initial value increasing to 1\n    beta_increment_per_sampling = 1e-5  # annealing the bias\n    abs_err_upper = 1   # for stability refer to paper\n\n    def __init__(self, capacity):\n        self.tree = SumTree(capacity)\n\n    def store(self, error, transition):\n        p = self._get_priority(error)\n        self.tree.add_new_priority(p, transition)\n\n    def prio_sample(self, n):\n        batch_idx, batch_memory, ISWeights = [], [], []\n        segment = self.tree.root_priority / n\n        self.beta = np.min([1, self.beta + self.beta_increment_per_sampling])  # max = 1\n\n        min_prob = np.min(self.tree.tree[-self.tree.capacity:]) / self.tree.root_priority\n        maxiwi = np.power(self.tree.capacity * min_prob, -self.beta)  # for later normalizing ISWeights\n        for i in 
range(n):\n            a = segment * i\n            b = segment * (i + 1)\n            lower_bound = np.random.uniform(a, b)\n            while True:\n                idx, p, data = self.tree.get_leaf(lower_bound)\n                if type(data) is int:\n                    i -= 1\n                    lower_bound = np.random.uniform(segment * i, segment * (i+1))\n                else:\n                    break\n            prob = p / self.tree.root_priority\n            ISWeights.append(self.tree.capacity * prob)\n            batch_idx.append(idx)\n            batch_memory.append(data)\n\n        ISWeights = np.vstack(ISWeights)\n        ISWeights = np.power(ISWeights, -self.beta) / maxiwi  # normalize\n        return batch_idx, np.vstack(batch_memory), ISWeights\n\n    def random_sample(self, n):\n        idx = np.random.randint(0, self.tree.capacity, size=n, dtype=np.int)\n        return np.vstack(self.tree.data[idx])\n\n    def update(self, idx, error):\n        p = self._get_priority(error)\n        self.tree.update(idx, p)\n\n    def _get_priority(self, error):\n        error += self.epsilon   # avoid 0\n        clipped_error = np.clip(error, 0, self.abs_err_upper)\n        return np.power(clipped_error, self.alpha)\n\n\nsess = tf.Session()\n\n# Create actor and critic.\nactor = Actor(sess, ACTION_DIM, ACTION_BOUND, LR_A, REPLACE_ITER_A)\ncritic = Critic(sess, STATE_DIM, ACTION_DIM, LR_C, GAMMA, REPLACE_ITER_C, actor.a, actor.a_)\nactor.add_grad_to_graph(critic.a_grads)\n\nM = Memory(MEMORY_CAPACITY)\n\nsaver = tf.train.Saver(max_to_keep=100)\n\nif LOAD_MODEL:\n    all_ckpt = tf.train.get_checkpoint_state('./data', 'checkpoint').all_model_checkpoint_paths\n    saver.restore(sess, all_ckpt[-1])\nelse:\n    if os.path.isdir(DATA_PATH): shutil.rmtree(DATA_PATH)\n    os.mkdir(DATA_PATH)\n    sess.run(tf.global_variables_initializer())\n\nif OUTPUT_GRAPH:\n    tf.summary.FileWriter('logs', graph=sess.graph)\n\nvar = 3  # control exploration\nvar_min = 0.01\n\nfor 
i_episode in range(MAX_EPISODES):\n    # s = (hull angle speed, angular velocity, horizontal speed, vertical speed, position of joints and joints angular speed, legs contact with ground, and 10 lidar rangefinder measurements.)\n    s = env.reset()\n    ep_r = 0\n    while True:\n        if RENDER:\n            env.render()\n        a = actor.choose_action(s)\n        a = np.clip(np.random.normal(a, var), -1, 1)    # add randomness to action selection for exploration\n        s_, r, done, _ = env.step(a)    # r = total 300+ points up to the far end. If the robot falls, it gets -100.\n\n        if r == -100: r = -2\n        ep_r += r\n\n        transition = np.hstack((s, a, [r], s_))\n        max_p = np.max(M.tree.tree[-M.tree.capacity:])\n        M.store(max_p, transition)\n\n        if GLOBAL_STEP.eval(sess) > MEMORY_CAPACITY/20:\n            var = max([var*0.9999, var_min])  # decay the action randomness\n            tree_idx, b_M, ISWeights = M.prio_sample(BATCH_SIZE)    # for critic update\n            b_s = b_M[:, :STATE_DIM]\n            b_a = b_M[:, STATE_DIM: STATE_DIM + ACTION_DIM]\n            b_r = b_M[:, -STATE_DIM - 1: -STATE_DIM]\n            b_s_ = b_M[:, -STATE_DIM:]\n\n            abs_td = critic.learn(b_s, b_a, b_r, b_s_, ISWeights)\n            actor.learn(b_s)\n            for i in range(len(tree_idx)):  # update priority\n                idx = tree_idx[i]\n                M.update(idx, abs_td[i])\n        if GLOBAL_STEP.eval(sess) % SAVE_MODEL_ITER == 0:\n            ckpt_path = os.path.join(DATA_PATH, 'DDPG.ckpt')\n            save_path = saver.save(sess, ckpt_path, global_step=GLOBAL_STEP, write_meta_graph=False)\n            print(\"\\nSave Model %s\\n\" % save_path)\n\n        if done:\n            if \"running_r\" not in globals():\n                running_r = ep_r\n            else:\n                running_r = 0.95*running_r + 0.05*ep_r\n            if running_r > DISPLAY_THRESHOLD: RENDER = True\n            else: RENDER = False\n\n      
      done = '| Achieve ' if env.unwrapped.hull.position[0] >= END_POINT else '| -----'\n            print('Episode:', i_episode,\n                  done,\n                  '| Running_r: %i' % int(running_r),\n                  '| Epi_r: %.2f' % ep_r,\n                  '| Exploration: %.3f' % var,\n                  '| Pos: %.i' % int(env.unwrapped.hull.position[0]),\n                  '| LR_A: %.6f' % sess.run(LR_A),\n                  '| LR_C: %.6f' % sess.run(LR_C),\n                  )\n            break\n\n        s = s_\n        sess.run(INCREASE_GS)"
  },
  {
    "path": "experiments/Solve_LunarLander/A3C.py",
    "content": "\"\"\"\nAsynchronous Advantage Actor Critic (A3C) with a discrete action space, Reinforcement Learning.\n\nThe LunarLander example. Convergence is expected in principle, but the environment is difficult and this code rarely converges.\n\nView more on [莫烦Python] : https://morvanzhou.github.io/tutorials/\n\nUsing:\ntensorflow 1.0\ngym 0.8.0\n\"\"\"\n\nimport multiprocessing\nimport threading\nimport tensorflow as tf\nimport numpy as np\nimport gym\nimport os\nimport shutil\nimport matplotlib.pyplot as plt\n\n\nGAME = 'LunarLander-v2'\nOUTPUT_GRAPH = False\nLOG_DIR = './log'\nN_WORKERS = multiprocessing.cpu_count()\nMAX_GLOBAL_EP = 5000\nGLOBAL_NET_SCOPE = 'Global_Net'\nUPDATE_GLOBAL_ITER = 5\nGAMMA = 0.99\nENTROPY_BETA = 0.001   # not useful in this case\nLR_A = 0.0005    # learning rate for actor\nLR_C = 0.001    # learning rate for critic\nGLOBAL_RUNNING_R = []\nGLOBAL_EP = 0\n\nenv = gym.make(GAME)\n\nN_S = env.observation_space.shape[0]\nN_A = env.action_space.n\ndel env\n\n\nclass ACNet(object):\n    def __init__(self, scope, globalAC=None):\n        if scope == GLOBAL_NET_SCOPE:   # get global network\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self._build_net(N_A)\n                self.a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n                self.c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n        else:   # local net, calculate losses\n            with tf.variable_scope(scope):\n                self.s = tf.placeholder(tf.float32, [None, N_S], 'S')\n                self.a_his = tf.placeholder(tf.int32, [None, ], 'A')\n                self.v_target = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\n\n                self.a_prob, self.v = self._build_net(N_A)\n\n                td = tf.subtract(self.v_target, self.v, name='TD_error')\n                with tf.name_scope('c_loss'):\n                   
 self.c_loss = tf.reduce_mean(tf.square(td))\n\n                with tf.name_scope('a_loss'):\n                    log_prob = tf.reduce_sum(tf.log(self.a_prob) * tf.one_hot(self.a_his, N_A, dtype=tf.float32), axis=1, keep_dims=True)\n                    exp_v = log_prob * td\n                    entropy = -tf.reduce_sum(self.a_prob * tf.log(self.a_prob), axis=1, keep_dims=True)  # encourage exploration\n                    self.exp_v = ENTROPY_BETA * entropy + exp_v\n                    self.a_loss = tf.reduce_mean(-self.exp_v)\n\n                with tf.name_scope('local_grad'):\n                    self.a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\n                    self.c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\n                    self.a_grads = tf.gradients(self.a_loss, self.a_params)\n                    self.c_grads = tf.gradients(self.c_loss, self.c_params)\n\n            with tf.name_scope('sync'):\n                with tf.name_scope('pull'):\n                    self.pull_a_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.a_params, globalAC.a_params)]\n                    self.pull_c_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.c_params, globalAC.c_params)]\n                with tf.name_scope('push'):\n                    self.update_a_op = OPT_A.apply_gradients(zip(self.a_grads, globalAC.a_params))\n                    self.update_c_op = OPT_C.apply_gradients(zip(self.c_grads, globalAC.c_params))\n\n    def _build_net(self, n_a):\n        w_init = tf.random_normal_initializer(0., .01)\n        with tf.variable_scope('critic'):\n            cell_size = 64\n            s = tf.expand_dims(self.s, axis=1,\n                               name='timely_input')  # [time_step, feature] => [time_step, batch, feature]\n            rnn_cell = tf.contrib.rnn.BasicRNNCell(cell_size)\n            self.init_state = rnn_cell.zero_state(batch_size=1, dtype=tf.float32)\n 
           outputs, self.final_state = tf.nn.dynamic_rnn(\n                cell=rnn_cell, inputs=s, initial_state=self.init_state, time_major=True)\n            cell_out = tf.reshape(outputs, [-1, cell_size], name='flatten_rnn_outputs')  # joined state representation\n            l_c = tf.layers.dense(cell_out, 200, tf.nn.relu6, kernel_initializer=w_init, name='lc')\n            v = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='v')  # state value\n        with tf.variable_scope('actor'):\n            cell_out = tf.stop_gradient(cell_out, name='c_cell_out')\n            l_a = tf.layers.dense(cell_out, 300, tf.nn.relu6, kernel_initializer=w_init, name='la')\n            a_prob = tf.layers.dense(l_a, n_a, tf.nn.softmax, kernel_initializer=w_init, name='ap')\n\n        return a_prob, v\n\n    def update_global(self, feed_dict):  # run by a local\n        SESS.run([self.update_a_op, self.update_c_op], feed_dict)  # local grads applies to global net\n\n    def pull_global(self):  # run by a local\n        SESS.run([self.pull_a_params_op, self.pull_c_params_op])\n\n    def choose_action(self, s, cell_state):  # run by a local\n        prob_weights, cell_state = SESS.run([self.a_prob, self.final_state], feed_dict={self.s: s[np.newaxis, :],\n                                                                            self.init_state: cell_state})\n        action = np.random.choice(range(prob_weights.shape[1]),\n                                  p=prob_weights.ravel())  # select action w.r.t the actions prob\n        return action, cell_state\n\n\nclass Worker(object):\n    def __init__(self, name, globalAC):\n        self.env = gym.make(GAME)\n        self.name = name\n        self.AC = ACNet(name, globalAC)\n\n    def work(self):\n        global GLOBAL_RUNNING_R, GLOBAL_EP\n        total_step = 1\n        r_scale = 100\n        buffer_s, buffer_a, buffer_r = [], [], []\n        while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP:\n            s = 
self.env.reset()\n            ep_r = 0\n            ep_t = 0\n            rnn_state = SESS.run(self.AC.init_state)  # zero rnn state at beginning\n            keep_state = rnn_state.copy()  # keep rnn state for updating global net\n            while True:\n                # if self.name == 'W_0' and total_step % 10 == 0:\n                #     self.env.render()\n                a, rnn_state_ = self.AC.choose_action(s, rnn_state)  # get the action and next rnn state\n                s_, r, done, info = self.env.step(a)\n                if r == -100: r = -10\n                ep_r += r\n                buffer_s.append(s)\n                buffer_a.append(a)\n                buffer_r.append(r/r_scale)\n\n                if total_step % UPDATE_GLOBAL_ITER == 0 or done:   # update global and assign to local net\n                    if done:\n                        v_s_ = 0   # terminal\n                    else:\n                        v_s_ = SESS.run(self.AC.v, {self.AC.s: s_[np.newaxis, :], self.AC.init_state: rnn_state_})[0,0]\n                    buffer_v_target = []\n                    for r in buffer_r[::-1]:    # reverse buffer r\n                        v_s_ = r + GAMMA * v_s_\n                        buffer_v_target.append(v_s_)\n                    buffer_v_target.reverse()\n\n                    buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.array(buffer_a), np.vstack(buffer_v_target)\n                    feed_dict = {\n                        self.AC.s: buffer_s,\n                        self.AC.a_his: buffer_a,\n                        self.AC.v_target: buffer_v_target,\n                        self.AC.init_state: keep_state,\n                    }\n\n                    self.AC.update_global(feed_dict)\n\n                    buffer_s, buffer_a, buffer_r = [], [], []\n                    self.AC.pull_global()\n                    keep_state = rnn_state_.copy()  # replace the keep_state as the new initial rnn state_\n\n                s = 
s_\n                total_step += 1\n                rnn_state = rnn_state_  # renew rnn state\n                ep_t += 1\n                if done:\n                    if len(GLOBAL_RUNNING_R) == 0:  # record running episode reward\n                        GLOBAL_RUNNING_R.append(ep_r)\n                    else:\n                        GLOBAL_RUNNING_R.append(0.99 * GLOBAL_RUNNING_R[-1] + 0.01 * ep_r)\n                    if not self.env.unwrapped.lander.awake: solve = '| Landed'\n                    else: solve = '| ------'\n                    print(\n                        self.name,\n                        \"Ep:\", GLOBAL_EP,\n                        solve,\n                        \"| Ep_r: %i\" % GLOBAL_RUNNING_R[-1],\n                          )\n                    GLOBAL_EP += 1\n                    break\n\nif __name__ == \"__main__\":\n    SESS = tf.Session()\n\n    with tf.device(\"/cpu:0\"):\n        OPT_A = tf.train.RMSPropOptimizer(LR_A, name='RMSPropA')\n        OPT_C = tf.train.RMSPropOptimizer(LR_C, name='RMSPropC')\n        GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE)  # we only need its params\n        workers = []\n        # Create worker\n        for i in range(N_WORKERS):\n            i_name = 'W_%i' % i   # worker name\n            workers.append(Worker(i_name, GLOBAL_AC))\n\n    COORD = tf.train.Coordinator()\n    SESS.run(tf.global_variables_initializer())\n\n    if OUTPUT_GRAPH:\n        if os.path.exists(LOG_DIR):\n            shutil.rmtree(LOG_DIR)\n        tf.summary.FileWriter(LOG_DIR, SESS.graph)\n\n    worker_threads = []\n    for worker in workers:\n        # pass the bound method directly: a lambda would look up `worker` only when the\n        # thread runs, so a thread could end up running a later worker's loop\n        t = threading.Thread(target=worker.work)\n        t.start()\n        worker_threads.append(t)\n    COORD.join(worker_threads)\n\n    # GLOBAL_RUNNING_R holds one entry per finished episode\n    plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R)\n    plt.xlabel('episode')\n    plt.ylabel('Total moving reward')\n    plt.show()\n"
  },
  {
    "path": "experiments/Solve_LunarLander/DuelingDQNPrioritizedReplay.py",
    "content": "\"\"\"\nThe DQN improvement: Prioritized Experience Replay (based on https://arxiv.org/abs/1511.05952)\n\nView more on 莫烦Python: https://morvanzhou.github.io/tutorials/\n\nUsing:\nTensorflow: 1.0\n\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\n\nnp.random.seed(1)\ntf.set_random_seed(1)\n\n\nclass SumTree(object):\n    \"\"\"\n    This SumTree code is modified version and the original code is from:\n    https://github.com/jaara/AI-blog/blob/master/SumTree.py\n\n    Story the data with it priority in tree and data frameworks.\n    \"\"\"\n    data_pointer = 0\n\n    def __init__(self, capacity):\n        self.capacity = capacity  # for all priority values\n        self.tree = np.zeros(2 * capacity - 1)\n        # [--------------Parent nodes-------------][-------leaves to recode priority-------]\n        #             size: capacity - 1                       size: capacity\n        self.data = np.zeros(capacity, dtype=object)  # for all transitions\n        # [--------------data frame-------------]\n        #             size: capacity\n\n    def add_new_priority(self, p, data):\n        leaf_idx = self.data_pointer + self.capacity - 1\n\n        self.data[self.data_pointer] = data  # update data_frame\n        self.update(leaf_idx, p)  # update tree_frame\n        self.data_pointer += 1\n        if self.data_pointer >= self.capacity:  # replace when exceed the capacity\n            self.data_pointer = 0\n\n    def update(self, tree_idx, p):\n        change = p - self.tree[tree_idx]\n\n        self.tree[tree_idx] = p\n        self._propagate_change(tree_idx, change)\n\n    def _propagate_change(self, tree_idx, change):\n        \"\"\"change the sum of priority value in all parent nodes\"\"\"\n        parent_idx = (tree_idx - 1) // 2\n        self.tree[parent_idx] += change\n        if parent_idx != 0:\n            self._propagate_change(parent_idx, change)\n\n    def get_leaf(self, lower_bound):\n        leaf_idx = self._retrieve(lower_bound)  
# search the max leaf priority based on the lower_bound\n        data_idx = leaf_idx - self.capacity + 1\n        return [leaf_idx, self.tree[leaf_idx], self.data[data_idx]]\n\n    def _retrieve(self, lower_bound, parent_idx=0):\n        \"\"\"\n        Tree structure and array storage:\n\n        Tree index:\n             0         -> storing priority sum\n            / \\\n          1     2\n         / \\   / \\\n        3   4 5   6    -> storing priority for transitions\n\n        Array type for storing:\n        [0,1,2,3,4,5,6]\n        \"\"\"\n        left_child_idx = 2 * parent_idx + 1\n        right_child_idx = left_child_idx + 1\n\n        if left_child_idx >= len(self.tree):  # reached a leaf: no children left to search\n            return parent_idx\n\n        if self.tree[left_child_idx] == self.tree[right_child_idx]:\n            return self._retrieve(lower_bound, np.random.choice([left_child_idx, right_child_idx]))\n        if lower_bound <= self.tree[left_child_idx]:  # downward search, always search for a higher priority node\n            return self._retrieve(lower_bound, left_child_idx)\n        else:\n            return self._retrieve(lower_bound - self.tree[left_child_idx], right_child_idx)\n\n    @property\n    def root_priority(self):\n        return self.tree[0]  # the root\n\n\nclass Memory(object):  # stored as ( s, a, r, s_ ) in SumTree\n    \"\"\"\n    This Memory code is a modified version; the original code is from:\n    https://github.com/jaara/AI-blog/blob/master/Seaquest-DDQN-PER.py\n    \"\"\"\n    epsilon = 0.001  # small amount to avoid zero priority\n    alpha = 0.6  # [0~1] convert the importance of TD error to priority\n    beta = 0.4  # importance-sampling, from initial value increasing to 1\n    beta_increment_per_sampling = 1e-4  # annealing the bias\n    abs_err_upper = 1   # upper clip on the absolute error, for stability (see the paper)\n\n    def __init__(self, capacity):\n        self.tree = SumTree(capacity)\n\n    def store(self, error, transition):\n        p = 
self._get_priority(error)\n        self.tree.add_new_priority(p, transition)\n\n    def sample(self, n):\n        batch_idx, batch_memory, ISWeights = [], [], []\n        segment = self.tree.root_priority / n\n        self.beta = np.min([1, self.beta + self.beta_increment_per_sampling])  # max = 1\n\n        min_prob = np.min(self.tree.tree[-self.tree.capacity:]) / self.tree.root_priority\n        maxiwi = np.power(self.tree.capacity * min_prob, -self.beta)  # for later normalizing ISWeights\n        for i in range(n):\n            a = segment * i\n            b = segment * (i + 1)\n            lower_bound = np.random.uniform(a, b)\n            idx, p, data = self.tree.get_leaf(lower_bound)\n            prob = p / self.tree.root_priority\n            ISWeights.append(self.tree.capacity * prob)\n            batch_idx.append(idx)\n            batch_memory.append(data)\n\n        ISWeights = np.vstack(ISWeights)\n        ISWeights = np.power(ISWeights, -self.beta) / maxiwi  # normalize\n        return batch_idx, np.vstack(batch_memory), ISWeights\n\n    def update(self, idx, error):\n        p = self._get_priority(error)\n        self.tree.update(idx, p)\n\n    def _get_priority(self, error):\n        error += self.epsilon   # avoid 0\n        clipped_error = np.clip(error, 0, self.abs_err_upper)\n        return np.power(clipped_error, self.alpha)\n\n\nclass DuelingDQNPrioritizedReplay:\n    def __init__(\n            self,\n            n_actions,\n            n_features,\n            learning_rate=0.005,\n            reward_decay=0.9,\n            e_greedy=0.9,\n            replace_target_iter=500,\n            memory_size=10000,\n            batch_size=32,\n            e_greedy_increment=None,\n            hidden=[100, 50],\n            output_graph=False,\n            sess=None,\n    ):\n        self.n_actions = n_actions\n        self.n_features = n_features\n        self.lr = learning_rate\n        self.gamma = reward_decay\n        self.epsilon_max = e_greedy\n  
      self.replace_target_iter = replace_target_iter\n        self.memory_size = memory_size\n        self.batch_size = batch_size\n        self.hidden = hidden\n        self.epsilon_increment = e_greedy_increment\n        self.epsilon = 0.5 if e_greedy_increment is not None else self.epsilon_max\n\n        self.learn_step_counter = 0\n        self._build_net()\n        self.memory = Memory(capacity=memory_size)\n\n        if sess is None:\n            self.sess = tf.Session()\n            self.sess.run(tf.global_variables_initializer())\n        else:\n            self.sess = sess\n\n        if output_graph:\n            tf.summary.FileWriter(\"logs/\", self.sess.graph)\n\n        self.cost_his = []\n\n    def _build_net(self):\n        def build_layers(s, c_names, w_initializer, b_initializer):\n            for i, h in enumerate(self.hidden):\n                if i == 0:\n                    in_units, out_units, inputs = self.n_features, self.hidden[i], s\n                else:\n                    in_units, out_units, inputs = self.hidden[i-1], self.hidden[i], l\n                with tf.variable_scope('l%i' % i):\n                    w = tf.get_variable('w', [in_units, out_units], initializer=w_initializer, collections=c_names)\n                    b = tf.get_variable('b', [1, out_units], initializer=b_initializer, collections=c_names)\n                    l = tf.nn.relu(tf.matmul(inputs, w) + b)\n\n            with tf.variable_scope('Value'):\n                w = tf.get_variable('w', [self.hidden[-1], 1], initializer=w_initializer, collections=c_names)\n                b = tf.get_variable('b', [1, 1], initializer=b_initializer, collections=c_names)\n                self.V = tf.matmul(l, w) + b\n\n            with tf.variable_scope('Advantage'):\n                w = tf.get_variable('w', [self.hidden[-1], self.n_actions], initializer=w_initializer, collections=c_names)\n                b = tf.get_variable('b', [1, self.n_actions], initializer=b_initializer, 
collections=c_names)\n                self.A = tf.matmul(l, w) + b\n\n            with tf.variable_scope('Q'):\n                out = self.V + (self.A - tf.reduce_mean(self.A, axis=1, keep_dims=True))  # Q = V(s) + A(s,a)\n\n            # with tf.variable_scope('out'):\n            #     w = tf.get_variable('w', [self.hidden[-1], self.n_actions], initializer=w_initializer, collections=c_names)\n            #     b = tf.get_variable('b', [1, self.n_actions], initializer=b_initializer, collections=c_names)\n            #     out = tf.matmul(l, w) + b\n            return out\n\n        # ------------------ build evaluate_net ------------------\n        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input\n        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss\n        self.ISWeights = tf.placeholder(tf.float32, [None, 1], name='IS_weights')\n        with tf.variable_scope('eval_net'):\n            c_names, w_initializer, b_initializer = \\\n                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], \\\n                tf.random_normal_initializer(0., 0.01), tf.constant_initializer(0.01)  # config of layers\n\n            self.q_eval = build_layers(self.s, c_names, w_initializer, b_initializer)\n\n        with tf.variable_scope('loss'):\n            self.abs_errors = tf.abs(tf.reduce_sum(self.q_target - self.q_eval, axis=1))  # for updating Sumtree\n            self.loss = tf.reduce_mean(self.ISWeights * tf.squared_difference(self.q_target, self.q_eval))\n\n        with tf.variable_scope('train'):\n            self._train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)\n\n        # ------------------ build target_net ------------------\n        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')  # input\n        with tf.variable_scope('target_net'):\n            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]\n            
self.q_next = build_layers(self.s_, c_names, w_initializer, b_initializer)\n\n    def store_transition(self, s, a, r, s_):\n        transition = np.hstack((s, [a, r], s_))\n        max_p = np.max(self.memory.tree.tree[-self.memory.tree.capacity:])\n        self.memory.store(max_p, transition)\n\n    def choose_action(self, observation):\n        observation = observation[np.newaxis, :]\n        if np.random.uniform() < self.epsilon:\n            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})\n            action = np.argmax(actions_value)\n        else:\n            action = np.random.randint(0, self.n_actions)\n        return action\n\n    def _replace_target_params(self):\n        t_params = tf.get_collection('target_net_params')\n        e_params = tf.get_collection('eval_net_params')\n        self.sess.run([tf.assign(t, e) for t, e in zip(t_params, e_params)])\n\n    def learn(self):\n        if self.learn_step_counter % self.replace_target_iter == 0:\n            self._replace_target_params()\n\n        tree_idx, batch_memory, ISWeights = self.memory.sample(self.batch_size)\n\n        # double DQN\n        q_next, q_eval4next = self.sess.run(\n            [self.q_next, self.q_eval],\n            feed_dict={self.s_: batch_memory[:, -self.n_features:],  # next observation\n                       self.s: batch_memory[:, -self.n_features:]})  # next observation\n        q_eval = self.sess.run(self.q_eval, {self.s: batch_memory[:, :self.n_features]})\n\n        q_target = q_eval.copy()\n\n        batch_index = np.arange(self.batch_size, dtype=np.int32)\n        eval_act_index = batch_memory[:, self.n_features].astype(int)\n        reward = batch_memory[:, self.n_features + 1]\n        max_act4next = np.argmax(q_eval4next,\n                                 axis=1)  # the action that brings the highest value is evaluated by q_eval\n        selected_q_next = q_next[batch_index, max_act4next]  # Double DQN, select q_next depending on above 
actions\n\n        q_target[batch_index, eval_act_index] = reward + self.gamma * selected_q_next\n\n        # q_next, q_eval = self.sess.run(\n        #     [self.q_next, self.q_eval],\n        #     feed_dict={self.s_: batch_memory[:, -self.n_features:],\n        #                self.s: batch_memory[:, :self.n_features]})\n        #\n        # q_target = q_eval.copy()\n        # batch_index = np.arange(self.batch_size, dtype=np.int32)\n        # eval_act_index = batch_memory[:, self.n_features].astype(int)\n        # reward = batch_memory[:, self.n_features + 1]\n        #\n        # q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)\n\n        _, abs_errors, self.cost = self.sess.run([self._train_op, self.abs_errors, self.loss],\n                                                 feed_dict={self.s: batch_memory[:, :self.n_features],\n                                                            self.q_target: q_target,\n                                                            self.ISWeights: ISWeights})\n        for i in range(len(tree_idx)):  # update priority\n            idx = tree_idx[i]\n            self.memory.update(idx, abs_errors[i])\n\n        self.cost_his.append(self.cost)\n\n        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max\n        self.learn_step_counter += 1\n"
  },
  {
    "path": "experiments/Solve_LunarLander/run_LunarLander.py",
    "content": "\"\"\"\nDeep Q network,\n\nLunarLander-v2 example\n\nUsing:\nTensorflow: 1.0\ngym: 0.8.0\n\"\"\"\n\n\nimport gym\nfrom gym import wrappers\nfrom DuelingDQNPrioritizedReplay import DuelingDQNPrioritizedReplay\n\nenv = gym.make('LunarLander-v2')\n# env = env.unwrapped\nenv.seed(1)\n\nN_A = env.action_space.n\nN_S = env.observation_space.shape[0]\nMEMORY_CAPACITY = 50000\nTARGET_REP_ITER = 2000\nMAX_EPISODES = 900\nE_GREEDY = 0.95\nE_INCREMENT = 0.00001\nGAMMA = 0.99\nLR = 0.0001\nBATCH_SIZE = 32\nHIDDEN = [400, 400]\nRENDER = True\n\nRL = DuelingDQNPrioritizedReplay(\n    n_actions=N_A, n_features=N_S, learning_rate=LR, e_greedy=E_GREEDY, reward_decay=GAMMA,\n    hidden=HIDDEN, batch_size=BATCH_SIZE, replace_target_iter=TARGET_REP_ITER,\n    memory_size=MEMORY_CAPACITY, e_greedy_increment=E_INCREMENT,)\n\n\ntotal_steps = 0\nrunning_r = 0\nr_scale = 100\nfor i_episode in range(MAX_EPISODES):\n    s = env.reset()  # (coord_x, coord_y, vel_x, vel_y, angle, angular_vel, l_leg_on_ground, r_leg_on_ground)\n    ep_r = 0\n    while True:\n        if total_steps > MEMORY_CAPACITY: env.render()\n        a = RL.choose_action(s)\n        s_, r, done, _ = env.step(a)\n        if r == -100: r = -30\n        r /= r_scale\n\n        ep_r += r\n        RL.store_transition(s, a, r, s_)\n        if total_steps > MEMORY_CAPACITY:\n            RL.learn()\n        if done:\n            land = '| Landed' if r == 100/r_scale else '| ------'\n            running_r = 0.99 * running_r + 0.01 * ep_r\n            print('Epi: ', i_episode,\n                  land,\n                  '| Epi_R: ', round(ep_r, 2),\n                  '| Running_R: ', round(running_r, 2),\n                  '| Epsilon: ', round(RL.epsilon, 3))\n            break\n\n        s = s_\n        total_steps += 1\n\n"
  }
]