[
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2017 Liang Zeng\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "## My Solution to Assignments of CS234\nThis is my solution to three assignments of CS234.<br>\n[CS234: Deep Reinforcement Learning](http://cs234.stanford.edu/) is\nan interesting class, which teaches you what is the reinforcement learning:\n Learn to make good sequences of decisions. This class provides some basic knowledge and insights of cutting-edge research in reinforcement learning. More details are as follows:\n* Define the key features of RL vs AI & other ML \n* Define MDP, POMDP, bandit, batch offline RL, online RL\n* Describe the exploration vs exploitation challenge and compare and contrast 2 or more approaches\n* Given an application problem (e.g. from computer vision, robotics, etc) decide if it should be formulated as a RL problem, if yes how to formulate, what algorithm (from class) is best suited to address, and justify an answer\n* Implement several RL algorithms incl. a deep RL approach\n* Describe multiple criteria for analyzing RL algorithms and evaluate algorithms on these metrics: e.g. regret, sample complexity, computational complexity, convergence, etc. 
\n* List at least two open challenges or hot topics in RL\n******\n**Note:** If you incorporate any of my source code into your own algorithm or system, please cite it clearly in your code.\n******\n\n## Table of Contents\n* [Assignment 1](https://github.com/zlpure/CS234/tree/master/assignment1)\n  * Bellman Operator Properties\n  * Value Iteration\n  * Grid Policies\n  * Frozen Lake MDP\n  * Frozen Lake Reinforcement Learning\n* [Assignment 2](https://github.com/zlpure/CS234/tree/master/assignment2)\n  * Q-learning\n  * Linear Approximation\n  * DeepMind's DQN\n  * (Bonus) Double DQN\n  * (Bonus) Dueling DQN\n* [Assignment 3](https://github.com/zlpure/CS234/tree/master/assignment3)\n  * R-max algorithm\n  * Epsilon-greedy Q-learning\n  * Expected Regret Bounds\n\n## Dependencies\n* Anaconda\n* tensorflow>=0.12\n* matplotlib\n* scipy\n* numpy\n* sklearn\n* six\n\n## Author\n[@zlpure](https://github.com/zlpure)\n"
  },
  {
    "path": "assignment1/Makefile",
    "content": "submit:\n\tsh collect_submission.sh\n\nclean:\n\trm -f assignment1.zip\n\trm -f *.pyc *.png *.npy utils/*.pyc\n\n"
  },
  {
    "path": "assignment1/collect_submission.sh",
    "content": "rm -f assignment1.zip\nzip -r assignment1.zip *.py *.ipynb\n"
  },
  {
    "path": "assignment1/lake_envs.py",
    "content": "# coding: utf-8\n\"\"\"Defines some frozen lake maps.\"\"\"\nfrom gym.envs.toy_text import frozen_lake, discrete\nfrom gym.envs.registration import register\n\n\nregister(\n    id='Deterministic-4x4-FrozenLake-v0',\n    entry_point='gym.envs.toy_text.frozen_lake:FrozenLakeEnv',\n    kwargs={'map_name': '4x4',\n            'is_slippery': False})\n\nregister(\n    id='Deterministic-8x8-FrozenLake-v0',\n    entry_point='gym.envs.toy_text.frozen_lake:FrozenLakeEnv',\n    kwargs={'map_name': '8x8',\n            'is_slippery': False})\n\nregister(\n    id='Stochastic-4x4-FrozenLake-v0',\n    entry_point='gym.envs.toy_text.frozen_lake:FrozenLakeEnv',\n    kwargs={'map_name': '4x4',\n            'is_slippery': True})\n"
  },
  {
    "path": "assignment1/log",
    "content": "\n    Winter is here. You and your friends were tossing around a frisbee at the park\n    when you made a wild throw that left the frisbee out in the middle of the lake.\n    The water is mostly frozen, but there are a few holes where the ice has melted.\n    If you step into one of those holes, you'll fall into the freezing water.\n    At this time, there's an international frisbee shortage, so it's absolutely imperative that\n    you navigate across the lake and retrieve the disc.\n    However, the ice is slippery, so you won't always move in the direction you intend.\n    The surface is described using a grid like the following\n\n        SFFF\n        FHFH\n        FFFH\n        HFFG\n\n    S : starting point, safe\n    F : frozen surface, safe\n    H : hole, fall to your doom\n    G : goal, where the frisbee is located\n\n    The episode ends when you reach the goal or fall in a hole.\n    You receive a reward of 1 if you reach the goal, and zero otherwise.\n\n    \n[(0.3333333333333333, 13, 0.0, False), (0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 9, 0.0, False)]\n[(0.3333333333333333, 13, 0.0, False), (0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 15, 1.0, True)]\n1.0 1\n0.3 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.09 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.027 0\n0.0081 0\n0.0081 0\n0.09 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 
0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 
0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n0.000729 0\n0.00243 0\n0.00243 0\n0.0081 1\n0.0081 0\n0.027 0\n0.00243 0\n0.0081 0\n0.000729 0\n0.00243 0\n0.0081 0\n0.0081 0\n0.09 0\n0.00243 0\n0.0081 0\n0.027 1\n0.0081 0\n0.09 0\n0.027 0\n0.3 0\n0.027 0\n0.09 0\n0.3 1\n0.09 0\n0.3 0\n1.0 1\n[ 0.002  0.008  0.027  0.008  0.008  0.     0.09   0.     0.027  0.09   0.3\n  0.     0.     0.3    1.     0.   ]\n[0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0]\n"
  },
  {
    "path": "assignment1/model_based_learning.py",
    "content": "### Episodic Model Based Learning using Maximum Likelihood Estimate of the Environment\n\n# Do not change the arguments and output types of any of the functions provided! You may debug in Main and elsewhere.\n\nimport numpy as np\nimport gym\nimport time\nfrom lake_envs import *\nimport matplotlib.pyplot as plt\nfrom tqdm import *\n\nfrom vi_and_pi import value_iteration\nfrom vi_and_pi import policy_iteration\n\ndef initialize_P(nS, nA):\n    \"\"\"Initializes a uniformly random model of the environment with 0 rewards.\n\n    Parameters\n    ----------\n    nS: int\n        Number of states\n    nA: int\n        Number of actions\n\n    Returns\n    -------\n    P: np.array of shape [nS x nA x nS x 4] where items are tuples representing transition information\n        P[state][action] is a list of (prob, next_state, reward, done) tuples.\n    \"\"\"\n    P = [[[(1.0/nS, i, 0, False) for i in range(nS)] for _ in range(nA)] for _ in range(nS)]\n\n    return P\n\ndef initialize_counts(nS, nA):\n    \"\"\"Initializes a counts array.\n\n    Parameters\n    ----------\n    nS: int\n        Number of states\n    nA: int\n        Number of actions\n\n    Returns\n    -------\n    counts: np.array of shape [nS x nA x nS]\n        counts[state][action][next_state] is the number of times that doing \"action\" at state \"state\" transitioned to \"next_state\"\n    \"\"\"\n    counts = [[[0 for _ in range(nS)] for _ in range(nA)] for _ in range(nS)]\n\n    return counts\n\ndef initialize_rewards(nS, nA):\n    \"\"\"Initializes a rewards array. 
Values represent running averages.\n\n    Parameters\n    ----------\n    nS: int\n        Number of states\n    nA: int\n        Number of actions\n\n    Returns\n    -------\n    rewards: array of shape [nS x nA x nS]\n        counts[state][action][next_state] is the running average of rewards of doing \"action\" at \"state\" transtioned to \"next_state\"\n    \"\"\"\n    rewards = [[[0 for _ in range (nS)] for _ in range(nA)] for _ in range(nS)]\n\n    return rewards\n\ndef counts_and_rewards_to_P(counts, rewards, terminal_state):\n    \"\"\"Converts counts and rewards arrays to a P array consistent with the Gym environment data structure for a model of the environment.\n    Use this function to convert your counts and rewards arrays to a P that you can use in value iteration.\n\n    Parameters\n    ----------\n    counts: array of shape [nS x nA x nS]\n        counts[state][action][next_state] is the number of times that doing \"action\" at state \"state\" transitioned to \"next_state\"\n    rewards: array of shape [nS x nA x nS]\n        counts[state][action][next_state] is the running average of rewards of doing \"action\" at \"state\" transtioned to \"next_state\"\n\n    Returns\n    -------\n    P: np.array of shape [nS x nA x nS' x 4] where items are tuples representing transition information\n        P[state][action] is a list of (prob, next_state, reward, done) tuples.\n    \"\"\"\n    nS = len(counts)\n    nA = len(counts[0])\n    P = [[[] for _ in range(nA)] for _ in range(nS)]\n   \n    for state in range(nS):\n        for action in range(nA):\n            if sum(counts[state][action]) != 0:\n                for next_state in range(nS):\n                    if counts[state][action][next_state] != 0:\n                        prob = float(counts[state][action][next_state]) / float(sum(counts[state][action]))\n                        reward = rewards[state][action][next_state]\n                        if next_state in terminal_state:\n                    
        P[state][action].append((prob, next_state, reward, True))\n                        else:\n                            P[state][action].append((prob, next_state, reward, False))\n            else:\n                prob = 1.0 / float(nS)\n                for next_state in range(nS):\n                    P[state][action].append((prob, next_state, 0, False))\n    \n    #for action in range(nA):\n    #P[nS-2][2][nS-1] = (1.0, nS-1, 1, True)\n\n    return P\n\ndef update_mdp_model_with_history(counts, rewards, history):\n    \"\"\"Given a history of an entire episode, update the count and rewards arrays\n\n    Parameters\n    ----------\n    counts: array of shape [nS x nA x nS]\n        counts[state][action][next_state] is the number of times that doing \"action\" at state \"state\" transitioned to \"next_state\"\n    rewards: array of shape [nS x nA x nS]\n        counts[state][action][next_state] is the running average of rewards of doing \"action\" at \"state\" transtioned to \"next_state\"\n    history: \n        a list of [state, action, reward, next_state, done]\n    \"\"\"\n\n    # HINT: For terminal states, we define that the probability of any action returning the state to itself is 1 (with zero reward)\n    # Make sure you record this information in your counts array by updating the counts for this accordingly for your\n    # value iteration to work.\n\n    ############################\n    # YOUR IMPLEMENTATION HERE #\n    for item in history:\n        #print item\n        (state, action, reward, next_state, done) = item\n        #if not done:\n        #    counts[state][action][next_state] += 1\n        #    rewards[state][action][next_state] = float(rewards[state][action][next_state]+reward) / counts[state][action][next_state]\n        #else:\n        #    counts[state][action][next_state] = 1\n        #    rewards[state][action][next_state] = float(rewards[state][action][next_state]+reward) / counts[state][action][next_state]\n        
counts[state][action][next_state] += 1\n        all_reward = float(rewards[state][action][next_state]*(counts[state][action][next_state]-1)+reward)\n        rewards[state][action][next_state] = all_reward / counts[state][action][next_state]\n    ############################\n    return counts, rewards\n\ndef learn_with_mdp_model(env, method=None, num_episodes=5000, gamma = 0.95, e = 0.8, decay_rate = 0.99):\n    \"\"\"Build a model of the environment and use value iteration to learn a policy. In the next episode, play with the new \n    policy using epsilon-greedy exploration. \n\n    Your model of the environment should be based on updating counts and rewards arrays. The counts array counts the number\n    of times that \"state\" with \"action\" led to \"next_state\", and the rewards array is the running average of rewards for \n    going from at \"state\" with \"action\" leading to \"next_state\". \n\n    For a single episode, create a list called \"history\" with all the experience\n    from that episode, then update the \"counts\" and \"rewards\" arrays using the function \"update_mdp_model_with_history\". \n\n    You may then call the prewritten function \"counts_and_rewards_to_P\" to convert your counts and rewards arrays to \n    an environment data structure P consistent with the Gym environment's one. You may then call on value_iteration(P, nS, nA) \n    to get a policy.\n\n    Parameters\n    ----------\n    env: gym.core.Environment\n        Environment to compute Q function for. Must have nS, nA, and P as\n        attributes.\n    num_episodes: int \n        Number of episodes of training.\n    gamma: float\n        Discount factor. Number in range [0, 1)\n    learning_rate: float\n        Learning rate. Number in range [0, 1)\n    e: float\n        Epsilon value used in the epsilon-greedy method. \n    decay_rate: float\n        Rate at which epsilon falls. 
Number in range [0, 1)\n\n    Returns\n    -------\n    policy: np.array\n        An array of shape [env.nS] representing the action to take at a given state.\n    \"\"\"\n\n    P = initialize_P(env.nS, env.nA)\n    counts = initialize_counts(env.nS, env.nA)\n    rewards = initialize_rewards(env.nS, env.nA)\n\n    ############################\n    # YOUR IMPLEMENTATION HERE #\n    new_policy = np.zeros((env.nS)).astype(int)\n    terminal_state = []\n    for i in range(num_episodes):\n        done = False\n        state = env.reset()\n        his = []\n        while not done:\n            if np.random.rand() > e:\n                action = new_policy[state]\n            else:\n                action = np.random.randint(env.nA)\n            nextstate, reward, done, _ = env.step(action)\n            his.append([state, action, reward, nextstate, done])\n            state = nextstate\n        if state not in terminal_state:\n            terminal_state.append(state)\n        counts, rewards = update_mdp_model_with_history(counts, rewards, his)\n        P = counts_and_rewards_to_P(counts, rewards, terminal_state)\n        _, new_policy = method(P, env.nS, env.nA, gamma)\n\n        if i%10 == 0:\n            e *= decay_rate\n    ############################\n\n    return new_policy\n\ndef render_single(env, policy):\n    \"\"\"Renders policy once on environment. Watch your agent play!\n\n    Parameters\n    ----------\n    env: gym.core.Environment\n        Environment to play on. Must have nS, nA, and P as\n        attributes.\n    Policy: np.array of shape [env.nS]\n        The action to take at a given state\n    \"\"\"\n\n    episode_reward = 0\n    state = env.reset()\n    done = False\n    while not done:\n        #env.render()\n        #time.sleep(0.5) # Seconds between frames. 
Modify as you wish.\n        action = policy[state]\n        state, reward, done, _ = env.step(action)\n        episode_reward += reward\n\n    #print \"Episode reward: %f\" % episode_reward\n    return episode_reward\n\n# Feel free to run your own debug code in main!\ndef main():\n    env = gym.make('Stochastic-4x4-FrozenLake-v0')\n    #render_single(env, policy)\n    #print policy\n    \n    score1 = []\n    score2 = []\n    average_score1 = []\n    average_score2 = []\n    for i in tqdm(np.arange(1, 5000, 50)):\n        policy1 = learn_with_mdp_model(env, method=value_iteration, num_episodes=i+1)\n        policy2 = learn_with_mdp_model(env, method=policy_iteration, num_episodes=i+1)\n        episode_reward1 = render_single(env, policy1)\n        episode_reward2 = render_single(env, policy2)\n        score1.append(episode_reward1)\n        score2.append(episode_reward2)\n    # The average lists start empty, so append each running mean\n    # (index assignment would raise IndexError).\n    for i in range(len(score1)):\n        average_score1.append(np.mean(score1[:i+1]))\n        average_score2.append(np.mean(score2[:i+1]))\n    plt.plot(np.arange(1, 5000, 50),np.array(average_score1))\n    plt.plot(np.arange(1, 5000, 50),np.array(average_score2))\n    plt.title('The running average score of the model-based learning agent')\n    plt.xlabel('training episodes')\n    plt.ylabel('score')\n    plt.legend(['value-iteration', 'policy-iteration'], loc='upper right')\n    #plt.show()\n    plt.savefig('model-based.jpg')\n    \nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "assignment1/model_free_learning.py",
    "content": "### Episode model free learning using Q-learning and SARSA\n\n# Do not change the arguments and output types of any of the functions provided! You may debug in Main and elsewhere.\n\nimport numpy as np\nimport gym\nimport time\nfrom lake_envs import *\nimport matplotlib.pyplot as plt\nfrom tqdm import *\n\ndef learn_Q_QLearning(env, num_episodes=2000, gamma=0.95, lr=0.1, e=0.8, decay_rate=0.99):\n    \"\"\"Learn state-action values using the Q-learning algorithm with epsilon-greedy exploration strategy.\n    Update Q at the end of every episode.\n\n    Parameters\n    ----------\n    env: gym.core.Environment\n        Environment to compute Q function for. Must have nS, nA, and P as\n        attributes.\n    num_episodes: int \n        Number of episodes of training.\n    gamma: float\n        Discount factor. Number in range [0, 1)\n    learning_rate: float\n        Learning rate. Number in range [0, 1)\n    e: float\n        Epsilon value used in the epsilon-greedy method. \n    decay_rate: float\n        Rate at which epsilon falls. 
Number in range [0, 1)\n\n    Returns\n    -------\n    np.array\n        An array of shape [env.nS x env.nA] representing state, action values\n    \"\"\"\n    ############################\n    # YOUR IMPLEMENTATION HERE #\n    q_value = np.zeros([env.nS, env.nA])  \n    for i in range(num_episodes):\n        done = False\n        state = env.reset()\n        while not done:\n            if np.random.rand() > e:\n                action = np.argmax(q_value[state])\n            else:\n                action = np.random.randint(env.nA)\n            nextstate, reward, done, _ = env.step(action)\n            q_value[state][action] = (1-lr)*q_value[state][action]+lr*(reward+gamma*np.max(q_value[nextstate]))\n            state = nextstate\n        if i%10 == 0:\n            e *= decay_rate        \n    '''\n    print np.mean(q_value)\n    \n    plt.plot(np.arange(num_episodes),np.array(score))\n    plt.title('The running average score of the Q-learning agent')\n    plt.xlabel('traning episodes')\n    plt.ylabel('score')\n    #plt.show()\n    plt.savefig('c.jpg')\n    '''\n    ############################\n    return q_value\n\ndef learn_Q_SARSA(env, num_episodes=2000, gamma=0.95, lr=0.1, e=0.8, decay_rate=0.99):\n    \"\"\"Learn state-action values using the SARSA algorithm with epsilon-greedy exploration strategy\n    Update Q at the end of every episode.\n\n    Parameters\n    ----------\n    env: gym.core.Environment\n        Environment to compute Q function for. Must have nS, nA, and P as\n        attributes.\n    num_episodes: int \n        Number of episodes of training.\n    gamma: float\n        Discount factor. Number in range [0, 1)\n    learning_rate: float\n        Learning rate. Number in range [0, 1)\n    e: float\n        Epsilon value used in the epsilon-greedy method. \n    decay_rate: float\n        Rate at which epsilon falls. 
Number in range [0, 1)\n\n    Returns\n    -------\n    np.array\n        An array of shape [env.nS x env.nA] representing state-action values\n    \"\"\"\n    ############################\n    # YOUR IMPLEMENTATION HERE #\n    q_value = np.zeros([env.nS, env.nA])  \n    for i in range(num_episodes):\n        done = False\n        state = env.reset()\n        if np.random.rand() > e:\n            action = np.argmax(q_value[state])\n        else:\n            action = np.random.randint(env.nA)\n        while not done:\n            nextstate, reward, done, _ = env.step(action)\n            if np.random.rand() > e:\n                nextaction = np.argmax(q_value[nextstate])\n            else:\n                nextaction = np.random.randint(env.nA)\n            q_value[state][action] = (1-lr)*q_value[state][action]+lr*(reward+gamma*q_value[nextstate][nextaction])\n            state = nextstate\n            action = nextaction\n        if i%10 == 0:\n            e *= decay_rate\n    ############################\n\n    return q_value\n\ndef render_single_Q(env, Q):\n    \"\"\"Renders Q function once on environment. Watch your agent play!\n\n    Parameters\n    ----------\n    env: gym.core.Environment\n        Environment to play Q function on. Must have nS, nA, and P as\n        attributes.\n    Q: np.array of shape [env.nS x env.nA]\n        state-action values.\n    \"\"\"\n\n    episode_reward = 0\n    state = env.reset()\n    done = False\n    while not done:\n        #env.render()  #show frames \n        #time.sleep(0.5) # Seconds between frames. 
Modify as you wish.\n        action = np.argmax(Q[state])\n        state, reward, done, _ = env.step(action)\n        episode_reward += reward\n\n    #print \"Episode reward: %f\" % episode_reward\n    return episode_reward\n    \n# Feel free to run your own debug code in main!\ndef main():\n    env = gym.make('Stochastic-4x4-FrozenLake-v0')\n    score1 = []\n    score2 = []\n    average_score1 = []\n    average_score2 = []\n    for i in tqdm(range(4000)):\n        Q1 = learn_Q_QLearning(env, num_episodes=i+1)\n        Q2 = learn_Q_SARSA(env, num_episodes=i+1)\n        episode_reward1 = render_single_Q(env, Q1)\n        episode_reward2 = render_single_Q(env, Q2)\n        score1.append(episode_reward1)\n        score2.append(episode_reward2)\n    for i in range(4000):\n        average_score1.append(np.mean(score1[:i+1]))\n        average_score2.append(np.mean(score2[:i+1]))\n    plt.plot(np.arange(4000),np.array(average_score1))\n    plt.plot(np.arange(4000),np.array(average_score2))\n    plt.title('The running average score of the Q-learning and SARSA agents')\n    plt.xlabel('training episodes')\n    plt.ylabel('score')\n    plt.legend(['q-learning', 'sarsa'], loc='upper right')\n    #plt.show()\n    plt.savefig('model-free.jpg')\n           \nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "assignment1/requirements.txt",
    "content": "matplotlib\nnumpy"
  },
  {
    "path": "assignment1/vi_and_pi.py",
    "content": "### MDP Value Iteration and Policy Iteratoin\n# You might not need to use all parameters\n\nimport numpy as np\nimport gym\nimport time\nfrom lake_envs import *\n\nnp.set_printoptions(precision=3)\n\n\ndef value_iteration(P, nS, nA, gamma=0.9, max_iteration=20, tol=1e-3):\n    \"\"\"\n    Learn value function and policy by using value iteration method for a given\n    gamma and environment.\n\n    Parameters:\n    ----------\n    P: dictionary\n        It is from gym.core.Environment\n        P[state][action] is tuples with (probability, nextstate, reward, terminal)\n    nS: int\n        number of states\n    nA: int\n        number of actions\n    gamma: float\n        Discount factor. Number in range [0, 1)\n    max_iteration: int\n        The maximum number of iterations to run before stopping. Feel free to change it.\n    tol: float\n        Determines when value function has converged.\n    Returns:\n    ----------\n    value function: np.ndarray\n    policy: np.ndarray\n    \"\"\"\n    V = np.zeros(nS)\n    policy = np.zeros(nS, dtype=int)\n    ############################\n    # YOUR IMPLEMENTATION HERE #\n    idx = 1\n    new_V = V.copy()\n    #print P[14][2]\n    while idx<=max_iteration or np.sum(np.sqrt(np.square(new_V-V)))>tol:\n        idx += 1\n        V = new_V\n        for state in range(nS):\n            max_result = -10\n            max_idx = 0\n            for action in range(nA):\n                result = P[state][action]\n                temp = np.array(result)[:,2].mean()\n                #temp = result[0][2]\n                for num in range(len(result)):\n                    (probability, nextstate, reward, terminal) = result[num]\n                    temp += gamma*probability*V[nextstate]\n                    if max_result < temp:\n                        max_result = temp\n                        max_idx = action\n            new_V[state] = max_result\n            policy[state] = max_idx\n        #print new_V\n        
#print policy\n    ############################\n    return V, policy\n\n\ndef policy_evaluation(P, nS, nA, policy, gamma=0.9, max_iteration=100, tol=1e-3):\n    \"\"\"Evaluate the value function from a given policy.\n\n    Parameters\n    ----------\n    P: dictionary\n        It is from gym.core.Environment\n        P[state][action] is tuples with (probability, nextstate, reward, terminal)\n    nS: int\n        number of states\n    nA: int\n        number of actions\n    gamma: float\n        Discount factor. Number in range [0, 1)\n    policy: np.array\n        The policy to evaluate. Maps states to actions.\n    max_iteration: int\n        The maximum number of iterations to run before stopping. Feel free to change it.\n    tol: float\n        Determines when value function has converged.\n    Returns\n    -------\n    value function: np.ndarray\n    The value function from the given policy.\n    \"\"\"\n    ############################\n    # YOUR IMPLEMENTATION HERE #\n    value_function = np.zeros(nS)\n    for _ in range(max_iteration):\n        new_value_function = np.zeros(nS)\n        for state in range(nS):\n            # Expected one-step return under the policy's action:\n            # sum over outcomes of p * (r + gamma * V(s')).\n            for (probability, nextstate, reward, terminal) in P[state][policy[state]]:\n                new_value_function[state] += probability * (reward + gamma * value_function[nextstate])\n        # Stop once the largest per-state change is below tol.\n        converged = np.max(np.abs(new_value_function - value_function)) < tol\n        value_function = new_value_function\n        if converged:\n            break\n    ############################\n    return value_function\n\n\ndef policy_improvement(P, nS, nA, value_from_policy, policy, gamma=0.9):\n    \"\"\"Given the value function from policy improve the policy.\n\n    Parameters\n    ----------\n    P: dictionary\n        It is from gym.core.Environment\n        P[state][action] is tuples with (probability, nextstate, reward, 
terminal)\n    nS: int\n        number of states\n    nA: int\n        number of actions\n    gamma: float\n        Discount factor. Number in range [0, 1)\n    value_from_policy: np.ndarray\n        The value calculated from the policy\n    policy: np.array\n        The previous policy.\n\n    Returns\n    -------\n    new policy: np.ndarray\n        An array of integers. Each integer is the optimal action to take\n        in that state according to the environment dynamics and the\n        given value function.\n    \"\"\"    \n    ############################\n    # YOUR IMPLEMENTATION HERE #\n    q_function = np.zeros([nS,nA])\n    for state in range(nS):\n        for action in range(nA):\n            result = P[state][action]\n            for num in range(len(result)):\n                (probability, nextstate, reward, terminal) = result[num]\n                # Accumulate over every outcome; assigning the reward here\n                # would discard all but the last transition.\n                q_function[state][action] += probability*(reward+gamma*value_from_policy[nextstate])\n    new_policy = np.argmax(q_function, axis=1)\n    ############################\n    return new_policy\n\n\ndef policy_iteration(P, nS, nA, gamma=0.9, max_iteration=200, tol=1e-3):\n    \"\"\"Runs policy iteration.\n\n    You should use the policy_evaluation and policy_improvement methods to\n    implement this method.\n\n    Parameters\n    ----------\n    P: dictionary\n        It is from gym.core.Environment\n        P[state][action] is tuples with (probability, nextstate, reward, terminal)\n    nS: int\n        number of states\n    nA: int\n        number of actions\n    gamma: float\n        Discount factor. Number in range [0, 1)\n    max_iteration: int\n        The maximum number of iterations to run before stopping. 
Feel free to change it.\n    tol: float\n        Determines when value function has converged.\n    Returns:\n    ----------\n    value function: np.ndarray\n    policy: np.ndarray\n    \"\"\"\n    V = np.zeros(nS)\n    policy = np.zeros(nS, dtype=int)\n    ############################\n    # YOUR IMPLEMENTATION HERE #\n    for i in range(max_iteration):\n        # evaluate the current policy, then act greedily on its value function\n        V = policy_evaluation(P, nS, nA, policy, gamma=gamma, tol=tol)\n        new_policy = policy_improvement(P, nS, nA, V, policy, gamma=gamma)\n        # the policy is stable once improvement changes no action\n        if np.array_equal(new_policy, policy):\n            break\n        policy = new_policy\n    ############################\n    return V, policy\n\n\n\ndef example(env):\n    \"\"\"Show an example of gym\n    Parameters\n    ----------\n    env: gym.core.Environment\n        Environment to play on. Must have nS, nA, and P as\n        attributes.\n    \"\"\"\n    env.seed(0); \n    from gym.spaces import prng; prng.seed(10) # seed the action sampler so the printed locations are reproducible\n    # Generate the episode\n    ob = env.reset()\n    for t in range(100):\n        env.render()\n        a = env.action_space.sample()\n        ob, rew, done, _ = env.step(a)\n        if done:\n            break\n    assert done\n    env.render();\n\ndef render_single(env, policy):\n    \"\"\"Renders policy once on environment. Watch your agent play!\n\n    Parameters\n    ----------\n    env: gym.core.Environment\n        Environment to play on. Must have nS, nA, and P as\n        attributes.\n    policy: np.array of shape [env.nS]\n        The action to take at a given state\n    \"\"\"\n\n    episode_reward = 0\n    ob = env.reset()\n    for t in range(100):\n        env.render()\n        #time.sleep(0.5) # Seconds between frames. 
Modify as you wish.\n        a = policy[ob]\n        ob, rew, done, _ = env.step(a)\n        episode_reward += rew\n        if done:\n            break\n    assert done\n    env.render();\n    print(\"Episode reward: %f\" % episode_reward)\n\n\n# Feel free to run your own debug code in main!\n# Play around with these hyperparameters.\nif __name__ == \"__main__\":\n    env = gym.make(\"Stochastic-4x4-FrozenLake-v0\")\n    print(env.__doc__)\n    #print(\"Here is an example of state, action, reward, and next state\")\n    #example(env)\n    V_vi, p_vi = value_iteration(env.P, env.nS, env.nA, gamma=0.9, max_iteration=20, tol=1e-3)\n    #V_pi, p_pi = policy_iteration(env.P, env.nS, env.nA, gamma=0.9, max_iteration=20, tol=1e-3)\n    render_single(env, p_vi)\n"
  },
  {
    "path": "assignment2/.gitignore",
    "content": "/results"
  },
  {
    "path": "assignment2/Makefile",
    "content": "submit:\r\n\tsh collect_submission.sh\r\n\r\nclean:\r\n\trm -f assignment2.zip\r\n\trm -f *.pyc *.png *.npy utils/*.pyc\r\n\r\n"
  },
  {
    "path": "assignment2/README.md",
    "content": "# RL with Atari\r\n\r\n## Install\r\n\r\nFirst, install gym and the Atari environments. You may need to install other dependencies depending on your system.\r\n\r\n```\r\npip install gym\r\n```\r\n\r\nand then install Atari with one of the following commands (the quoted form is needed in shells that expand square brackets)\r\n```\r\npip install \"gym[atari]\"\r\npip install gym[atari]\r\n```\r\n\r\nWe also require TensorFlow version 1.0 or greater.\r\n\r\n\r\n## Environment\r\n\r\n### Pong-v0\r\n\r\n- We play against a decent AI player.\r\n- A player wins a point when the ball passes the other player; we get reward +1 when we win the point and -1 otherwise.\r\n- The episode is over when one of the players reaches 21 wins.\r\n- The final score is between -21 and +21 (lost all or won all).\r\n\r\n```python\r\n# action = int in [0, 6)\r\n# state  = (210, 160, 3) array\r\n# reward = 0 during the game, 1 if we win, -1 else\r\n```\r\n\r\nWe use a modified env where the dimension of the input is reduced to\r\n\r\n```python\r\n# state = (80, 80, 1)\r\n```\r\n\r\nwith downsampling and greyscale.\r\n\r\n## Training\r\n\r\nOnce you have implemented `q2_linear.py` (setup of the necessary TensorFlow ops) and `q3_nature.py`, test your implementation by launching `python q2_linear.py` and `python q3_nature.py`, which run your code on the Test environment.\r\n\r\nYou can launch the training of DeepMind's DQN on Pong with\r\n\r\n```\r\npython q5_train_atari_nature.py\r\n```\r\n\r\nThe default config file should be sufficient to reach good performance after 5 million steps.\r\n\r\nYou can monitor your training with TensorBoard by running, on Azure,\r\n\r\n```\r\ntensorboard --logdir=results\r\n```\r\n\r\nand then connecting to `ip-of-your-machine:6006`\r\n\r\n\r\n\r\n\r\n**Credits**\r\nAssignment code written by Guillaume Genthial and Shuhui Qu."
  },
  {
    "path": "assignment2/collect_submission.sh",
    "content": "rm -f assignment2.zip \r\nzip -r assignment2.zip . -x \"*.pyc\" \"*.git*\" \"*weights/*\" \"*README.md\" \"*collect_submission.sh\" \"*events.out*\" \"*/monitor/*\"\r\n"
  },
  {
    "path": "assignment2/configs/__init__.py",
    "content": ""
  },
  {
    "path": "assignment2/configs/frozen_lake.py",
    "content": "class config():\r\n    # env config\r\n    render_train     = False\r\n    render_test      = False\r\n    env_name         = \"Pong-v0\"\r\n    RGB              = True\r\n    overwrite_render = True\r\n\r\n    # output config\r\n    output_path  = \"results/test/\"\r\n    model_output = output_path + \"model.weights/\"\r\n    log_path     = output_path + \"log.txt\"\r\n    plot_output  = output_path + \"scores.png\"\r\n    training_path = \"results/train/\"\r\n\r\n    # model and training config\r\n    num_episodes_test = 20\r\n    grad_clip         = True\r\n    clip_val          = 10\r\n    saving_freq       = 500\r\n    log_freq          = 50\r\n    eval_freq         = 50000\r\n    soft_epsilon      = 0.05\r\n\r\n    # nature paper hyper params\r\n    nsteps_train       = 2000*200\r\n    batch_size         = 32\r\n    buffer_size        = 50000\r\n    target_update_freq = 5000\r\n    gamma              = 0.99\r\n    learning_freq      = 1\r\n    state_history      = 1\r\n    skip_frame         = 1\r\n    lr                 = 0.1\r\n    eps_begin          = 0.1\r\n    eps_end            = 0.01\r\n    eps_nsteps         = nsteps_train\r\n    learning_start     = 5000\r\n"
  },
  {
    "path": "assignment2/configs/q2_linear.py",
    "content": "class config():\r\n    # env config\r\n    render_train     = False\r\n    render_test      = False\r\n    overwrite_render = True\r\n    record           = False\r\n    high             = 255.\r\n\r\n    # output config\r\n    output_path  = \"results/q2_linear/\"\r\n    model_output = output_path + \"model.weights/\"\r\n    log_path     = output_path + \"log.txt\"\r\n    plot_output  = output_path + \"scores.png\"\r\n\r\n    # model and training config\r\n    num_episodes_test = 20\r\n    grad_clip         = True\r\n    clip_val          = 10\r\n    saving_freq       = 5000\r\n    log_freq          = 50\r\n    eval_freq         = 1000\r\n    soft_epsilon      = 0\r\n\r\n    # hyper params\r\n    nsteps_train       = 10000\r\n    batch_size         = 32\r\n    buffer_size        = 1000\r\n    target_update_freq = 500\r\n    gamma              = 0.99\r\n    learning_freq      = 4\r\n    state_history      = 4\r\n    lr_begin           = 0.005\r\n    lr_end             = 0.001\r\n    lr_nsteps          = nsteps_train/2\r\n    eps_begin          = 1\r\n    eps_end            = 0.01\r\n    eps_nsteps         = nsteps_train/2\r\n    learning_start     = 200\r\n"
  },
  {
    "path": "assignment2/configs/q3_nature.py",
    "content": "class config():\r\n    # env config\r\n    render_train     = False\r\n    render_test      = False\r\n    overwrite_render = True\r\n    record           = False\r\n    high             = 255.\r\n\r\n    # output config\r\n    output_path  = \"results/q3_nature/\"\r\n    model_output = output_path + \"model.weights/\"\r\n    log_path     = output_path + \"log.txt\"\r\n    plot_output  = output_path + \"scores.png\"\r\n\r\n    # model and training config\r\n    num_episodes_test = 20\r\n    grad_clip         = True\r\n    clip_val          = 10\r\n    saving_freq       = 5000\r\n    log_freq          = 50\r\n    eval_freq         = 100\r\n    soft_epsilon      = 0\r\n\r\n    # hyper params\r\n    nsteps_train       = 1000\r\n    batch_size         = 32\r\n    buffer_size        = 500\r\n    target_update_freq = 500\r\n    gamma              = 0.99\r\n    learning_freq      = 4\r\n    state_history      = 4\r\n    lr_begin           = 0.00025\r\n    lr_end             = 0.0001\r\n    lr_nsteps          = nsteps_train/2\r\n    eps_begin          = 1\r\n    eps_end            = 0.01\r\n    eps_nsteps         = nsteps_train/2\r\n    learning_start     = 200\r\n    "
  },
  {
    "path": "assignment2/configs/q4_train_atari_linear.py",
    "content": "class config():\r\n    # env config\r\n    render_train     = False\r\n    render_test      = False\r\n    env_name         = \"Pong-v0\"\r\n    overwrite_render = True\r\n    record           = True\r\n    high             = 255.\r\n\r\n    # output config\r\n    output_path  = \"results/q4_train_atari_linear/\"\r\n    model_output = output_path + \"model.weights/\"\r\n    log_path     = output_path + \"log.txt\"\r\n    plot_output  = output_path + \"scores.png\"\r\n    record_path  = output_path + \"monitor/\"\r\n\r\n    # model and training config\r\n    num_episodes_test = 50\r\n    grad_clip         = True\r\n    clip_val          = 10\r\n    saving_freq       = 250000\r\n    log_freq          = 50\r\n    eval_freq         = 250000\r\n    record_freq       = 250000\r\n    soft_epsilon      = 0.05\r\n\r\n    # nature paper hyper params\r\n    nsteps_train       = 5000000\r\n    batch_size         = 32\r\n    buffer_size        = 1000000\r\n    target_update_freq = 10000\r\n    gamma              = 0.99\r\n    learning_freq      = 4\r\n    state_history      = 4\r\n    skip_frame         = 4\r\n    lr_begin           = 0.00025\r\n    lr_end             = 0.00005\r\n    lr_nsteps          = nsteps_train/2\r\n    eps_begin          = 1\r\n    eps_end            = 0.1\r\n    eps_nsteps         = 1000000\r\n    learning_start     = 50000\r\n"
  },
  {
    "path": "assignment2/configs/q5_train_atari_nature.py",
    "content": "class config():\r\n    # env config\r\n    render_train     = False\r\n    render_test      = False\r\n    env_name         = \"Pong-v0\"\r\n    overwrite_render = True\r\n    record           = True\r\n    high             = 255.\r\n\r\n    # output config\r\n    output_path  = \"results/q5_train_atari_nature/\"\r\n    model_output = output_path + \"model.weights/\"\r\n    log_path     = output_path + \"log.txt\"\r\n    plot_output  = output_path + \"scores.png\"\r\n    record_path  = output_path + \"monitor/\"\r\n\r\n    # model and training config\r\n    num_episodes_test = 50\r\n    grad_clip         = True\r\n    clip_val          = 10\r\n    saving_freq       = 250000\r\n    log_freq          = 50\r\n    eval_freq         = 250000\r\n    record_freq       = 250000\r\n    soft_epsilon      = 0.05\r\n\r\n    # nature paper hyper params\r\n    nsteps_train       = 5000000\r\n    batch_size         = 32\r\n    buffer_size        = 1000000\r\n    target_update_freq = 10000\r\n    gamma              = 0.99\r\n    learning_freq      = 4\r\n    state_history      = 4\r\n    skip_frame         = 4\r\n    lr_begin           = 0.00025\r\n    lr_end             = 0.00005\r\n    lr_nsteps          = nsteps_train/2\r\n    eps_begin          = 1\r\n    eps_end            = 0.1\r\n    eps_nsteps         = 1000000\r\n    learning_start     = 50000\r\n"
  },
  {
    "path": "assignment2/configs/q6_bonus_question.py",
    "content": "class config():\r\n    # env config\r\n    render_train     = False\r\n    render_test      = False\r\n    env_name         = \"Pong-v0\"\r\n    overwrite_render = True\r\n    record           = True\r\n    high             = 255.\r\n\r\n    # output config\r\n    output_path  = \"results/q6_bonus_question/\"\r\n    model_output = output_path + \"model.weights/\"\r\n    log_path     = output_path + \"log.txt\"\r\n    plot_output  = output_path + \"scores.png\"\r\n    record_path  = output_path + \"monitor/\"\r\n\r\n    # model and training config\r\n    num_episodes_test = 50\r\n    grad_clip         = True\r\n    clip_val          = 10\r\n    saving_freq       = 250000\r\n    log_freq          = 50\r\n    eval_freq         = 250000\r\n    record_freq       = 250000\r\n    soft_epsilon      = 0.05\r\n\r\n    # nature paper hyper params\r\n    nsteps_train       = 10000000\r\n    batch_size         = 32\r\n    buffer_size        = 1000000\r\n    target_update_freq = 10000\r\n    gamma              = 0.99\r\n    learning_freq      = 4\r\n    state_history      = 4\r\n    skip_frame         = 4\r\n    lr_begin           = 0.00025\r\n    lr_end             = 0.00005\r\n    lr_nsteps          = nsteps_train/2\r\n    eps_begin          = 1\r\n    eps_end            = 0.1\r\n    eps_nsteps         = 1000000\r\n    learning_start     = 50000\r\n    "
  },
  {
    "path": "assignment2/configs/test.py",
    "content": "class config():\r\n    # env config\r\n    render_train     = True\r\n    render_test      = False\r\n    env_name         = \"Pong-v0\"\r\n    overwrite_render = True\r\n    record           = True\r\n    high             = 255.\r\n\r\n    # output config\r\n    output_path  = \"results/test/\"\r\n    model_output = output_path + \"model.weights/\"\r\n    log_path     = output_path + \"log.txt\"\r\n    plot_output  = output_path + \"scores.png\"\r\n    record_path  = output_path + \"video/\"\r\n\r\n\r\n    # model and training config\r\n    num_episodes_test = 10\r\n    grad_clip         = True\r\n    clip_val          = 10\r\n    saving_freq       = 1000\r\n    log_freq          = 50\r\n    eval_freq         = 1000\r\n    record_freq       = 1000\r\n    soft_epsilon      = 0.05\r\n\r\n    # nature paper hyper params\r\n    nsteps_train       = 10000\r\n    batch_size         = 32\r\n    buffer_size        = 1000\r\n    target_update_freq = 1000\r\n    gamma              = 0.99\r\n    learning_freq      = 4\r\n    state_history      = 4\r\n    skip_frame         = 4\r\n    lr                 = 0.0001\r\n    eps_begin          = 1\r\n    eps_end            = 0.1\r\n    eps_nsteps         = 1000\r\n    learning_start     = 500\r\n"
  },
  {
    "path": "assignment2/core/__init__.py",
    "content": ""
  },
  {
    "path": "assignment2/core/deep_q_learning.py",
    "content": "import os\r\nimport numpy as np\r\nimport tensorflow as tf\r\nimport time\r\n\r\nfrom q_learning import QN\r\n\r\n\r\nclass DQN(QN):\r\n    \"\"\"\r\n    Abstract class for Deep Q Learning\r\n    \"\"\"\r\n    def add_placeholders_op(self):\r\n        raise NotImplementedError\r\n\r\n\r\n    def get_q_values_op(self, state, scope, reuse=False):\r\n        \"\"\"\r\n        Set Q values, of shape = (batch_size, num_actions)\r\n\r\n        Args:\r\n            state: processed state tensor, as passed in by build()\r\n            scope: name of the variable scope\r\n            reuse: whether to reuse variables in that scope\r\n        \"\"\"\r\n        raise NotImplementedError\r\n\r\n\r\n    def add_update_target_op(self, q_scope, target_q_scope):\r\n        \"\"\"\r\n        Update_target_op will be called periodically \r\n        to copy Q network to target Q network\r\n    \r\n        Args:\r\n            q_scope: name of the scope of variables for q\r\n            target_q_scope: name of the scope of variables for the target\r\n                network\r\n        \"\"\"\r\n        raise NotImplementedError\r\n\r\n\r\n    def add_loss_op(self, q, target_q):\r\n        \"\"\"\r\n        Set (Q_target - Q)^2\r\n        \"\"\"\r\n        raise NotImplementedError\r\n\r\n\r\n    def add_optimizer_op(self, scope):\r\n        \"\"\"\r\n        Set training op wrt loss for variables in scope\r\n        \"\"\"\r\n        raise NotImplementedError\r\n\r\n\r\n    def process_state(self, state):\r\n        \"\"\"\r\n        Processing of state\r\n\r\n        State placeholders are tf.uint8 for fast transfer to GPU\r\n        Need to cast it to float32 for the rest of the tf graph.\r\n\r\n        Args:\r\n            state: node of tf graph of shape = (batch_size, height, width, nchannels)\r\n                    of type tf.uint8.\r\n                    Values are divided by config.high, so 0..255 -> 0..1\r\n        \"\"\"\r\n        state = tf.cast(state, tf.float32)\r\n        state /= self.config.high\r\n\r\n        return state\r\n\r\n\r\n    def build(self):\r\n        \"\"\"\r\n        Build model by adding all necessary variables\r\n        \"\"\"\r\n        # add 
placeholders\r\n        self.add_placeholders_op()\r\n\r\n        # compute Q values of state\r\n        s = self.process_state(self.s)\r\n        self.q = self.get_q_values_op(s, scope=\"q\", reuse=False)\r\n\r\n        # compute Q values of next state\r\n        sp = self.process_state(self.sp)\r\n        self.target_q = self.get_q_values_op(sp, scope=\"target_q\", reuse=False)\r\n\r\n        # add update operator for target network\r\n        self.add_update_target_op(\"q\", \"target_q\")\r\n\r\n        # add square loss\r\n        self.add_loss_op(self.q, self.target_q)\r\n\r\n        # add optimizer for the main network\r\n        self.add_optimizer_op(\"q\")\r\n\r\n\r\n    def initialize(self):\r\n        \"\"\"\r\n        Assumes the graph has been constructed\r\n        Creates a tf Session and runs the variable initializer\r\n        \"\"\"\r\n        # create tf session\r\n        self.sess = tf.Session()\r\n\r\n        # tensorboard stuff\r\n        self.add_summary()\r\n\r\n        # initialize all variables\r\n        init = tf.global_variables_initializer()\r\n        self.sess.run(init)\r\n\r\n        # synchronise q and target_q networks\r\n        self.sess.run(self.update_target_op)\r\n\r\n        # for saving network weights\r\n        self.saver = tf.train.Saver()\r\n\r\n       \r\n    def add_summary(self):\r\n        \"\"\"\r\n        Tensorboard stuff\r\n        \"\"\"\r\n        # extra placeholders to log stuff from python\r\n        self.avg_reward_placeholder = tf.placeholder(tf.float32, shape=(), name=\"avg_reward\")\r\n        self.max_reward_placeholder = tf.placeholder(tf.float32, shape=(), name=\"max_reward\")\r\n        self.std_reward_placeholder = tf.placeholder(tf.float32, shape=(), name=\"std_reward\")\r\n\r\n        self.avg_q_placeholder  = tf.placeholder(tf.float32, shape=(), name=\"avg_q\")\r\n        self.max_q_placeholder  = tf.placeholder(tf.float32, shape=(), name=\"max_q\")\r\n        self.std_q_placeholder  = 
tf.placeholder(tf.float32, shape=(), name=\"std_q\")\r\n\r\n        self.eval_reward_placeholder = tf.placeholder(tf.float32, shape=(), name=\"eval_reward\")\r\n\r\n        # add placeholders from the graph\r\n        tf.summary.scalar(\"loss\", self.loss)\r\n        tf.summary.scalar(\"grads norm\", self.grad_norm)\r\n\r\n        # extra summaries from python -> placeholders\r\n        tf.summary.scalar(\"Avg Reward\", self.avg_reward_placeholder)\r\n        tf.summary.scalar(\"Max Reward\", self.max_reward_placeholder)\r\n        tf.summary.scalar(\"Std Reward\", self.std_reward_placeholder)\r\n\r\n        tf.summary.scalar(\"Avg Q\", self.avg_q_placeholder)\r\n        tf.summary.scalar(\"Max Q\", self.max_q_placeholder)\r\n        tf.summary.scalar(\"Std Q\", self.std_q_placeholder)\r\n\r\n        tf.summary.scalar(\"Eval Reward\", self.eval_reward_placeholder)\r\n            \r\n        # logging\r\n        self.merged = tf.summary.merge_all()\r\n        self.file_writer = tf.summary.FileWriter(self.config.output_path, \r\n                                                self.sess.graph)\r\n\r\n\r\n\r\n    def save(self):\r\n        \"\"\"\r\n        Saves session\r\n        \"\"\"\r\n        if not os.path.exists(self.config.model_output):\r\n            os.makedirs(self.config.model_output)\r\n\r\n        self.saver.save(self.sess, self.config.model_output)\r\n\r\n\r\n    def get_best_action(self, state):\r\n        \"\"\"\r\n        Return best action\r\n\r\n        Args:\r\n            state: 4 consecutive observations from gym\r\n        Returns:\r\n            action: (int)\r\n            action_values: (np array) q values for all actions\r\n        \"\"\"\r\n        action_values = self.sess.run(self.q, feed_dict={self.s: [state]})[0]\r\n        return np.argmax(action_values), action_values\r\n\r\n\r\n    def update_step(self, t, replay_buffer, lr):\r\n        \"\"\"\r\n        Performs an update of parameters by sampling from replay_buffer\r\n\r\n       
 Args:\r\n            t: (int) current training step, counted across episodes\r\n            replay_buffer: ReplayBuffer instance; .sample() gives batches\r\n            lr: (float) learning rate\r\n        Returns:\r\n            loss: (Q - Q_target)^2\r\n            grad_norm: norm of the gradients\r\n        \"\"\"\r\n\r\n        s_batch, a_batch, r_batch, sp_batch, done_mask_batch = replay_buffer.sample(\r\n            self.config.batch_size)\r\n\r\n\r\n        fd = {\r\n            # inputs\r\n            self.s: s_batch,\r\n            self.a: a_batch,\r\n            self.r: r_batch,\r\n            self.sp: sp_batch, \r\n            self.done_mask: done_mask_batch,\r\n            self.lr: lr, \r\n            # extra info\r\n            self.avg_reward_placeholder: self.avg_reward, \r\n            self.max_reward_placeholder: self.max_reward, \r\n            self.std_reward_placeholder: self.std_reward, \r\n            self.avg_q_placeholder: self.avg_q, \r\n            self.max_q_placeholder: self.max_q, \r\n            self.std_q_placeholder: self.std_q, \r\n            self.eval_reward_placeholder: self.eval_reward, \r\n        }\r\n\r\n        loss_eval, grad_norm_eval, summary, _ = self.sess.run([self.loss, self.grad_norm, \r\n                                                 self.merged, self.train_op], feed_dict=fd)\r\n\r\n\r\n        # tensorboard stuff\r\n        self.file_writer.add_summary(summary, t)\r\n        \r\n        return loss_eval, grad_norm_eval\r\n\r\n\r\n    def update_target_params(self):\r\n        \"\"\"\r\n        Update parameters of Q' with parameters of Q\r\n        \"\"\"\r\n        self.sess.run(self.update_target_op)\r\n\r\n"
  },
  {
    "path": "assignment2/core/q_learning.py",
    "content": "import os\r\nimport gym\r\nimport numpy as np\r\nimport logging\r\nimport time\r\nimport sys\r\nfrom gym import wrappers\r\nfrom collections import deque\r\n\r\nfrom utils.general import get_logger, Progbar, export_plot\r\nfrom utils.replay_buffer import ReplayBuffer\r\nfrom utils.preprocess import greyscale\r\nfrom utils.wrappers import PreproWrapper, MaxAndSkipEnv\r\n\r\n\r\nclass QN(object):\r\n    \"\"\"\r\n    Abstract Class for implementing a Q Network\r\n    \"\"\"\r\n    def __init__(self, env, config, logger=None):\r\n        \"\"\"\r\n        Initialize Q Network and env\r\n\r\n        Args:\r\n            config: class with hyperparameters\r\n            logger: logger instance from logging module\r\n        \"\"\"\r\n        # directory for training outputs\r\n        if not os.path.exists(config.output_path):\r\n            os.makedirs(config.output_path)\r\n            \r\n        # store hyper params\r\n        self.config = config\r\n        self.logger = logger\r\n        if logger is None:\r\n            self.logger = get_logger(config.log_path)\r\n        self.env = env\r\n\r\n        # build model\r\n        self.build()\r\n\r\n\r\n    def build(self):\r\n        \"\"\"\r\n        Build model\r\n        \"\"\"\r\n        pass\r\n\r\n\r\n    @property\r\n    def policy(self):\r\n        \"\"\"\r\n        model.policy(state) = action\r\n        \"\"\"\r\n        return lambda state: self.get_action(state)\r\n\r\n\r\n    def save(self):\r\n        \"\"\"\r\n        Save model parameters\r\n\r\n        Args:\r\n            model_path: (string) directory\r\n        \"\"\"\r\n        pass\r\n\r\n\r\n    def initialize(self):\r\n        \"\"\"\r\n        Initialize variables if necessary\r\n        \"\"\"\r\n        pass\r\n\r\n\r\n    def get_best_action(self, state):\r\n        \"\"\"\r\n        Returns best action according to the network\r\n    \r\n        Args:\r\n            state: observation from gym\r\n        Returns:\r\n      
      tuple: action, q values\r\n        \"\"\"\r\n        raise NotImplementedError\r\n\r\n\r\n    def get_action(self, state):\r\n        \"\"\"\r\n        Returns action with some epsilon strategy\r\n\r\n        Args:\r\n            state: observation from gym\r\n        \"\"\"\r\n        if np.random.random() < self.config.soft_epsilon:\r\n            return self.env.action_space.sample()\r\n        else:\r\n            return self.get_best_action(state)[0]\r\n\r\n\r\n    def update_target_params(self):\r\n        \"\"\"\r\n        Update params of Q' with params of Q\r\n        \"\"\"\r\n        raise NotImplementedError\r\n\r\n\r\n    def init_averages(self):\r\n        \"\"\"\r\n        Defines extra attributes for tensorboard\r\n        \"\"\"\r\n        self.avg_reward = -21.\r\n        self.max_reward = -21.\r\n        self.std_reward = 0\r\n\r\n        self.avg_q = 0\r\n        self.max_q = 0\r\n        self.std_q = 0\r\n        \r\n        self.eval_reward = -21.\r\n\r\n\r\n    def update_averages(self, rewards, max_q_values, q_values, scores_eval):\r\n        \"\"\"\r\n        Update the averages\r\n\r\n        Args:\r\n            rewards: deque\r\n            max_q_values: deque\r\n            q_values: deque\r\n            scores_eval: list\r\n        \"\"\"\r\n        self.avg_reward = np.mean(rewards)\r\n        self.max_reward = np.max(rewards)\r\n        self.std_reward = np.sqrt(np.var(rewards) / len(rewards))\r\n\r\n        self.max_q      = np.mean(max_q_values)\r\n        self.avg_q      = np.mean(q_values)\r\n        self.std_q      = np.sqrt(np.var(q_values) / len(q_values))\r\n\r\n        if len(scores_eval) > 0:\r\n            self.eval_reward = scores_eval[-1]\r\n\r\n\r\n    def train(self, exp_schedule, lr_schedule):\r\n        \"\"\"\r\n        Performs training of Q\r\n\r\n        Args:\r\n            exp_schedule: Exploration instance s.t.\r\n                exp_schedule.get_action(best_action) returns an action\r\n            
lr_schedule: Schedule for learning rate\r\n        \"\"\"\r\n\r\n        # initialize replay buffer and variables\r\n        replay_buffer = ReplayBuffer(self.config.buffer_size, self.config.state_history)\r\n        rewards = deque(maxlen=self.config.num_episodes_test)\r\n        max_q_values = deque(maxlen=1000)\r\n        q_values = deque(maxlen=1000)\r\n        self.init_averages()\r\n\r\n        t = last_eval = last_record = 0 # time control of nb of steps\r\n        scores_eval = [] # list of scores computed at iteration time\r\n        scores_eval += [self.evaluate()]\r\n        \r\n        prog = Progbar(target=self.config.nsteps_train)\r\n\r\n        # interact with environment\r\n        while t < self.config.nsteps_train:\r\n            total_reward = 0\r\n            state = self.env.reset()\r\n            while True:\r\n                t += 1\r\n                last_eval += 1\r\n                last_record += 1\r\n                if self.config.render_train: self.env.render()\r\n                # replay memory stuff\r\n                idx      = replay_buffer.store_frame(state)\r\n                q_input = replay_buffer.encode_recent_observation()\r\n\r\n                # choose action according to current Q and exploration\r\n                # (q_vals is kept separate so the q_values deque is not overwritten)\r\n                best_action, q_vals = self.get_best_action(q_input)\r\n                action              = exp_schedule.get_action(best_action)\r\n\r\n                # store q values\r\n                max_q_values.append(max(q_vals))\r\n                q_values += list(q_vals)\r\n\r\n                # perform action in env\r\n                new_state, reward, done, info = self.env.step(action)\r\n\r\n                # store the transition\r\n                replay_buffer.store_effect(idx, action, reward, done)\r\n                state = new_state\r\n\r\n                # perform a training step\r\n                loss_eval, grad_eval = self.train_step(t, replay_buffer, lr_schedule.epsilon)\r\n\r\n                # 
logging stuff\r\n                if ((t > self.config.learning_start) and (t % self.config.log_freq == 0) and\r\n                   (t % self.config.learning_freq == 0)):\r\n                    self.update_averages(rewards, max_q_values, q_values, scores_eval)\r\n                    exp_schedule.update(t)\r\n                    lr_schedule.update(t)\r\n                    if len(rewards) > 0:\r\n                        prog.update(t + 1, exact=[(\"Loss\", loss_eval), (\"Avg R\", self.avg_reward), \r\n                                        (\"Max R\", np.max(rewards)), (\"eps\", exp_schedule.epsilon), \r\n                                        (\"Grads\", grad_eval), (\"Max Q\", self.max_q), \r\n                                        (\"lr\", lr_schedule.epsilon)])\r\n\r\n                elif (t < self.config.learning_start) and (t % self.config.log_freq == 0):\r\n                    sys.stdout.write(\"\\rPopulating the memory {}/{}...\".format(t, \r\n                                                        self.config.learning_start))\r\n                    sys.stdout.flush()\r\n\r\n                # count reward\r\n                total_reward += reward\r\n                if done or t >= self.config.nsteps_train:\r\n                    break\r\n\r\n            # updates to perform at the end of an episode\r\n            rewards.append(total_reward)          \r\n\r\n            if (t > self.config.learning_start) and (last_eval > self.config.eval_freq):\r\n                # evaluate our policy\r\n                last_eval = 0\r\n                print(\"\")\r\n                scores_eval += [self.evaluate()]\r\n\r\n            if (t > self.config.learning_start) and self.config.record and (last_record > self.config.record_freq):\r\n                self.logger.info(\"Recording...\")\r\n                last_record =0\r\n                self.record()\r\n\r\n        # last words\r\n        self.logger.info(\"- Training done.\")\r\n        self.save()\r\n        
scores_eval += [self.evaluate()]\r\n        export_plot(scores_eval, \"Scores\", self.config.plot_output)\r\n\r\n\r\n    def train_step(self, t, replay_buffer, lr):\r\n        \"\"\"\r\n        Perform training step\r\n\r\n        Args:\r\n            t: (int) nth step\r\n            replay_buffer: buffer for sampling\r\n            lr: (float) learning rate\r\n        \"\"\"\r\n        loss_eval, grad_eval = 0, 0\r\n\r\n        # perform training step\r\n        if (t > self.config.learning_start and t % self.config.learning_freq == 0):\r\n            loss_eval, grad_eval = self.update_step(t, replay_buffer, lr)\r\n\r\n        # occasionally update target network with q network\r\n        if t % self.config.target_update_freq == 0:\r\n            self.update_target_params()\r\n            \r\n        # occasionally save the weights\r\n        if (t % self.config.saving_freq == 0):\r\n            self.save()\r\n\r\n        return loss_eval, grad_eval\r\n\r\n\r\n    def evaluate(self, env=None, num_episodes=None):\r\n        \"\"\"\r\n        Evaluation with same procedure as the training\r\n        \"\"\"\r\n        # log our activity only if default call\r\n        if num_episodes is None:\r\n            self.logger.info(\"Evaluating...\")\r\n\r\n        # arguments defaults\r\n        if num_episodes is None:\r\n            num_episodes = self.config.num_episodes_test\r\n\r\n        if env is None:\r\n            env = self.env\r\n\r\n        # replay memory to play\r\n        replay_buffer = ReplayBuffer(self.config.buffer_size, self.config.state_history)\r\n        rewards = []\r\n\r\n        for i in range(num_episodes):\r\n            total_reward = 0\r\n            state = env.reset()\r\n            while True:\r\n                if self.config.render_test: env.render()\r\n\r\n                # store last state in buffer\r\n                idx     = replay_buffer.store_frame(state)\r\n                q_input = replay_buffer.encode_recent_observation()\r\n\r\n 
               action = self.get_action(q_input)\r\n\r\n                # perform action in env\r\n                new_state, reward, done, info = env.step(action)\r\n\r\n                # store in replay memory\r\n                replay_buffer.store_effect(idx, action, reward, done)\r\n                state = new_state\r\n\r\n                # count reward\r\n                total_reward += reward\r\n                if done:\r\n                    break\r\n\r\n            # updates to perform at the end of an episode\r\n            rewards.append(total_reward)     \r\n\r\n        avg_reward = np.mean(rewards)\r\n        sigma_reward = np.sqrt(np.var(rewards) / len(rewards))\r\n\r\n        if num_episodes > 1:\r\n            msg = \"Average reward: {:04.2f} +/- {:04.2f}\".format(avg_reward, sigma_reward)\r\n            self.logger.info(msg)\r\n\r\n        return avg_reward\r\n\r\n\r\n    def record(self):\r\n        \"\"\"\r\n        Re create an env and record a video for one episode\r\n        \"\"\"\r\n        env = gym.make(self.config.env_name)\r\n        env = gym.wrappers.Monitor(env, self.config.record_path, video_callable=lambda x: True, resume=True)\r\n        env = MaxAndSkipEnv(env, skip=self.config.skip_frame)\r\n        env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1), \r\n                        overwrite_render=self.config.overwrite_render)\r\n        self.evaluate(env, 1)\r\n\r\n\r\n    def run(self, exp_schedule, lr_schedule):\r\n        \"\"\"\r\n        Apply procedures of training for a QN\r\n\r\n        Args:\r\n            exp_schedule: exploration strategy for epsilon\r\n            lr_schedule: schedule for learning rate\r\n        \"\"\"\r\n        # initialize\r\n        self.initialize()\r\n\r\n        # record one game at the beginning\r\n        if self.config.record:\r\n            self.record()\r\n\r\n        # model\r\n        self.train(exp_schedule, lr_schedule)\r\n\r\n        # record one game at the end\r\n        if 
self.config.record:\r\n            self.record()\r\n        \r\n"
  },
  {
    "path": "assignment2/q1_schedule.py",
    "content": "import numpy as np\r\nfrom utils.test_env import EnvTest\r\n\r\n\r\nclass LinearSchedule(object):\r\n    def __init__(self, eps_begin, eps_end, nsteps):\r\n        \"\"\"\r\n        Args:\r\n            eps_begin: initial exploration\r\n            eps_end: end exploration\r\n            nsteps: number of steps between the two values of eps\r\n        \"\"\"\r\n        self.epsilon        = eps_begin\r\n        self.eps_begin      = eps_begin\r\n        self.eps_end        = eps_end\r\n        self.nsteps         = nsteps\r\n\r\n\r\n    def update(self, t):\r\n        \"\"\"\r\n        Updates epsilon\r\n\r\n        Args:\r\n            t: (int) nth frame\r\n        \"\"\"\r\n        ##############################################################\r\n        \"\"\"\r\n        TODO: modify self.epsilon such that \r\n               for t = 0, self.epsilon = self.eps_begin\r\n               for t = self.nsteps, self.epsilon = self.eps_end\r\n               linear decay between the two\r\n\r\n              self.epsilon should never go under self.eps_end\r\n        \"\"\"\r\n        ##############################################################\r\n        ################ YOUR CODE HERE - 3-4 lines ################## \r\n\r\n        # linear interpolation from eps_begin (t = 0) to eps_end (t = nsteps),\r\n        # then hold at eps_end; float(t) avoids integer division on Python 2\r\n        if t <= self.nsteps:\r\n            self.epsilon = self.eps_begin + (self.eps_end - self.eps_begin) * float(t) / self.nsteps\r\n        else:\r\n            self.epsilon = self.eps_end\r\n        ##############################################################\r\n        ######################## END YOUR CODE ############## ########\r\n\r\n\r\nclass LinearExploration(LinearSchedule):\r\n    def __init__(self, env, eps_begin, eps_end, nsteps):\r\n        \"\"\"\r\n        Args:\r\n            env: gym environment\r\n            eps_begin: initial exploration\r\n            eps_end: end exploration\r\n            nsteps: number of steps between 
the two values of eps\r\n        \"\"\"\r\n        self.env = env\r\n        super(LinearExploration, self).__init__(eps_begin, eps_end, nsteps)\r\n\r\n\r\n    def get_action(self, best_action):\r\n        \"\"\"\r\n        Returns a random action with prob epsilon, otherwise return the best_action\r\n\r\n        Args:\r\n            best_action: (int) best action according some policy\r\n        Returns:\r\n            an action\r\n        \"\"\"\r\n        ##############################################################\r\n        \"\"\"\r\n        TODO: with probability self.epsilon, return a random action\r\n               else, return best_action\r\n\r\n               you can access the environment stored in self.env\r\n               and epsilon with self.epsilon\r\n        \"\"\"\r\n        ##############################################################\r\n        ################ YOUR CODE HERE - 4-5 lines ##################\r\n\r\n        temp = np.random.rand()\r\n        if temp < self.epsilon:\r\n            best_action = np.random.randint(self.env.action_space.n)\r\n        return best_action\r\n\r\n        ##############################################################\r\n        ######################## END YOUR CODE ############## ########\r\n\r\n\r\n\r\ndef test1():\r\n    env = EnvTest((5, 5, 1))\r\n    exp_strat = LinearExploration(env, 1, 0, 10)\r\n    \r\n    found_diff = False\r\n    for i in range(10):\r\n        rnd_act = exp_strat.get_action(0)\r\n        if rnd_act != 0 and rnd_act is not None:\r\n            found_diff = True\r\n    assert found_diff, \"Test 1 failed.\"\r\n    print(\"Test1: ok\")\r\n\r\n\r\ndef test2():\r\n    env = EnvTest((5, 5, 1))\r\n    exp_strat = LinearExploration(env, 1, 0, 10)\r\n    exp_strat.update(5)\r\n    assert exp_strat.epsilon == 0.5, \"Test 2 failed\"\r\n    print(\"Test2: ok\")\r\n\r\n\r\ndef test3():\r\n    env = EnvTest((5, 5, 1))\r\n    exp_strat = LinearExploration(env, 1, 0.5, 10)\r\n    
exp_strat.update(20)\r\n    assert exp_strat.epsilon == 0.5, \"Test 3 failed\"\r\n    print(\"Test3: ok\")\r\n\r\n\r\ndef your_test():\r\n    \"\"\"\r\n    Use this to implement your own tests\r\n    \"\"\"\r\n    pass\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    test1()\r\n    test2()\r\n    test3()\r\n    your_test()"
  },
  {
    "path": "assignment2/q2_linear.py",
    "content": "import tensorflow as tf\r\nimport tensorflow.contrib.layers as layers\r\n\r\nfrom utils.general import get_logger\r\nfrom utils.test_env import EnvTest\r\nfrom core.deep_q_learning import DQN\r\nfrom q1_schedule import LinearExploration, LinearSchedule\r\n\r\nfrom configs.q2_linear import config\r\n\r\n\r\nclass Linear(DQN):\r\n    \"\"\"\r\n    Implement Fully Connected with Tensorflow\r\n    \"\"\"\r\n    def add_placeholders_op(self):\r\n        \"\"\"\r\n        Adds placeholders to the graph\r\n\r\n        These placeholders are used as inputs by the rest of the model building and will be fed\r\n        data during training.  Note that when \"None\" is in a placeholder's shape, it's flexible\r\n        (so we can use different batch sizes without rebuilding the model\r\n        \"\"\"\r\n        # this information might be useful\r\n        # here, typically, a state shape is (80, 80, 1)\r\n        state_shape = list(self.env.observation_space.shape)\r\n\r\n        ##############################################################\r\n        \"\"\"\r\n        TODO: add placeholders:\r\n              Remember that we stack 4 consecutive frames together, ending up with an input of shape\r\n              (80, 80, 4).\r\n               - self.s: batch of states, type = uint8\r\n                         shape = (batch_size, img height, img width, nchannels x config.state_history)\r\n               - self.a: batch of actions, type = int32\r\n                         shape = (batch_size)\r\n               - self.r: batch of rewards, type = float32\r\n                         shape = (batch_size)\r\n               - self.sp: batch of next states, type = uint8\r\n                         shape = (batch_size, img height, img width, nchannels x config.state_history)\r\n               - self.done_mask: batch of done, type = bool\r\n                         shape = (batch_size)\r\n                         note that this placeholder contains bool = True only if 
we are done in \r\n                         the relevant transition\r\n               - self.lr: learning rate, type = float32\r\n        \r\n        (Don't change the variable names!)\r\n        \r\n        HINT: variables from config are accessible with self.config.variable_name\r\n              Also, you may want to use a dynamic dimension for the batch dimension.\r\n              Check the use of None for tensorflow placeholders.\r\n\r\n              you can also use the state_shape computed above.\r\n        \"\"\"\r\n        ##############################################################\r\n        ################YOUR CODE HERE (6-15 lines) ##################\r\n\r\n        img_height, img_width, nchannels = state_shape[0], state_shape[1], state_shape[2]\r\n        self.s = tf.placeholder(dtype=tf.uint8, shape=[None, img_height, img_width, nchannels*self.config.state_history],\r\n                                name='state')\r\n        self.a = tf.placeholder(dtype=tf.int32, shape=[None], name='action')\r\n        self.r = tf.placeholder(dtype=tf.float32, shape=[None], name='reward')\r\n        self.sp = tf.placeholder(dtype=tf.uint8, shape=[None, img_height, img_width, nchannels*self.config.state_history],\r\n                                 name='next_state')\r\n        self.done_mask = tf.placeholder(dtype=tf.bool, shape=[None], name='done_mask')\r\n        self.lr = tf.placeholder(dtype=tf.float32, shape=(), name='lr')\r\n\r\n        ##############################################################\r\n        ######################## END YOUR CODE #######################\r\n\r\n\r\n    def get_q_values_op(self, state, scope, reuse=False):\r\n        \"\"\"\r\n        Returns Q values for all actions\r\n\r\n        Args:\r\n            state: (tf tensor) \r\n                shape = (batch_size, img height, img width, nchannels)\r\n            scope: (string) scope name, that specifies if target network or not\r\n            reuse: (bool) reuse of variables in 
the scope\r\n\r\n        Returns:\r\n            out: (tf tensor) of shape = (batch_size, num_actions)\r\n        \"\"\"\r\n        # this information might be useful\r\n        num_actions = self.env.action_space.n\r\n        out = state\r\n\r\n        ##############################################################\r\n        \"\"\"\r\n        TODO: implement a fully connected layer with no hidden layer (linear\r\n            approximation) using tensorflow. In other words, if your state s\r\n            has a flattened shape of n, and you have m actions, the result of \r\n            your computation should be equal to\r\n                W s where W is a matrix of shape m x n\r\n\r\n        HINT: you may find tensorflow.contrib.layers useful (imported)\r\n              make sure to understand the use of the scope param\r\n\r\n              you can use any other methods from tensorflow\r\n              you are not allowed to import extra packages (like keras,\r\n              lasagne, caffe, etc.)\r\n        \"\"\"\r\n        ##############################################################\r\n        ################ YOUR CODE HERE - 2-3 lines ################## \r\n        \r\n        state_flatten = layers.flatten(state, scope=scope)\r\n        out = layers.fully_connected(state_flatten, num_actions, reuse=reuse, \r\n                                     scope=scope, activation_fn=None)\r\n\r\n        ##############################################################\r\n        ######################## END YOUR CODE #######################\r\n\r\n        return out\r\n\r\n\r\n    def add_update_target_op(self, q_scope, target_q_scope):\r\n        \"\"\"\r\n        update_target_op will be called periodically \r\n        to copy Q network weights to target Q network\r\n\r\n        Remember that in DQN, we maintain two identical Q networks with\r\n        2 different sets of weights. In tensorflow, we distinguish them\r\n        with two different scopes. 
One for the target network, one for the\r\n        regular network. If you're not familiar with the scope mechanism\r\n        in tensorflow, read the docs\r\n        https://www.tensorflow.org/programmers_guide/variable_scope\r\n\r\n        Periodically, we need to update all the weights of the target Q network \r\n        and assign them with the values from the regular network. Thus,\r\n        what we need to do is to build a tf op, that, when called, will \r\n        assign all variables in the target network scope with the values of \r\n        the corresponding variables of the regular network scope.\r\n    \r\n        Args:\r\n            q_scope: (string) name of the scope of variables for q\r\n            target_q_scope: (string) name of the scope of variables\r\n                        for the target network\r\n        \"\"\"\r\n        ##############################################################\r\n        \"\"\"\r\n        TODO: add an operator self.update_target_op that assigns variables\r\n            from target_q_scope with the values of the corresponding var \r\n            in q_scope\r\n\r\n        HINT: you may find the following functions useful:\r\n            - tf.get_collection #list\r\n            - tf.assign #return tensor\r\n            - tf.group\r\n\r\n        (be sure that you set self.update_target_op)\r\n        \"\"\"\r\n        ##############################################################\r\n        ################### YOUR CODE HERE - 5-10 lines #############\r\n        \r\n        # both collections list variables in creation order, so pairing them\r\n        # position by position matches each target variable with its q counterpart\r\n        q_collection = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=q_scope)\r\n        target_q_collection = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=target_q_scope)\r\n        op = [tf.assign(target_var, q_var) for target_var, q_var in zip(target_q_collection, q_collection)]\r\n        self.update_target_op = tf.group(*op)\r\n\r\n        ##############################################################\r\n        ######################## END 
YOUR CODE #######################\r\n\r\n\r\n    def add_loss_op(self, q, target_q):\r\n        \"\"\"\r\n        Sets the loss of a batch, self.loss is a scalar\r\n\r\n        Args:\r\n            q: (tf tensor) shape = (batch_size, num_actions)\r\n            target_q: (tf tensor) shape = (batch_size, num_actions)\r\n        \"\"\"\r\n        # you may need this variable\r\n        num_actions = self.env.action_space.n\r\n\r\n        ##############################################################\r\n        \"\"\"\r\n        TODO: The loss for an example is defined as:\r\n                Q_samp(s) = r if done\r\n                          = r + gamma * max_a' Q_target(s', a')\r\n                loss = (Q_samp(s) - Q(s, a))^2 \r\n\r\n              You need to compute the average of the loss over the minibatch\r\n              and store the resulting scalar into self.loss\r\n\r\n        HINT: - config variables are accessible through self.config\r\n              - you can access placeholders like self.a (for actions)\r\n                self.r (rewards) or self.done_mask for instance\r\n              - you may find the following functions useful\r\n                    - tf.cast\r\n                    - tf.reduce_max / reduce_sum\r\n                    - tf.one_hot\r\n                    - ...\r\n\r\n        (be sure that you set self.loss)\r\n        \"\"\"\r\n        ##############################################################\r\n        ##################### YOUR CODE HERE - 4-5 lines #############\r\n\r\n        #done = tf.cast(self.done_mask, tf.float32)\r\n        temp = self.r + self.config.gamma*tf.reduce_max(target_q, axis=1)\r\n        q_samp = tf.where(self.done_mask, self.r, temp)\r\n        action = tf.one_hot(self.a, num_actions)\r\n        q_new = tf.reduce_sum(tf.multiply(action,q), axis=1)\r\n        self.loss = tf.reduce_mean(tf.square(q_new - q_samp))\r\n\r\n        ##############################################################\r\n        
######################## END YOUR CODE #######################\r\n\r\n\r\n    def add_optimizer_op(self, scope):\r\n        \"\"\"\r\n        Set self.train_op and self.grad_norm\r\n        \"\"\"\r\n\r\n        ##############################################################\r\n        \"\"\"\r\n        TODO: 1. get Adam Optimizer (remember that we defined self.lr in the placeholders\r\n                section)\r\n              2. compute grads w.r.t. variables in scope for self.loss\r\n              3. clip the grads by norm with self.config.clip_val if self.config.grad_clip\r\n                is True\r\n              4. apply the gradients and store the train op in self.train_op\r\n               (sess.run(train_op) must update the variables)\r\n              5. compute the global norm of the gradients and store this scalar\r\n                in self.grad_norm\r\n\r\n        HINT: you may find the following functions useful\r\n            - tf.get_collection\r\n            - optimizer.compute_gradients\r\n            - tf.clip_by_norm\r\n            - optimizer.apply_gradients\r\n            - tf.global_norm\r\n             \r\n             you can access config variables by writing self.config.variable_name\r\n\r\n        (be sure that you set self.train_op and self.grad_norm)\r\n        \"\"\"\r\n        ##############################################################\r\n        #################### YOUR CODE HERE - 8-12 lines #############\r\n\r\n        optimizer = tf.train.AdamOptimizer(learning_rate=self.lr)\r\n        scope_variable = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope)\r\n        grads_and_vars = optimizer.compute_gradients(self.loss, scope_variable)\r\n        # default to the raw gradients so that train_op is defined even when\r\n        # gradient clipping is disabled\r\n        clipped_grads_and_vars = grads_and_vars\r\n        if self.config.grad_clip:\r\n            clipped_grads_and_vars = [(tf.clip_by_norm(grad, self.config.clip_val), var)\r\n                                      for grad, var in grads_and_vars]\r\n        self.train_op = optimizer.apply_gradients(clipped_grads_and_vars)\r\n        self.grad_norm = 
tf.global_norm([item[0] for item in grads_and_vars])\r\n        \r\n        ##############################################################\r\n        ######################## END YOUR CODE #######################\r\n    \r\n\r\n\r\nif __name__ == '__main__':\r\n    env = EnvTest((5, 5, 1))\r\n\r\n    # exploration strategy\r\n    exp_schedule = LinearExploration(env, config.eps_begin, \r\n            config.eps_end, config.eps_nsteps)\r\n\r\n    # learning rate schedule\r\n    lr_schedule  = LinearSchedule(config.lr_begin, config.lr_end,\r\n            config.lr_nsteps)\r\n\r\n    # train model\r\n    model = Linear(env, config)\r\n    model.run(exp_schedule, lr_schedule)\r\n"
  },
  {
    "path": "assignment2/q3_nature.py",
    "content": "import tensorflow as tf\r\nimport tensorflow.contrib.layers as layers\r\n\r\nfrom utils.general import get_logger\r\nfrom utils.test_env import EnvTest\r\nfrom q1_schedule import LinearExploration, LinearSchedule\r\nfrom q2_linear import Linear\r\n\r\n\r\nfrom configs.q3_nature import config\r\n\r\n\r\nclass NatureQN(Linear):\r\n    \"\"\"\r\n    Implementing DeepMind's Nature paper. Here are the relevant urls.\r\n    https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf\r\n    https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf\r\n    \"\"\"\r\n    def get_q_values_op(self, state, scope, reuse=False):\r\n        \"\"\"\r\n        Returns Q values for all actions\r\n\r\n        Args:\r\n            state: (tf tensor) \r\n                shape = (batch_size, img height, img width, nchannels)\r\n            scope: (string) scope name, that specifies if target network or not\r\n            reuse: (bool) reuse of variables in the scope\r\n\r\n        Returns:\r\n            out: (tf tensor) of shape = (batch_size, num_actions)\r\n        \"\"\"\r\n        # this information might be useful\r\n        num_actions = self.env.action_space.n\r\n        out = state\r\n        ##############################################################\r\n        \"\"\"\r\n        TODO: implement the computation of Q values like in the paper\r\n                https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf\r\n                https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf\r\n\r\n              you may find the section \"model architecture\" of the appendix of the \r\n              nature paper particulary useful.\r\n\r\n              store your result in out of shape = (batch_size, num_actions)\r\n\r\n        HINT: you may find tensorflow.contrib.layers useful (imported)\r\n              make sure to understand the use of the scope param\r\n\r\n              you can use any other methods from 
tensorflow\r\n              you are not allowed to import extra packages (like keras,\r\n              lasagne, cafe, etc.)\r\n\r\n        \"\"\"\r\n        ##############################################################\r\n        ################ YOUR CODE HERE - 10-15 lines ################ \r\n\r\n        with tf.variable_scope(scope, reuse=reuse) as _:\r\n            out = layers.conv2d(out, num_outputs=32, kernel_size=8, stride=4)\r\n            out = layers.conv2d(out, num_outputs=64, kernel_size=4, stride=2)\r\n            out = layers.conv2d(out, num_outputs=64, kernel_size=3, stride=1)\r\n            out = layers.flatten(out)\r\n            out = layers.fully_connected(out, num_outputs=512)\r\n            out = layers.fully_connected(out, num_outputs=num_actions, activation_fn=None) \r\n            \r\n        ##############################################################\r\n        ######################## END YOUR CODE #######################\r\n        return out\r\n\r\n\r\n\"\"\"\r\nUse deep Q network for test environment.\r\n\"\"\"\r\nif __name__ == '__main__':\r\n    env = EnvTest((80, 80, 1))\r\n\r\n    # exploration strategy\r\n    exp_schedule = LinearExploration(env, config.eps_begin, \r\n            config.eps_end, config.eps_nsteps)\r\n\r\n    # learning rate schedule\r\n    lr_schedule  = LinearSchedule(config.lr_begin, config.lr_end,\r\n            config.lr_nsteps)\r\n\r\n    # train model\r\n    model = NatureQN(env, config)\r\n    model.run(exp_schedule, lr_schedule)\r\n"
  },
  {
    "path": "assignment2/q4_train_atari_linear.py",
    "content": "import gym\r\nfrom utils.preprocess import greyscale\r\nfrom utils.wrappers import PreproWrapper, MaxAndSkipEnv\r\n\r\nfrom q1_schedule import LinearExploration, LinearSchedule\r\nfrom q2_linear import Linear\r\n\r\nfrom configs.q4_train_atari_linear import config\r\n\r\n\"\"\"\r\nUse linear approximation for the Atari game. Please report the final result.\r\nFeel free to change the configurations (in the configs/ folder). \r\nIf so, please report your hyperparameters.\r\n\r\nYou'll find the results, log and video recordings of your agent every 250k under\r\nthe corresponding file in the results folder. A good way to monitor the progress\r\nof the training is to use Tensorboard. The starter code writes summaries of different\r\nvariables.\r\n\r\nTo launch tensorboard, open a Terminal window and run \r\ntensorboard --logdir=results/\r\nThen, connect remotely to \r\naddress-ip-of-the-server:6006 \r\n6006 is the default port used by tensorboard.\r\n\"\"\"\r\nif __name__ == '__main__':\r\n    # make env\r\n    env = gym.make(config.env_name)\r\n    env = MaxAndSkipEnv(env, skip=config.skip_frame)\r\n    env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1), \r\n                        overwrite_render=config.overwrite_render)\r\n\r\n    # exploration strategy\r\n    exp_schedule = LinearExploration(env, config.eps_begin, \r\n            config.eps_end, config.eps_nsteps)\r\n\r\n    # learning rate schedule\r\n    lr_schedule  = LinearSchedule(config.lr_begin, config.lr_end,\r\n            config.lr_nsteps)\r\n\r\n    # train model\r\n    model = Linear(env, config)\r\n    model.run(exp_schedule, lr_schedule)\r\n"
  },
  {
    "path": "assignment2/q5_train_atari_nature.py",
    "content": "import gym\r\nfrom utils.preprocess import greyscale\r\nfrom utils.wrappers import PreproWrapper, MaxAndSkipEnv\r\n\r\nfrom q1_schedule import LinearExploration, LinearSchedule\r\nfrom q3_nature import NatureQN\r\n\r\nfrom configs.q5_train_atari_nature import config\r\n\r\n\"\"\"\r\nUse deep Q network for the Atari game. Please report the final result.\r\nFeel free to change the configurations (in the configs/ folder). \r\nIf so, please report your hyperparameters.\r\n\r\nYou'll find the results, log and video recordings of your agent every 250k under\r\nthe corresponding file in the results folder. A good way to monitor the progress\r\nof the training is to use Tensorboard. The starter code writes summaries of different\r\nvariables.\r\n\r\nTo launch tensorboard, open a Terminal window and run \r\ntensorboard --logdir=results/\r\nThen, connect remotely to \r\naddress-ip-of-the-server:6006 \r\n6006 is the default port used by tensorboard.\r\n\"\"\"\r\nif __name__ == '__main__':\r\n    # make env\r\n    env = gym.make(config.env_name)\r\n    env = MaxAndSkipEnv(env, skip=config.skip_frame)\r\n    env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1), \r\n                        overwrite_render=config.overwrite_render)\r\n\r\n    # exploration strategy\r\n    exp_schedule = LinearExploration(env, config.eps_begin, \r\n            config.eps_end, config.eps_nsteps)\r\n\r\n    # learning rate schedule\r\n    lr_schedule  = LinearSchedule(config.lr_begin, config.lr_end,\r\n            config.lr_nsteps)\r\n\r\n    # train model\r\n    model = NatureQN(env, config)\r\n    model.run(exp_schedule, lr_schedule)\r\n"
  },
  {
    "path": "assignment2/q6_double_q_learning.py",
    "content": "import gym\r\nfrom utils.preprocess import greyscale\r\nfrom utils.wrappers import PreproWrapper, MaxAndSkipEnv\r\n\r\nimport tensorflow as tf\r\nimport tensorflow.contrib.layers as layers\r\n\r\nfrom utils.general import get_logger\r\nfrom utils.test_env import EnvTest\r\nfrom q1_schedule import LinearExploration, LinearSchedule\r\nfrom q2_linear import Linear\r\nfrom q3_nature import NatureQN\r\n\r\nfrom configs.q6_bonus_question import config\r\n\r\n\r\nclass MyDQN(NatureQN):\r\n    \"\"\"\r\n    Going beyond - implement your own Deep Q Network to find the perfect\r\n    balance between depth, complexity, number of parameters, etc.\r\n    You can change the way the q-values are computed, the exploration\r\n    strategy, or the learning rate schedule. You can also create your own\r\n    wrapper of environment and transform your input to something that you\r\n    think we'll help to solve the task. Ideally, your network would run faster\r\n    than DeepMind's and achieve similar performance!\r\n\r\n    You can also change the optimizer (by overriding the functions defined\r\n    in TFLinear), or even change the sampling strategy from the replay buffer.\r\n\r\n    If you prefer not to build on the current architecture, you're welcome to\r\n    write your own code.\r\n\r\n    You may also try more recent approaches, like double Q learning\r\n    (see https://arxiv.org/pdf/1509.06461.pdf) or dueling networks \r\n    (see https://arxiv.org/abs/1511.06581), but this would be for extra\r\n    extra bonus points.\r\n    \"\"\"\r\n    def add_loss_op(self, q, target_q):\r\n        \"\"\"\r\n        Sets the loss of a batch, self.loss is a scalar\r\n\r\n        Args:\r\n            q: (tf tensor) shape = (batch_size, num_actions)\r\n            target_q: (tf tensor) shape = (batch_size, num_actions)\r\n        \"\"\"\r\n        # you may need this variable\r\n        num_actions = self.env.action_space.n\r\n\r\n        
##############################################################\r\n        \"\"\"\r\n        TODO: The loss for an example is defined as:\r\n                Q_samp(s) = r if done\r\n                          = r + gamma * Q_target(s', argmax_a' Q(s', a'))\r\n                loss = (Q_samp(s) - Q(s, a))^2 \r\n\r\n              You need to compute the average of the loss over the minibatch\r\n              and store the resulting scalar into self.loss\r\n\r\n        HINT: - config variables are accessible through self.config\r\n              - you can access placeholders like self.a (for actions)\r\n                self.r (rewards) or self.done_mask for instance\r\n              - you may find the following functions useful\r\n                    - tf.cast\r\n                    - tf.reduce_max / reduce_sum\r\n                    - tf.one_hot\r\n                    - ...\r\n\r\n        (be sure that you set self.loss)\r\n        \"\"\"\r\n        ##############################################################\r\n        ##################### YOUR CODE HERE - 4-5 lines #############\r\n\r\n        # NOTE: q holds Q(s, .) for the current state (it is indexed with self.a\r\n        # below), so this argmax is taken at s rather than s'. The double\r\n        # Q-learning target in the TODO selects the action with the online\r\n        # network evaluated at s'.\r\n        idx = tf.argmax(q, axis=1)\r\n        idx_one_hot = tf.one_hot(idx, num_actions)\r\n        temp = self.r + self.config.gamma*tf.reduce_sum(tf.multiply(target_q, idx_one_hot), axis=1)\r\n        q_samp = tf.where(self.done_mask, self.r, temp)\r\n        action = tf.one_hot(self.a, num_actions)\r\n        q_new = tf.reduce_sum(tf.multiply(action,q), axis=1)\r\n        self.loss = tf.reduce_mean(tf.square(q_new - q_samp))\r\n\r\n        ##############################################################\r\n        ######################## END YOUR CODE #######################\r\n\r\n\"\"\"\r\nUse a different architecture for the Atari game. Please report the final result.\r\nFeel free to change the configuration. 
If so, please report your hyperparameters.\r\n\"\"\"\r\nif __name__ == '__main__':\r\n    # make env\r\n    env = gym.make(config.env_name)\r\n    env = MaxAndSkipEnv(env, skip=config.skip_frame)\r\n    env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1), \r\n                        overwrite_render=config.overwrite_render)\r\n\r\n    # exploration strategy\r\n    # you may want to modify this schedule\r\n    exp_schedule = LinearExploration(env, config.eps_begin, \r\n            config.eps_end, config.eps_nsteps)\r\n\r\n    # you may want to modify this schedule\r\n    # learning rate schedule\r\n    lr_schedule  = LinearSchedule(config.lr_begin, config.lr_end,\r\n            config.lr_nsteps)\r\n\r\n    # train model\r\n    model = MyDQN(env, config)\r\n    model.run(exp_schedule, lr_schedule)\r\n"
  },
  {
    "path": "assignment2/q6_dueling.py",
    "content": "import gym\r\nfrom utils.preprocess import greyscale\r\nfrom utils.wrappers import PreproWrapper, MaxAndSkipEnv\r\n\r\nimport tensorflow as tf\r\nimport tensorflow.contrib.layers as layers\r\n\r\nfrom utils.general import get_logger\r\nfrom utils.test_env import EnvTest\r\nfrom q1_schedule import LinearExploration, LinearSchedule\r\nfrom q2_linear import Linear\r\n\r\nfrom configs.q6_bonus_question import config\r\n\r\n\r\nclass MyDQN(Linear):\r\n    \"\"\"\r\n    Going beyond - implement your own Deep Q Network to find the perfect\r\n    balance between depth, complexity, number of parameters, etc.\r\n    You can change the way the q-values are computed, the exploration\r\n    strategy, or the learning rate schedule. You can also create your own\r\n    wrapper of environment and transform your input to something that you\r\n    think we'll help to solve the task. Ideally, your network would run faster\r\n    than DeepMind's and achieve similar performance!\r\n\r\n    You can also change the optimizer (by overriding the functions defined\r\n    in TFLinear), or even change the sampling strategy from the replay buffer.\r\n\r\n    If you prefer not to build on the current architecture, you're welcome to\r\n    write your own code.\r\n\r\n    You may also try more recent approaches, like double Q learning\r\n    (see https://arxiv.org/pdf/1509.06461.pdf) or dueling networks \r\n    (see https://arxiv.org/abs/1511.06581), but this would be for extra\r\n    extra bonus points.\r\n    \"\"\"\r\n    def get_q_values_op(self, state, scope, reuse=False):\r\n        \"\"\"\r\n        Returns Q values for all actions\r\n\r\n        Args:\r\n            state: (tf tensor) \r\n                shape = (batch_size, img height, img width, nchannels)\r\n            scope: (string) scope name, that specifies if target network or not\r\n            reuse: (bool) reuse of variables in the scope\r\n\r\n        Returns:\r\n            out: (tf tensor) of shape = 
(batch_size, num_actions)\r\n        \"\"\"\r\n        # this information might be useful\r\n        num_actions = self.env.action_space.n\r\n        out = state\r\n        ##############################################################\r\n        \"\"\"\r\n        TODO: implement the computation of Q values like in the paper\r\n                    https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf\r\n\r\n        HINT: you may find tensorflow.contrib.layers useful (imported)\r\n              make sure to understand the use of the scope param\r\n\r\n              you can use any other methods from tensorflow\r\n              you are not allowed to import extra packages (like keras,\r\n              lasagne, caffe, etc.)\r\n       \r\n        L1: 32 8x8 filters with stride 4  +  RELU\r\n        L2: 64 4x4 filters with stride 2  +  RELU\r\n        L3: 64 3x3 filters with stride 1  +  RELU\r\n        L4a: 512 unit Fully-Connected layer  +  RELU\r\n        L4b: 512 unit Fully-Connected layer  +  RELU\r\n        L5a: 1 unit FC  (State Value)\r\n        L5b: #actions FC (Advantage Value)\r\n        L6: Aggregate V(s) + (A(s,a) - mean over actions of A(s,a))\r\n        \"\"\"\r\n        ##############################################################\r\n        ################ YOUR CODE HERE - 10-15 lines ################ \r\n        \r\n        with tf.variable_scope(scope, reuse=reuse) as _:\r\n            out = layers.conv2d(out, num_outputs=32, kernel_size=8, stride=4)\r\n            out = layers.conv2d(out, num_outputs=64, kernel_size=4, stride=2)\r\n            out = layers.conv2d(out, num_outputs=64, kernel_size=3, stride=1)\r\n            out = layers.flatten(out)\r\n            # L4a/L4b: separate 512-unit hidden layers for the value and advantage streams\r\n            value_hidden = layers.fully_connected(out, num_outputs=512)\r\n            adv_hidden = layers.fully_connected(out, num_outputs=512)\r\n            # L5a: state value V(s); L5b: advantages A(s,a)\r\n            out1 = layers.fully_connected(value_hidden, num_outputs=1, activation_fn=None)\r\n            out2 = layers.fully_connected(adv_hidden, num_outputs=num_actions, activation_fn=None)\r\n            # L6: subtract the mean advantage, then add V(s)\r\n            out = out2 - tf.tile(tf.expand_dims(tf.reduce_mean(out2, axis=1),-1), 
[1,num_actions])\r\n            out = out + tf.tile(out1, [1,num_actions])\r\n            \r\n        ##############################################################\r\n        ######################## END YOUR CODE #######################\r\n        return out\r\n\r\n\r\n\"\"\"\r\nUse a different architecture for the Atari game. Please report the final result.\r\nFeel free to change the configuration. If so, please report your hyperparameters.\r\n\"\"\"\r\nif __name__ == '__main__':\r\n    # make env\r\n    env = gym.make(config.env_name)\r\n    env = MaxAndSkipEnv(env, skip=config.skip_frame)\r\n    env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1), \r\n                        overwrite_render=config.overwrite_render)\r\n\r\n    # exploration strategy\r\n    # you may want to modify this schedule\r\n    exp_schedule = LinearExploration(env, config.eps_begin, \r\n            config.eps_end, config.eps_nsteps)\r\n\r\n    # you may want to modify this schedule\r\n    # learning rate schedule\r\n    lr_schedule  = LinearSchedule(config.lr_begin, config.lr_end,\r\n            config.lr_nsteps)\r\n\r\n    # train model\r\n    model = MyDQN(env, config)\r\n    model.run(exp_schedule, lr_schedule)\r\n"
  },
  {
    "path": "assignment2/requirements.txt",
    "content": "matplotlib\r\nnumpy\r\nsix"
  },
  {
    "path": "assignment2/results/q2_linear/log.txt",
    "content": "2017-11-28 20:52:49,822:INFO: Evaluating...\n2017-11-28 20:52:50,064:INFO: Average reward: -0.50 +/- 0.00\n2017-11-28 20:52:50,983:INFO: Evaluating...\n2017-11-28 20:52:51,013:INFO: Average reward: -0.50 +/- 0.00\n2017-11-28 20:52:51,772:INFO: Evaluating...\n2017-11-28 20:52:51,803:INFO: Average reward: -0.10 +/- 0.00\n2017-11-28 20:52:52,561:INFO: Evaluating...\n2017-11-28 20:52:52,592:INFO: Average reward: -0.10 +/- 0.00\n2017-11-28 20:52:53,356:INFO: Evaluating...\n2017-11-28 20:52:53,386:INFO: Average reward: -0.30 +/- 0.00\n2017-11-28 20:52:54,208:INFO: Evaluating...\n2017-11-28 20:52:54,240:INFO: Average reward: -0.10 +/- 0.00\n2017-11-28 20:52:54,996:INFO: Evaluating...\n2017-11-28 20:52:55,026:INFO: Average reward: -0.10 +/- 0.00\n2017-11-28 20:52:55,779:INFO: Evaluating...\n2017-11-28 20:52:55,809:INFO: Average reward: -0.50 +/- 0.00\n2017-11-28 20:52:56,576:INFO: Evaluating...\n2017-11-28 20:52:56,604:INFO: Average reward: -0.10 +/- 0.00\n2017-11-28 20:52:57,366:INFO: Evaluating...\n2017-11-28 20:52:57,394:INFO: Average reward: -0.10 +/- 0.00\n2017-11-28 20:52:58,138:INFO: - Training done.\n2017-11-28 20:52:58,161:INFO: Evaluating...\n2017-11-28 20:52:58,194:INFO: Average reward: -0.10 +/- 0.00\n2017-11-28 21:10:09,597:INFO: Evaluating...\n2017-11-28 21:10:09,634:INFO: Average reward: -0.30 +/- 0.00\n2017-11-28 21:10:10,317:INFO: Evaluating...\n2017-11-28 21:10:10,347:INFO: Average reward: 0.50 +/- 0.00\n2017-11-28 21:10:11,113:INFO: Evaluating...\n2017-11-28 21:10:11,145:INFO: Average reward: 0.10 +/- 0.00\n2017-11-28 21:10:11,894:INFO: Evaluating...\n2017-11-28 21:10:11,925:INFO: Average reward: -0.10 +/- 0.00\n2017-11-28 21:10:12,685:INFO: Evaluating...\n2017-11-28 21:10:12,717:INFO: Average reward: 0.50 +/- 0.00\n2017-11-28 21:10:13,506:INFO: Evaluating...\n2017-11-28 21:10:13,539:INFO: Average reward: 1.90 +/- 0.00\n2017-11-28 21:10:14,291:INFO: Evaluating...\n2017-11-28 21:10:14,322:INFO: Average reward: 2.10 +/- 0.00\n2017-11-28 
21:10:15,084:INFO: Evaluating...\n2017-11-28 21:10:15,114:INFO: Average reward: 2.00 +/- 0.00\n2017-11-28 21:10:15,876:INFO: Evaluating...\n2017-11-28 21:10:15,907:INFO: Average reward: 2.10 +/- 0.00\n2017-11-28 21:10:16,665:INFO: Evaluating...\n2017-11-28 21:10:16,695:INFO: Average reward: 2.10 +/- 0.00\n2017-11-28 21:10:17,432:INFO: - Training done.\n2017-11-28 21:10:17,453:INFO: Evaluating...\n2017-11-28 21:10:17,486:INFO: Average reward: 2.10 +/- 0.00\n"
  },
  {
    "path": "assignment2/results/q2_linear/model.weights/checkpoint",
    "content": "model_checkpoint_path: \".\"\nall_model_checkpoint_paths: \".\"\n"
  },
  {
    "path": "assignment2/results/q3_nature/log.txt",
    "content": "2017-11-28 21:36:35,366:INFO: Evaluating...\n2017-11-28 21:36:35,752:INFO: Average reward: 0.00 +/- 0.00\n2017-11-28 21:36:36,569:INFO: Evaluating...\n2017-11-28 21:36:36,868:INFO: Average reward: -0.50 +/- 0.00\n2017-11-28 21:36:40,918:INFO: Evaluating...\n2017-11-28 21:36:41,207:INFO: Average reward: 0.00 +/- 0.00\n2017-11-28 21:36:45,230:INFO: Evaluating...\n2017-11-28 21:36:45,520:INFO: Average reward: 0.50 +/- 0.00\n2017-11-28 21:36:49,710:INFO: Evaluating...\n2017-11-28 21:36:50,002:INFO: Average reward: 2.00 +/- 0.00\n2017-11-28 21:36:54,073:INFO: Evaluating...\n2017-11-28 21:36:54,361:INFO: Average reward: 2.00 +/- 0.00\n2017-11-28 21:36:58,412:INFO: Evaluating...\n2017-11-28 21:36:58,698:INFO: Average reward: 2.00 +/- 0.00\n2017-11-28 21:37:02,752:INFO: Evaluating...\n2017-11-28 21:37:03,044:INFO: Average reward: 2.10 +/- 0.00\n2017-11-28 21:37:07,233:INFO: Evaluating...\n2017-11-28 21:37:07,513:INFO: Average reward: 2.10 +/- 0.00\n2017-11-28 21:37:09,855:INFO: - Training done.\n2017-11-28 21:37:09,959:INFO: Evaluating...\n2017-11-28 21:37:10,247:INFO: Average reward: 2.10 +/- 0.00\n"
  },
  {
    "path": "assignment2/results/q3_nature/model.weights/checkpoint",
    "content": "model_checkpoint_path: \".\"\nall_model_checkpoint_paths: \".\"\n"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/log.txt",
    "content": "2017-11-29 16:06:16,994:INFO: Making new env: Pong-v0\n2017-11-29 16:06:17,179:INFO: Creating monitor directory results/q4_train_atari_linear/monitor/\n2017-11-29 16:06:17,187:INFO: Starting new video recorder writing to /home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.5469.video000000.mp4\n2017-11-29 16:06:18,628:INFO: Finished writing results. You can upload them to the scoreboard via gym.upload('/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor')\n2017-11-29 16:06:18,629:INFO: Evaluating...\n2017-11-29 16:07:00,357:INFO: Average reward: -20.98 +/- 0.02\n2017-11-29 16:30:31,583:INFO: Evaluating...\n2017-11-30 12:01:58,705:INFO: Making new env: Pong-v0\n2017-11-30 12:01:58,917:INFO: Starting new video recorder writing to /home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.3758.video000000.mp4\n2017-11-30 12:02:01,397:INFO: Finished writing results. You can upload them to the scoreboard via gym.upload('/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor')\n2017-11-30 12:02:01,397:INFO: Evaluating...\n2017-11-30 12:02:40,550:INFO: Average reward: -20.98 +/- 0.02\n2017-11-30 14:37:22,473:INFO: Making new env: Pong-v0\n2017-11-30 14:37:22,717:INFO: Starting new video recorder writing to /home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.2799.video000000.mp4\n2017-11-30 14:37:26,391:INFO: Finished writing results. You can upload them to the scoreboard via gym.upload('/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor')\n2017-11-30 14:37:26,392:INFO: Evaluating...\n2017-11-30 14:38:24,987:INFO: Average reward: -20.90 +/- 0.06\n2017-11-30 15:03:46,854:INFO: Evaluating...\n"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/model.weights/checkpoint",
    "content": "model_checkpoint_path: \".\"\nall_model_checkpoint_paths: \".\"\n"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.episode_batch.0.2799.stats.json",
    "content": "{\"timestamps\": [1512023846.375136], \"initial_reset_timestamp\": 1512023842.709429, \"episode_types\": [\"t\"], \"episode_lengths\": [1254], \"episode_rewards\": [-21.0]}"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.episode_batch.0.3758.stats.json",
    "content": "{\"timestamps\": [1512014521.383348], \"initial_reset_timestamp\": 1512014518.909645, \"episode_types\": [\"t\"], \"episode_lengths\": [1005], \"episode_rewards\": [-21.0]}"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.episode_batch.0.5469.stats.json",
    "content": "{\"timestamps\": [1511942778.615624], \"initial_reset_timestamp\": 1511942777.179417, \"episode_types\": [\"t\"], \"episode_lengths\": [1195], \"episode_rewards\": [-21.0]}"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.manifest.0.2799.manifest.json",
    "content": "{\"env_info\": {\"env_id\": \"Pong-v0\", \"gym_version\": \"0.9.3\"}, \"stats\": \"openaigym.episode_batch.0.2799.stats.json\", \"videos\": [[\"openaigym.video.0.2799.video000000.mp4\", \"openaigym.video.0.2799.video000000.meta.json\"]]}"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.manifest.0.3758.manifest.json",
    "content": "{\"env_info\": {\"env_id\": \"Pong-v0\", \"gym_version\": \"0.9.3\"}, \"stats\": \"openaigym.episode_batch.0.3758.stats.json\", \"videos\": [[\"openaigym.video.0.3758.video000000.mp4\", \"openaigym.video.0.3758.video000000.meta.json\"]]}"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.manifest.0.5469.manifest.json",
    "content": "{\"env_info\": {\"env_id\": \"Pong-v0\", \"gym_version\": \"0.9.3\"}, \"stats\": \"openaigym.episode_batch.0.5469.stats.json\", \"videos\": [[\"openaigym.video.0.5469.video000000.mp4\", \"openaigym.video.0.5469.video000000.meta.json\"]]}"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.2799.video000000.meta.json",
    "content": "{\"encoder_version\": {\"cmdline\": [\"avconv\", \"-nostats\", \"-loglevel\", \"error\", \"-y\", \"-r\", \"30\", \"-f\", \"rawvideo\", \"-s:v\", \"160x210\", \"-pix_fmt\", \"rgb24\", \"-i\", \"-\", \"-vcodec\", \"libx264\", \"-pix_fmt\", \"yuv420p\", \"/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.2799.video000000.mp4\"], \"version\": \"avconv version 9.18-6:9.18-0ubuntu0.14.04.1, Copyright (c) 2000-2014 the Libav developers\\n  built on Mar 16 2015 13:19:10 with gcc 4.8 (Ubuntu 4.8.2-19ubuntu1)\\navconv 9.18-6:9.18-0ubuntu0.14.04.1\\nlibavutil     52.  3. 0 / 52.  3. 0\\nlibavcodec    54. 35. 0 / 54. 35. 0\\nlibavformat   54. 20. 4 / 54. 20. 4\\nlibavdevice   53.  2. 0 / 53.  2. 0\\nlibavfilter    3.  3. 0 /  3.  3. 0\\nlibavresample  1.  0. 1 /  1.  0. 1\\nlibswscale     2.  1. 1 /  2.  1. 1\\n\", \"backend\": \"avconv\"}, \"content_type\": \"video/mp4\", \"episode_id\": 0}"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.3758.video000000.meta.json",
    "content": "{\"encoder_version\": {\"cmdline\": [\"avconv\", \"-nostats\", \"-loglevel\", \"error\", \"-y\", \"-r\", \"30\", \"-f\", \"rawvideo\", \"-s:v\", \"160x210\", \"-pix_fmt\", \"rgb24\", \"-i\", \"-\", \"-vcodec\", \"libx264\", \"-pix_fmt\", \"yuv420p\", \"/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.3758.video000000.mp4\"], \"version\": \"avconv version 9.18-6:9.18-0ubuntu0.14.04.1, Copyright (c) 2000-2014 the Libav developers\\n  built on Mar 16 2015 13:19:10 with gcc 4.8 (Ubuntu 4.8.2-19ubuntu1)\\navconv 9.18-6:9.18-0ubuntu0.14.04.1\\nlibavutil     52.  3. 0 / 52.  3. 0\\nlibavcodec    54. 35. 0 / 54. 35. 0\\nlibavformat   54. 20. 4 / 54. 20. 4\\nlibavdevice   53.  2. 0 / 53.  2. 0\\nlibavfilter    3.  3. 0 /  3.  3. 0\\nlibavresample  1.  0. 1 /  1.  0. 1\\nlibswscale     2.  1. 1 /  2.  1. 1\\n\", \"backend\": \"avconv\"}, \"content_type\": \"video/mp4\", \"episode_id\": 0}"
  },
  {
    "path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.5469.video000000.meta.json",
    "content": "{\"encoder_version\": {\"cmdline\": [\"avconv\", \"-nostats\", \"-loglevel\", \"error\", \"-y\", \"-r\", \"30\", \"-f\", \"rawvideo\", \"-s:v\", \"160x210\", \"-pix_fmt\", \"rgb24\", \"-i\", \"-\", \"-vcodec\", \"libx264\", \"-pix_fmt\", \"yuv420p\", \"/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.5469.video000000.mp4\"], \"version\": \"avconv version 9.18-6:9.18-0ubuntu0.14.04.1, Copyright (c) 2000-2014 the Libav developers\\n  built on Mar 16 2015 13:19:10 with gcc 4.8 (Ubuntu 4.8.2-19ubuntu1)\\navconv 9.18-6:9.18-0ubuntu0.14.04.1\\nlibavutil     52.  3. 0 / 52.  3. 0\\nlibavcodec    54. 35. 0 / 54. 35. 0\\nlibavformat   54. 20. 4 / 54. 20. 4\\nlibavdevice   53.  2. 0 / 53.  2. 0\\nlibavfilter    3.  3. 0 /  3.  3. 0\\nlibavresample  1.  0. 1 /  1.  0. 1\\nlibswscale     2.  1. 1 /  2.  1. 1\\n\", \"backend\": \"avconv\"}, \"content_type\": \"video/mp4\", \"episode_id\": 0}"
  },
  {
    "path": "assignment2/utils/__init__.py",
    "content": ""
  },
  {
    "path": "assignment2/utils/general.py",
    "content": "import time\r\nimport sys\r\nimport logging\r\nimport numpy as np\r\nfrom collections import deque\r\nimport matplotlib\r\nmatplotlib.use('agg')\r\nimport matplotlib.pyplot as plt\r\n\r\n\r\ndef export_plot(ys, ylabel, filename):\r\n    \"\"\"\r\n    Export a plot in filename\r\n\r\n    Args:\r\n        ys: (list) of float / int to plot\r\n        filename: (string) directory\r\n    \"\"\"\r\n    plt.figure()\r\n    plt.plot(range(len(ys)), ys)\r\n    plt.xlabel(\"Epoch\")\r\n    plt.ylabel(ylabel)\r\n    plt.savefig(filename)\r\n    plt.close()\r\n\r\n\r\ndef get_logger(filename):\r\n    \"\"\"\r\n    Return a logger instance to a file\r\n    \"\"\"\r\n    logger = logging.getLogger('logger')\r\n    logger.setLevel(logging.DEBUG)\r\n    logging.basicConfig(format='%(message)s', level=logging.DEBUG)\r\n    handler = logging.FileHandler(filename)\r\n    handler.setLevel(logging.DEBUG)\r\n    handler.setFormatter(logging.Formatter('%(asctime)s:%(levelname)s: %(message)s'))\r\n    logging.getLogger().addHandler(handler)\r\n    return logger\r\n\r\n\r\nclass Progbar(object):\r\n    \"\"\"Progbar class copied from keras (https://github.com/fchollet/keras/)\r\n    \r\n    Displays a progress bar.\r\n    Small edit : added strict arg to update\r\n    # Arguments\r\n        target: Total number of steps expected.\r\n        interval: Minimum visual progress update interval (in seconds).\r\n    \"\"\"\r\n\r\n    def __init__(self, target, width=30, verbose=1, discount=0.9):\r\n        self.width = width\r\n        self.target = target\r\n        self.sum_values = {}\r\n        self.exp_avg = {}\r\n        self.unique_values = []\r\n        self.start = time.time()\r\n        self.total_width = 0\r\n        self.seen_so_far = 0\r\n        self.verbose = verbose\r\n        self.discount = discount\r\n\r\n    def update(self, current, values=[], exact=[], strict=[], exp_avg=[]):\r\n        \"\"\"\r\n        Updates the progress bar.\r\n        # Arguments\r\n   
         current: Index of current step.\r\n            values: List of tuples (name, value_for_last_step).\r\n                The progress bar will display averages for these values.\r\n            exact: List of tuples (name, value_for_last_step).\r\n                The progress bar will display these values directly.\r\n        \"\"\"\r\n\r\n        for k, v in values:\r\n            if k not in self.sum_values:\r\n                self.sum_values[k] = [v * (current - self.seen_so_far), current - self.seen_so_far]\r\n                self.unique_values.append(k)\r\n            else:\r\n                self.sum_values[k][0] += v * (current - self.seen_so_far)\r\n                self.sum_values[k][1] += (current - self.seen_so_far)\r\n        for k, v in exact:\r\n            if k not in self.sum_values:\r\n                self.unique_values.append(k)\r\n            self.sum_values[k] = [v, 1]\r\n        for k, v in strict:\r\n            if k not in self.sum_values:\r\n                self.unique_values.append(k)\r\n            self.sum_values[k] = v\r\n        for k, v in exp_avg:\r\n            if k not in self.exp_avg:\r\n                self.exp_avg[k] = v\r\n            else:\r\n                self.exp_avg[k] *= self.discount\r\n                self.exp_avg[k] += (1-self.discount)*v\r\n\r\n        self.seen_so_far = current\r\n\r\n        now = time.time()\r\n        if self.verbose == 1:\r\n            prev_total_width = self.total_width\r\n            sys.stdout.write(\"\\b\" * prev_total_width)\r\n            sys.stdout.write(\"\\r\")\r\n\r\n            numdigits = int(np.floor(np.log10(self.target))) + 1\r\n            barstr = '%%%dd/%%%dd [' % (numdigits, numdigits)\r\n            bar = barstr % (current, self.target)\r\n            prog = float(current)/self.target\r\n            prog_width = int(self.width*prog)\r\n            if prog_width > 0:\r\n                bar += ('='*(prog_width-1))\r\n                if current < self.target:\r\n             
       bar += '>'\r\n                else:\r\n                    bar += '='\r\n            bar += ('.'*(self.width-prog_width))\r\n            bar += ']'\r\n            sys.stdout.write(bar)\r\n            self.total_width = len(bar)\r\n\r\n            if current:\r\n                time_per_unit = (now - self.start) / current\r\n            else:\r\n                time_per_unit = 0\r\n            eta = time_per_unit*(self.target - current)\r\n            info = ''\r\n            if current < self.target:\r\n                info += ' - ETA: %ds' % eta\r\n            else:\r\n                info += ' - %ds' % (now - self.start)\r\n            for k in self.unique_values:\r\n                if type(self.sum_values[k]) is list:\r\n                    info += ' - %s: %.4f' % (k, self.sum_values[k][0] / max(1, self.sum_values[k][1]))\r\n                else:\r\n                    info += ' - %s: %s' % (k, self.sum_values[k])\r\n\r\n            # items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)\r\n            for k, v in self.exp_avg.items():\r\n                info += ' - %s: %.4f' % (k, v)\r\n\r\n            self.total_width += len(info)\r\n            if prev_total_width > self.total_width:\r\n                info += ((prev_total_width-self.total_width) * \" \")\r\n\r\n            sys.stdout.write(info)\r\n            sys.stdout.flush()\r\n\r\n            if current >= self.target:\r\n                sys.stdout.write(\"\\n\")\r\n\r\n        if self.verbose == 2:\r\n            if current >= self.target:\r\n                info = '%ds' % (now - self.start)\r\n                for k in self.unique_values:\r\n                    info += ' - %s: %.4f' % (k, self.sum_values[k][0] / max(1, self.sum_values[k][1]))\r\n                sys.stdout.write(info + \"\\n\")\r\n\r\n    def add(self, n, values=[]):\r\n        self.update(self.seen_so_far+n, values)\r\n"
  },
  {
    "path": "assignment2/utils/preprocess.py",
    "content": "import numpy as np\r\n\r\ndef greyscale(state):\r\n    \"\"\"\r\n    Preprocess state (210, 160, 3) image into\r\n    a (80, 80, 1) image in grey scale\r\n    \"\"\"\r\n    state = np.reshape(state, [210, 160, 3]).astype(np.float32)\r\n\r\n    # grey scale\r\n    state = state[:, :, 0] * 0.299 + state[:, :, 1] * 0.587 + state[:, :, 2] * 0.114\r\n\r\n    # karpathy\r\n    state = state[35:195]  # crop\r\n    state = state[::2,::2] # downsample by factor of 2\r\n\r\n    state = state[:, :, np.newaxis]\r\n\r\n    return state.astype(np.uint8)\r\n\r\n\r\ndef blackandwhite(state):\r\n    \"\"\"\r\n    Preprocess state (210, 160, 3) image into\r\n    a (80, 80, 1) image in grey scale\r\n    \"\"\"\r\n    # erase background\r\n    state[state==144] = 0\r\n    state[state==109] = 0\r\n    state[state!=0] = 1\r\n\r\n    # karpathy\r\n    state = state[35:195]  # crop\r\n    state = state[::2,::2, 0] # downsample by factor of 2\r\n\r\n    state = state[:, :, np.newaxis]\r\n\r\n    return state.astype(np.uint8)"
  },
  {
    "path": "assignment2/utils/replay_buffer.py",
    "content": "import numpy as np\r\nimport random\r\n\r\ndef sample_n_unique(sampling_f, n):\r\n    \"\"\"Helper function. Given a function `sampling_f` that returns\r\n    comparable objects, sample n such unique objects.\r\n    \"\"\"\r\n    res = []\r\n    while len(res) < n:\r\n        candidate = sampling_f()\r\n        if candidate not in res:\r\n            res.append(candidate)\r\n    return res\r\n\r\nclass ReplayBuffer(object):\r\n    \"\"\"\r\n    Taken from Berkeley's Assignment\r\n    \"\"\"\r\n    def __init__(self, size, frame_history_len):\r\n        \"\"\"This is a memory efficient implementation of the replay buffer.\r\n\r\n        The sepecific memory optimizations use here are:\r\n            - only store each frame once rather than k times\r\n              even if every observation normally consists of k last frames\r\n            - store frames as np.uint8 (actually it is most time-performance\r\n              to cast them back to float32 on GPU to minimize memory transfer\r\n              time)\r\n            - store frame_t and frame_(t+1) in the same buffer.\r\n\r\n        For the tipical use case in Atari Deep RL buffer with 1M frames the total\r\n        memory footprint of this buffer is 10^6 * 84 * 84 bytes ~= 7 gigabytes\r\n\r\n        Warning! Assumes that returning frame of zeros at the beginning\r\n        of the episode, when there is less frames than `frame_history_len`,\r\n        is acceptable.\r\n\r\n        Parameters\r\n        ----------\r\n        size: int\r\n            Max number of transitions to store in the buffer. 
When the buffer\r\n            overflows the old memories are dropped.\r\n        frame_history_len: int\r\n            Number of memories to be retrieved for each observation.\r\n        \"\"\"\r\n        self.size = size\r\n        self.frame_history_len = frame_history_len\r\n\r\n        self.next_idx      = 0\r\n        self.num_in_buffer = 0\r\n\r\n        self.obs      = None\r\n        self.action   = None\r\n        self.reward   = None\r\n        self.done     = None\r\n\r\n    def can_sample(self, batch_size):\r\n        \"\"\"Returns true if `batch_size` different transitions can be sampled from the buffer.\"\"\"\r\n        return batch_size + 1 <= self.num_in_buffer\r\n\r\n    def _encode_sample(self, idxes):\r\n        obs_batch      = np.concatenate([self._encode_observation(idx)[None] for idx in idxes], 0)\r\n        act_batch      = self.action[idxes]\r\n        rew_batch      = self.reward[idxes]\r\n        next_obs_batch = np.concatenate([self._encode_observation(idx + 1)[None] for idx in idxes], 0)\r\n        done_mask      = np.array([1.0 if self.done[idx] else 0.0 for idx in idxes], dtype=np.float32)\r\n\r\n        return obs_batch, act_batch, rew_batch, next_obs_batch, done_mask\r\n\r\n\r\n    def sample(self, batch_size):\r\n        \"\"\"Sample `batch_size` different transitions.\r\n\r\n        i-th sample transition is the following:\r\n\r\n        when observing `obs_batch[i]`, action `act_batch[i]` was taken,\r\n        after which reward `rew_batch[i]` was received and subsequent\r\n        observation `next_obs_batch[i]` was observed, unless the episode\r\n        was done which is represented by `done_mask[i]` which is equal\r\n        to 1 if episode has ended as a result of that action.\r\n\r\n        Parameters\r\n        ----------\r\n        batch_size: int\r\n            How many transitions to sample.\r\n\r\n        Returns\r\n        -------\r\n        obs_batch: np.array\r\n            Array of shape\r\n            (batch_size, 
img_h, img_w, img_c * frame_history_len)\r\n            and dtype np.uint8\r\n        act_batch: np.array\r\n            Array of shape (batch_size,) and dtype np.int32\r\n        rew_batch: np.array\r\n            Array of shape (batch_size,) and dtype np.float32\r\n        next_obs_batch: np.array\r\n            Array of shape\r\n            (batch_size, img_h, img_w, img_c * frame_history_len)\r\n            and dtype np.uint8\r\n        done_mask: np.array\r\n            Array of shape (batch_size,) and dtype np.float32\r\n        \"\"\"\r\n        assert self.can_sample(batch_size)\r\n        idxes = sample_n_unique(lambda: random.randint(0, self.num_in_buffer - 2), batch_size)\r\n        return self._encode_sample(idxes)\r\n\r\n    def encode_recent_observation(self):\r\n        \"\"\"Return the most recent `frame_history_len` frames.\r\n\r\n        Returns\r\n        -------\r\n        observation: np.array\r\n            Array of shape (img_h, img_w, img_c * frame_history_len)\r\n            and dtype np.uint8, where observation[:, :, i*img_c:(i+1)*img_c]\r\n            encodes frame at time `t - frame_history_len + i`\r\n        \"\"\"\r\n        assert self.num_in_buffer > 0\r\n        return self._encode_observation((self.next_idx - 1) % self.size)\r\n\r\n    def _encode_observation(self, idx):\r\n        end_idx   = idx + 1 # make noninclusive\r\n        start_idx = end_idx - self.frame_history_len\r\n        # this checks if we are using low-dimensional observations, such as RAM\r\n        # state, in which case we just directly return the latest RAM.\r\n        # if len(self.obs.shape) <= 2:\r\n        #     return self.obs[end_idx-1]\r\n        # if there weren't enough frames ever in the buffer for context\r\n        if start_idx < 0 and self.num_in_buffer != self.size:\r\n            start_idx = 0\r\n        for idx in range(start_idx, end_idx - 1):\r\n            if self.done[idx % self.size]:\r\n                start_idx = idx + 1\r\n        
missing_context = self.frame_history_len - (end_idx - start_idx)\r\n        # if zero padding is needed for missing context\r\n        # or we are on the boundary of the buffer\r\n        if start_idx < 0 or missing_context > 0:\r\n            frames = [np.zeros_like(self.obs[0]) for _ in range(missing_context)]\r\n            for idx in range(start_idx, end_idx):\r\n                frames.append(self.obs[idx % self.size])\r\n            return np.concatenate(frames, 2)\r\n        else:\r\n            # this optimization has the potential to save about 30% compute time \\o/\r\n            img_h, img_w = self.obs.shape[1], self.obs.shape[2]\r\n            return self.obs[start_idx:end_idx].transpose(1, 2, 0, 3).reshape(img_h, img_w, -1)\r\n\r\n    def store_frame(self, frame):\r\n        \"\"\"Store a single frame in the buffer at the next available index, overwriting\r\n        old frames if necessary.\r\n\r\n        Parameters\r\n        ----------\r\n        frame: np.array\r\n            Array of shape (img_h, img_w, img_c) and dtype np.uint8\r\n            the frame to be stored\r\n\r\n        Returns\r\n        -------\r\n        idx: int\r\n            Index at which the frame is stored. 
To be used for `store_effect` later.\r\n        \"\"\"\r\n        if self.obs is None:\r\n            self.obs      = np.empty([self.size] + list(frame.shape), dtype=np.uint8)\r\n            self.action   = np.empty([self.size],                     dtype=np.int32)\r\n            self.reward   = np.empty([self.size],                     dtype=np.float32)\r\n            self.done     = np.empty([self.size],                     dtype=np.bool_)\r\n        self.obs[self.next_idx] = frame\r\n\r\n        ret = self.next_idx\r\n        self.next_idx = (self.next_idx + 1) % self.size\r\n        self.num_in_buffer = min(self.size, self.num_in_buffer + 1)\r\n\r\n        return ret\r\n\r\n    def store_effect(self, idx, action, reward, done):\r\n        \"\"\"Store effects of action taken after observing frame stored\r\n        at index idx. The reason `store_frame` and `store_effect` are broken\r\n        up into two functions is so that one can call `encode_recent_observation`\r\n        in between.\r\n\r\n        Parameters\r\n        ----------\r\n        idx: int\r\n            Index in buffer of recently observed frame (returned by `store_frame`).\r\n        action: int\r\n            Action that was performed upon observing this frame.\r\n        reward: float\r\n            Reward that was received when the action was performed.\r\n        done: bool\r\n            True if episode was finished after performing that action.\r\n        \"\"\"\r\n        self.action[idx] = action\r\n        self.reward[idx] = reward\r\n        self.done[idx]   = done\r\n\r\n"
  },
  {
    "path": "assignment2/utils/test_env.py",
    "content": "import numpy as np\r\n\r\nclass ActionSpace(object):\r\n    def __init__(self, n):\r\n        self.n = n\r\n\r\n    def sample(self):\r\n        return np.random.randint(0, self.n)\r\n\r\n\r\nclass ObservationSpace(object):\r\n    def __init__(self, shape):\r\n        self.shape = shape\r\n        self.bad_state = np.random.randint(0, 50, shape, dtype=np.uint8)\r\n        self.normal_state = np.random.randint(100, 150, shape, dtype=np.uint8)\r\n        self.good_state = np.random.randint(200, 250, shape, dtype=np.uint8)\r\n        self.states = [self.bad_state, self.normal_state, self.good_state]\r\n\r\n\r\nclass EnvTest(object):\r\n    \"\"\"\r\n    Adapted from Igor Gitman, CMU / Karan Goel\r\n    \"\"\"\r\n    def __init__(self, shape=(84, 84, 3)):\r\n        #3 states\r\n        self.rewards = [-0.1, 0, 0.1]\r\n        self.cur_state = 0\r\n        self.num_iters = 0\r\n        self.was_in_second = False\r\n        self.action_space = ActionSpace(4)\r\n        self.observation_space = ObservationSpace(shape)\r\n        \r\n\r\n    def reset(self):\r\n        self.cur_state = 0\r\n        self.num_iters = 0\r\n        self.was_in_second = False\r\n        return self.observation_space.states[self.cur_state]\r\n        \r\n\r\n    def step(self, action):\r\n        assert(0 <= action <= 3)\r\n        self.num_iters += 1\r\n        if action < 3:\r\n            self.cur_state = action\r\n        reward = self.rewards[self.cur_state]\r\n        if self.was_in_second is True:\r\n            reward *= -10\r\n        if self.cur_state == 1:\r\n            self.was_in_second = True\r\n        else:\r\n            self.was_in_second = False\r\n        return self.observation_space.states[self.cur_state], reward, self.num_iters >= 5, {'ale.lives':0}\r\n\r\n\r\n    def render(self):\r\n        print(self.cur_state)"
  },
  {
    "path": "assignment2/utils/viewer.py",
    "content": "import pyglet\r\n\r\n\r\nclass SimpleImageViewer(object):\r\n    \"\"\"\r\n    Modified version of gym viewer to chose format (RBG or I)\r\n    see source here https://github.com/openai/gym/blob/master/gym/envs/classic_control/rendering.py\r\n    \"\"\"\r\n    def __init__(self, display=None):\r\n        self.window = None\r\n        self.isopen = False\r\n        self.display = display\r\n\r\n\r\n    def imshow(self, arr):\r\n        if self.window is None:\r\n            height, width, channels = arr.shape\r\n            self.window = pyglet.window.Window(width=width, height=height, display=self.display)\r\n            self.width = width\r\n            self.height = height\r\n            self.isopen = True\r\n\r\n        ##########################\r\n        ####### old version ######\r\n        # assert arr.shape == (self.height, self.width, I), \"You passed in an image with the wrong number shape\"\r\n        # image = pyglet.image.ImageData(self.width, self.height, 'RGB', arr.tobytes())\r\n        ##########################\r\n        \r\n        ##########################\r\n        ####### new version ######\r\n        nchannels = arr.shape[-1]\r\n        if nchannels == 1:\r\n            _format = \"I\"\r\n        elif nchannels == 3:\r\n            _format = \"RGB\"\r\n        else:\r\n            raise NotImplementedError\r\n        image = pyglet.image.ImageData(self.width, self.height, _format, arr.tobytes())\r\n        ##########################\r\n        \r\n        self.window.clear()\r\n        self.window.switch_to()\r\n        self.window.dispatch_events()\r\n        image.blit(0,0)\r\n        self.window.flip()\r\n\r\n\r\n    def close(self):\r\n        if self.isopen:\r\n            self.window.close()\r\n            self.isopen = False\r\n\r\n\r\n    def __del__(self):\r\n        self.close()\r\n"
  },
  {
    "path": "assignment2/utils/wrappers.py",
    "content": "import numpy as np\r\nimport gym\r\nfrom gym import spaces\r\nfrom viewer import SimpleImageViewer\r\nfrom collections import deque\r\n\r\n\r\nclass MaxAndSkipEnv(gym.Wrapper):\r\n    \"\"\"\r\n    Wrapper from Berkeley's Assignment\r\n    Takes a max pool over the last n states\r\n    \"\"\"\r\n    def __init__(self, env=None, skip=4):\r\n        \"\"\"Return only every `skip`-th frame\"\"\"\r\n        super(MaxAndSkipEnv, self).__init__(env)\r\n        # most recent raw observations (for max pooling across time steps)\r\n        self._obs_buffer = deque(maxlen=2)\r\n        self._skip       = skip\r\n\r\n    def _step(self, action):\r\n        total_reward = 0.0\r\n        done = None\r\n        for _ in range(self._skip):\r\n            obs, reward, done, info = self.env.step(action)\r\n            self._obs_buffer.append(obs)\r\n            total_reward += reward\r\n            if done:\r\n                break\r\n\r\n        max_frame = np.max(np.stack(self._obs_buffer), axis=0)\r\n\r\n        return max_frame, total_reward, done, info\r\n\r\n    def _reset(self):\r\n        \"\"\"Clear past frame buffer and init. to first obs. 
from inner env.\"\"\"\r\n        self._obs_buffer.clear()\r\n        obs = self.env.reset()\r\n        self._obs_buffer.append(obs)\r\n        return obs\r\n\r\n\r\nclass PreproWrapper(gym.Wrapper):\r\n    \"\"\"\r\n    Wrapper for Pong to apply preprocessing\r\n    Stores the state into variable self.obs\r\n    \"\"\"\r\n    def __init__(self, env, prepro, shape, overwrite_render=True, high=255):\r\n        \"\"\"\r\n        Args:\r\n            env: (gym env)\r\n            prepro: (function) to apply to a state for preprocessing\r\n            shape: (list) shape of obs after prepro\r\n            overwrite_render: (bool) if True, render is overwriten to vizualise effect of prepro\r\n            grey_scale: (bool) if True, assume grey scale, else black and white\r\n            high: (int) max value of state after prepro\r\n        \"\"\"\r\n        super(PreproWrapper, self).__init__(env)\r\n        self.overwrite_render = overwrite_render\r\n        self.viewer = None\r\n        self.prepro = prepro\r\n        self.observation_space = spaces.Box(low=0, high=high, shape=shape)\r\n        self.high = high\r\n\r\n\r\n    def _step(self, action):\r\n        \"\"\"\r\n        Overwrites _step function from environment to apply preprocess\r\n        \"\"\"\r\n        obs, reward, done, info = self.env.step(action)\r\n        self.obs = self.prepro(obs)\r\n        return self.obs, reward, done, info\r\n\r\n\r\n    def _reset(self):\r\n        self.obs = self.prepro(self.env.reset())\r\n        return self.obs\r\n\r\n\r\n    def _render(self, mode='human', close=False):\r\n        \"\"\"\r\n        Overwrite _render function to vizualize preprocessing\r\n        \"\"\"\r\n\r\n        if self.overwrite_render:\r\n            if close:\r\n                if self.viewer is not None:\r\n                    self.viewer.close()\r\n                    self.viewer = None\r\n                return\r\n            img = self.obs\r\n            if mode == 'rgb_array':\r\n          
      return img\r\n            elif mode == 'human':\r\n                from gym.envs.classic_control import rendering\r\n                if self.viewer is None:\r\n                    self.viewer = SimpleImageViewer()\r\n                self.viewer.imshow(img)\r\n\r\n        else:\r\n            return super(PreproWrapper, self)._render(mode, close)\r\n"
  },
  {
    "path": "assignment3/discrete_env.py",
    "content": "import numpy as np\n\nfrom gym import Env, spaces\nfrom gym.utils import seeding\n\ndef categorical_sample(prob_n, np_random):\n    \"\"\"\n    Sample from categorical distribution\n    Each row specifies class probabilities\n    \"\"\"\n    prob_n = np.asarray(prob_n)\n    csprob_n = np.cumsum(prob_n)\n    return (csprob_n > np_random.rand()).argmax()\n\n\nclass DiscreteEnv(Env):\n\n    \"\"\"\n    Has the following members\n    - nS: number of states\n    - nA: number of actions\n    - P: transitions (*)\n    - isd: initial state distribution (**)\n\n    (*) dictionary dict of dicts of lists, where\n      P[s][a] == [(probability, nextstate, reward, done), ...]\n    (**) list or array of length nS\n\n\n    \"\"\"\n    def __init__(self, nS, nA, P, isd):\n        self.P = P\n        self.isd = isd\n        self.lastaction=None # for rendering\n        self.nS = nS\n        self.nA = nA\n\n        self.action_space = spaces.Discrete(self.nA)\n        self.observation_space = spaces.Discrete(self.nS)\n\n        self._seed()\n        self._reset()\n\n    def _seed(self, seed=None):\n        self.np_random, seed = seeding.np_random(seed)\n        return [seed]\n\n    def _reset(self):\n        self.s = categorical_sample(self.isd, self.np_random)\n        self.lastaction=None\n        return self.s\n\n    def _step(self, a):\n        transitions = self.P[self.s][a]\n        i = categorical_sample([t[0] for t in transitions], self.np_random)\n        p, s, r, d= transitions[i]\n        self.s = s\n        self.lastaction=a\n        return (s, r, d, {\"prob\" : p})\n"
  },
  {
    "path": "assignment3/frozen_lake.py",
    "content": "import numpy as np\nimport sys\nfrom six import StringIO, b\n\nfrom gym import utils\nimport discrete_env\n\nLEFT = 0\nDOWN = 1\nRIGHT = 2\nUP = 3\n\nMAPS = {\n\n    \"4x4\": [\n        \"SHHH\",\n        \"FHHH\",\n        \"FHHH\",\n        \"FFFG\"\n    ]\n}\n\nclass FrozenLakeEnv(discrete_env.DiscreteEnv):\n    \"\"\"\n    Winter is here. You and your friends were tossing around a frisbee at the park\n    when you made a wild throw that left the frisbee out in the middle of the lake.\n    The water is mostly frozen, but there are a few holes where the ice has melted.\n    If you step into one of those holes, you'll fall into the freezing water.\n    At this time, there's an international frisbee shortage, so it's absolutely imperative that\n    you navigate across the lake and retrieve the disc.\n    However, the ice is slippery, so you won't always move in the direction you intend.\n    The surface is described using a grid like the following\n\n        SHHH\n        FHHH\n        FHHH\n        FFFG\n\n    S : starting point, safe\n    F : frozen surface, safe\n    H : hole, you cannot move to these place\n    G : goal, where the frisbee is located\n\n    The episode ends when you reach the goal or fall in a hole or reach max steps\n    You receive a reward of 1 if you reach the goal, and zero otherwise.\n    \"\"\"\n\n    metadata = {'render.modes': ['human', 'ansi']}\n\n    def __init__(self, desc=None, map_name=\"4x4\",is_slippery=False):\n\n        if desc is None and map_name is None:\n            raise ValueError('Must provide either desc or map_name')\n        elif desc is None:\n            desc = MAPS[map_name]\n        self.desc = desc = np.asarray(desc,dtype='c')\n        self.nrow, self.ncol = nrow, ncol = desc.shape\n\n        nA = 4\n        nS = nrow * ncol\n\n        isd = np.array(desc == b'S').astype('float64').ravel()\n        isd /= isd.sum()\n\n        P = {s : {a : [] for a in range(nA)} for s in range(nS)}\n\n        
self.a_true = []\n        for s in range(nS):\n            a_true_table = np.arange(4)\n            np.random.shuffle(a_true_table)\n            self.a_true.append(a_true_table)\n\n        def to_s(row, col):\n            return row*ncol + col\n        def inc(row, col, a):\n            a_true_table = self.a_true[to_s(row, col)]\n\n            if a_true_table[a]==0: # left\n                col = max(col-1,0)\n            elif a_true_table[a]==1: # down\n                row = min(row+1,nrow-1)\n            elif a_true_table[a]==2: # right\n                col = min(col+1,ncol-1)\n            elif a_true_table[a]==3: # up\n                row = max(row-1,0)\n            return (row, col)\n\n        for row in range(nrow):\n            for col in range(ncol):\n                s = to_s(row, col)\n                if desc[row, col] == b\"H\":\n                    continue\n                for a in range(4):\n                    li = P[s][a]\n                    letter = desc[row, col]\n                    if letter in b'GH':\n                        li.append((1.0, s, 0, True))\n                    else:\n                        if is_slippery:\n                            for b in [(a-1)%4, a, (a+1)%4]:\n                                newrow, newcol = inc(row, col, b)\n                                newstate = to_s(newrow, newcol)\n                                newletter = desc[newrow, newcol]\n\n                                 # if meet hole, stay at original place\n                                if newletter == b'H':\n                                    li.append((1.0, s, 0.0, False))\n                                    continue\n                                \n                                done = bytes(newletter) in b'GH'\n                                rew = float(newletter == b'G')\n                                li.append((0.8 if b==a else 0.1, newstate, rew, done))\n                        else:\n                            newrow, newcol = inc(row, 
col, a)\n                            newstate = to_s(newrow, newcol)\n                            newletter = desc[newrow, newcol]\n                            # if meet hole, stay at original place\n                            if newletter == b'H':\n                                li.append((1.0, s, 0.0, False))\n                                continue\n\n                            done = bytes(newletter) in b'GH'\n                            rew = float(newletter == b'G')\n                            li.append((1.0, newstate, rew, done))\n\n        super(FrozenLakeEnv, self).__init__(nS, nA, P, isd)\n\n    def _render(self, mode='human', close=False):\n        if close:\n            return\n        outfile = StringIO() if mode == 'ansi' else sys.stdout\n\n        row, col = self.s // self.ncol, self.s % self.ncol\n        desc = self.desc.tolist()\n        desc = [[c.decode('utf-8') for c in line] for line in desc]\n        desc[row][col] = utils.colorize(desc[row][col], \"red\", highlight=True)\n        if self.lastaction is not None:\n            outfile.write(\"  ({})\\n\".format([\"Left\",\"Down\",\"Right\",\"Up\"][self.lastaction]))\n        else:\n            outfile.write(\"\\n\")\n        outfile.write(\"\\n\".join(''.join(line) for line in desc)+\"\\n\")\n\n        return outfile\n"
  },
  {
    "path": "assignment3/q1.py",
    "content": "import math\nimport gym\nfrom frozen_lake import *\nimport numpy as np\nimport time\nfrom utils import *\nimport matplotlib.pyplot as plt\nfrom tqdm import *\n\n\ndef rmax(env, gamma, m, R_max, epsilon, num_episodes, max_step = 6):\n    \"\"\"Learn state-action values using the Rmax algorithm\n\n    Args:\n    ----------\n    env: gym.core.Environment\n        Environment to compute Q function for. Must have nS, nA, and P as\n        attributes.\n    gamma: float\n        Discount factor. Number in range [0, 1)\n    m: int\n        \tThreshold of visitance\n    R_max: float \n        The estimated max reward that could be obtained in the game\n    epsilon: \n        accuracy paramter\n    num_episodes: int \n        Number of episodes of training.\n    max_step: Int\n        max number of steps in each episode\n\n    Returns\n    -------\n    np.array\n    An array of shape [env.nS x env.nA] representing state-action values\n    \"\"\"\n\n    Q = np.ones((env.nS, env.nA)) * R_max / (1 - gamma)\n    R = np.zeros((env.nS, env.nA))\n    nSA = np.zeros((env.nS, env.nA))\n    nSASP = np.zeros((env.nS, env.nA, env.nS))\n    ########################################################\n    #                   YOUR CODE HERE                     #\n    ########################################################\n    total_score = 0\n    average_score = np.zeros(num_episodes)\n    for time in range(num_episodes):\n        is_done = False\n        cur_state = env.reset()\n        for _ in range(max_step):\n            if is_done:\n                break\n            action = np.argmax(Q[cur_state])\n            (next_state, reward, is_done, _) = env.step(action)\n            total_score += reward\n            if nSA[cur_state][action] < m:\n                nSA[cur_state][action] += 1\n                R[cur_state][action] += reward\n                nSASP[cur_state][action][next_state] +=1\n                if nSA[cur_state][action] == m:\n                    up_bound = 
int(np.ceil(np.log(1.0/(epsilon*(1.0-gamma)))/(1.0-gamma)))\n                    for i in range(up_bound):\n                        for s in range(env.nS):\n                            for a in range(env.nA):\n                                if nSA[s][a] >= m:\n                                    q_temp = R[s][a] / nSA[s][a]\n                                    for j in range(env.nS):\n                                        prob = nSASP[s][a][j] / nSA[s][a]    \n                                        q_temp += gamma*prob*np.max(Q[j])\n                                    Q[s][a] = q_temp\n            cur_state = next_state\n        average_score[time] = total_score / (time+1)\n    ########################################################\n    #                    END YOUR CODE                     #\n    ########################################################\n    return (Q, average_score)\n\n\ndef main():\n    env = FrozenLakeEnv(is_slippery=False)\n    print env.__doc__\n    for m in tqdm(np.arange(1,20,2)):\n        (Q, average_score) = rmax(env, gamma = 0.99, m=m, R_max = 1, epsilon = 0.1, num_episodes = 1000)\n        render_single_Q(env, Q)\n        plt.plot(np.arange(1000),np.array(average_score))\n    plt.title('The running average score of the R-max learning agent')\n    plt.xlabel('traning episodes')\n    plt.ylabel('score')\n    plt.legend(['m = '+str(i) for i in np.arange(1,20,2)], loc='upper right')\n    #plt.show()\n    plt.savefig('r-max.jpg')\n\nif __name__ == '__main__':\n    print \"haha\"\n    main()"
  },
  {
    "path": "assignment3/q2.py",
    "content": "import math\nimport gym\nfrom frozen_lake import *\nimport numpy as np\nimport time\nfrom utils import *\nfrom tqdm import *\nimport matplotlib.pyplot as plt\n\ndef learn_Q_QLearning(env, num_episodes=10000, gamma = 0.99, lr = 0.1, e = 0.2, max_step=6):\n    \"\"\"Learn state-action values using the Q-learning algorithm with epsilon-greedy exploration strategy(no decay)\n    Feel free to reuse your assignment1's code\n    Parameters\n    ----------\n    env: gym.core.Environment\n        Environment to compute Q function for. Must have nS, nA, and P as attributes.\n    num_episodes: int \n        Number of episodes of training.\n    gamma: float\n        Discount factor. Number in range [0, 1)\n    learning_rate: float\n        Learning rate. Number in range [0, 1)\n    e: float\n        Epsilon value used in the epsilon-greedy method. \n    max_step: Int\n        max number of steps in each episode\n\n    Returns\n    -------\n    np.array\n        An array of shape [env.nS x env.nA] representing state-action values\n    \"\"\"\n\n    Q = np.zeros((env.nS, env.nA))\n    ########################################################\n    #                     YOUR CODE HERE                   #\n    ######################################################## \n    total_score = 0\n    average_score = np.zeros(num_episodes)\n    for i in range(num_episodes):\n        done = False\n        state = env.reset()\n        for _ in range(max_step):\n            if done:\n                break\n            if np.random.rand() > e:\n                action = np.argmax(Q[state])\n            else:\n                action = np.random.randint(env.nA)\n            nextstate, reward, done, _ = env.step(action)\n            Q[state][action] = (1-lr)*Q[state][action]+lr*(reward+gamma*np.max(Q[nextstate]))\n            state = nextstate\n        total_score += reward\n        average_score[i] = total_score / (i+1)\n\n    
########################################################\n    #                     END YOUR CODE                    #\n    ########################################################\n    return (Q, average_score)\n\n\n\ndef main():\n    env = FrozenLakeEnv(is_slippery=False)\n    for e in tqdm(np.linspace(0,1,11)):\n        (Q, average_score) = learn_Q_QLearning(env, num_episodes = 10000, gamma = 0.99, lr = 0.1, e = e)\n        render_single_Q(env, Q)\n        plt.plot(np.arange(10000), np.array(average_score))\n    plt.title('The running average score of the Q-learning agent')\n    plt.xlabel('training episodes')\n    plt.ylabel('score')\n    plt.legend(['e = '+str(i) for i in np.linspace(0,1,11)], loc='upper right')\n    #plt.show()\n    plt.savefig('q-learning.jpg')\n\n\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "assignment3/q3.py",
    "content": "import math\nimport gym\nfrom frozen_lake import *\nimport numpy as np\nimport time\nfrom utils import *\nimport matplotlib.pyplot as plt\nfrom tqdm import *\n\n\ndef rmax(env, gamma, m, R_max, epsilon, num_episodes, max_step = 6, e = 0.7):\n    \"\"\"Learn state-action values using the Rmax algorithm\n\n    Args:\n    ----------\n    env: gym.core.Environment\n        Environment to compute Q function for. Must have nS, nA, and P as\n        attributes.\n    gamma: float\n        Discount factor. Number in range [0, 1)\n    m: int\n        \tThreshold of visitance\n    R_max: float \n        The estimated max reward that could be obtained in the game\n    epsilon: \n        accuracy paramter\n    num_episodes: int \n        Number of episodes of training.\n    max_step: Int\n        max number of steps in each episode\n\n    Returns\n    -------\n    np.array\n    An array of shape [env.nS x env.nA] representing state-action values\n    \"\"\"\n\n    Q = np.ones((env.nS, env.nA)) * R_max / (1 - gamma)\n    R = np.zeros((env.nS, env.nA))\n    nSA = np.zeros((env.nS, env.nA))\n    nSASP = np.zeros((env.nS, env.nA, env.nS))\n    ########################################################\n    #                   YOUR CODE HERE                     #\n    ########################################################\n    total_score = 0\n    average_score = np.zeros(num_episodes)\n    for time in range(num_episodes):\n        is_done = False\n        cur_state = env.reset()\n        for _ in range(max_step):\n            if is_done:\n                break\n            if np.random.rand() > e:\n                action = np.argmax(Q[cur_state])\n            else:\n                action = np.random.randint(env.nA)\n            (next_state, reward, is_done, _) = env.step(action)\n            total_score += reward\n            if nSA[cur_state][action] < m:\n                nSA[cur_state][action] += 1\n                R[cur_state][action] += reward\n                
nSASP[cur_state][action][next_state] += 1\n                if nSA[cur_state][action] == m:\n                    # the pair just became known: re-run value iteration on the learned model\n                    up_bound = int(np.ceil(np.log(1.0/(epsilon*(1.0-gamma)))/(1.0-gamma)))\n                    for i in range(up_bound):\n                        for s in range(env.nS):\n                            for a in range(env.nA):\n                                if nSA[s][a] >= m:\n                                    q_temp = R[s][a] / nSA[s][a]\n                                    for j in range(env.nS):\n                                        prob = nSASP[s][a][j] / nSA[s][a]\n                                        q_temp += gamma*prob*np.max(Q[j])\n                                    Q[s][a] = q_temp\n            cur_state = next_state\n        average_score[time] = total_score / (time+1)\n    ########################################################\n    #                    END YOUR CODE                     #\n    ########################################################\n    return (Q, average_score)\n\n\ndef main():\n    env = FrozenLakeEnv(is_slippery=False)\n    print(env.__doc__)\n    (Q, average_score) = rmax(env, gamma = 0.99, m=1, R_max = 1, epsilon = 0.1, num_episodes = 1000)\n    render_single_Q(env, Q)\n    plt.plot(np.arange(1000),np.array(average_score))\n    plt.title('The running average score of the R-max with e-greedy learning agent')\n    plt.xlabel('training episodes')\n    plt.ylabel('score')\n    #plt.show()\n    plt.savefig('r-max+e_greedy.jpg')\n\nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "assignment3/requirements.txt",
    "content": "matplotlib\nnumpy\nsix"
  },
  {
    "path": "assignment3/utils.py",
    "content": "import math\nimport gym\nfrom frozen_lake import *\nimport numpy as np\nimport time\n\n\ndef render_single_Q(env, Q, max_step = 6):\n\t\"\"\"Renders Q function once on environment.\n\n    Parameters\n    ----------\n    env: gym.core.Environment\n      Environment to play Q function on. Must have nS, nA, and P as\n      attributes.\n    Q: np.array of shape [env.nS x env.nA]\n      Q function\n\t\"\"\"\n\n\tstate = env.reset()\n\tdone = False\n\tepisode_reward = 0\n\tcount = 0\n\twhile not done:\n\t\tenv.render()\n\t\ttime.sleep(0.5) # Seconds between frames. Modify as you wish.\n\t\taction = np.argmax(Q[state])\n\t\tstate, reward, done, _ = env.step(action)\n\t\tepisode_reward += reward\n\n\t\tcount += 1\n\t\tif count >= max_step:\n\t\t\tbreak\n\n\tprint \"Episode reward: %d\" % episode_reward\n"
  }
]