Repository: zlpure/CS234
Branch: master
Commit: 2d92db348c3d
Files: 72
Total size: 136.8 KB
Directory structure:
gitextract_onbbhi0g/
├── LICENSE
├── README.md
├── assignment1/
│ ├── Makefile
│ ├── collect_submission.sh
│ ├── lake_envs.py
│ ├── log
│ ├── model_based_learning.py
│ ├── model_free_learning.py
│ ├── requirements.txt
│ └── vi_and_pi.py
├── assignment2/
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── collect_submission.sh
│ ├── configs/
│ │ ├── __init__.py
│ │ ├── frozen_lake.py
│ │ ├── q2_linear.py
│ │ ├── q3_nature.py
│ │ ├── q4_train_atari_linear.py
│ │ ├── q5_train_atari_nature.py
│ │ ├── q6_bonus_question.py
│ │ └── test.py
│ ├── core/
│ │ ├── __init__.py
│ │ ├── deep_q_learning.py
│ │ └── q_learning.py
│ ├── q1_schedule.py
│ ├── q2_linear.py
│ ├── q3_nature.py
│ ├── q4_train_atari_linear.py
│ ├── q5_train_atari_nature.py
│ ├── q6_double_q_learning.py
│ ├── q6_dueling.py
│ ├── requirements.txt
│ ├── results/
│ │ ├── q2_linear/
│ │ │ ├── events.out.tfevents.1511874609.zengliang-PU551LD
│ │ │ ├── log.txt
│ │ │ └── model.weights/
│ │ │ ├── .data-00000-of-00001
│ │ │ ├── .index
│ │ │ ├── .meta
│ │ │ └── checkpoint
│ │ ├── q3_nature/
│ │ │ ├── events.out.tfevents.1511876195.zengliang-PU551LD
│ │ │ ├── log.txt
│ │ │ └── model.weights/
│ │ │ ├── .index
│ │ │ ├── .meta
│ │ │ └── checkpoint
│ │ └── q4_train_atari_linear/
│ │ ├── log.txt
│ │ ├── model.weights/
│ │ │ ├── .data-00000-of-00001
│ │ │ ├── .index
│ │ │ ├── .meta
│ │ │ └── checkpoint
│ │ └── monitor/
│ │ ├── openaigym.episode_batch.0.2799.stats.json
│ │ ├── openaigym.episode_batch.0.3758.stats.json
│ │ ├── openaigym.episode_batch.0.5469.stats.json
│ │ ├── openaigym.manifest.0.2799.manifest.json
│ │ ├── openaigym.manifest.0.3758.manifest.json
│ │ ├── openaigym.manifest.0.5469.manifest.json
│ │ ├── openaigym.video.0.2799.video000000.meta.json
│ │ ├── openaigym.video.0.3758.video000000.meta.json
│ │ └── openaigym.video.0.5469.video000000.meta.json
│ └── utils/
│ ├── __init__.py
│ ├── general.py
│ ├── preprocess.py
│ ├── replay_buffer.py
│ ├── test_env.py
│ ├── viewer.py
│ └── wrappers.py
└── assignment3/
├── discrete_env.py
├── frozen_lake.py
├── q1.py
├── q2.py
├── q3.py
├── requirements.txt
└── utils.py
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2017 Liang Zeng
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
## My Solution to Assignments of CS234
This is my solution to three assignments of CS234.<br>
[CS234: Deep Reinforcement Learning](http://cs234.stanford.edu/) is
an interesting class, which teaches you what is the reinforcement learning:
Learn to make good sequences of decisions. This class provides some basic knowledge and insights of cutting-edge research in reinforcement learning. More details are as follows:
* Define the key features of RL vs AI & other ML
* Define MDP, POMDP, bandit, batch offline RL, online RL
* Describe the exploration vs exploitation challenge and compare and contrast 2 or more approaches
* Given an application problem (e.g. from computer vision, robotics, etc) decide if it should be formulated as a RL problem, if yes how to formulate, what algorithm (from class) is best suited to address, and justify an answer
* Implement several RL algorithms incl. a deep RL approach
* Describe multiple criteria for analyzing RL algorithms and evaluate algorithms on these metrics: e.g. regret, sample complexity, computational complexity, convergence, etc.
* List at least two open challenges or hot topics in RL
******
**Note:** If you consult my source codes that you may want to incorporate into your algorithm or system, you should clearly cite references in your codes.
******
## Table of Contents
* [Assignment 1](https://github.com/zlpure/CS234/tree/master/assignment1)
* Bellman Operator Properties
* Value Iteration
* Grid Policies
* Frozen Lake MDP
* Frozen Lake Reinforcement Learning
* [Assignment 2](https://github.com/zlpure/CS234/tree/master/assignment2)
* Q-learning
* Linear Approximation
* Deepmind's DQN
* (Bonus) Double DQN
* (Bonus) Dueling DQN
* [Assignment 3](https://github.com/zlpure/CS234/tree/master/assignment3)
* R-max algorithm
* epsilon-greedy q-learning
* Expected Regret Bounds
## Dependencies
* Anaconda
* tensorflow>=0.12
* matplotlib
* scipy
* numpy
* sklearn
* six
## Author
[@zlpure](github.com/zlpure)
================================================
FILE: assignment1/Makefile
================================================
submit:
sh collect_submission.sh
clean:
rm -f assignment1.zip
rm -f *.pyc *.png *.npy utils/*.pyc
================================================
FILE: assignment1/collect_submission.sh
================================================
rm -f assignment1.zip
zip -r assignment1.zip *.py *.ipynb
================================================
FILE: assignment1/lake_envs.py
================================================
# coding: utf-8
"""Defines some frozen lake maps."""
from gym.envs.toy_text import frozen_lake, discrete
from gym.envs.registration import register
register(
id='Deterministic-4x4-FrozenLake-v0',
entry_point='gym.envs.toy_text.frozen_lake:FrozenLakeEnv',
kwargs={'map_name': '4x4',
'is_slippery': False})
register(
id='Deterministic-8x8-FrozenLake-v0',
entry_point='gym.envs.toy_text.frozen_lake:FrozenLakeEnv',
kwargs={'map_name': '8x8',
'is_slippery': False})
register(
id='Stochastic-4x4-FrozenLake-v0',
entry_point='gym.envs.toy_text.frozen_lake:FrozenLakeEnv',
kwargs={'map_name': '4x4',
'is_slippery': True})
================================================
FILE: assignment1/log
================================================
Winter is here. You and your friends were tossing around a frisbee at the park
when you made a wild throw that left the frisbee out in the middle of the lake.
The water is mostly frozen, but there are a few holes where the ice has melted.
If you step into one of those holes, you'll fall into the freezing water.
At this time, there's an international frisbee shortage, so it's absolutely imperative that
you navigate across the lake and retrieve the disc.
However, the ice is slippery, so you won't always move in the direction you intend.
The surface is described using a grid like the following
SFFF
FHFH
FFFH
HFFG
S : starting point, safe
F : frozen surface, safe
H : hole, fall to your doom
G : goal, where the frisbee is located
The episode ends when you reach the goal or fall in a hole.
You receive a reward of 1 if you reach the goal, and zero otherwise.
[(0.3333333333333333, 13, 0.0, False), (0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 9, 0.0, False)]
[(0.3333333333333333, 13, 0.0, False), (0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 15, 1.0, True)]
1.0 1
0.3 0
0.3 1
0.09 0
0.3 0
1.0 1
0.09 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.027 0
0.0081 0
0.0081 0
0.09 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
0.000729 0
0.00243 0
0.00243 0
0.0081 1
0.0081 0
0.027 0
0.00243 0
0.0081 0
0.000729 0
0.00243 0
0.0081 0
0.0081 0
0.09 0
0.00243 0
0.0081 0
0.027 1
0.0081 0
0.09 0
0.027 0
0.3 0
0.027 0
0.09 0
0.3 1
0.09 0
0.3 0
1.0 1
[ 0.002 0.008 0.027 0.008 0.008 0. 0.09 0. 0.027 0.09 0.3
0. 0. 0.3 1. 0. ]
[0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0]
================================================
FILE: assignment1/model_based_learning.py
================================================
### Episodic Model Based Learning using Maximum Likelihood Estimate of the Environment
# Do not change the arguments and output types of any of the functions provided! You may debug in Main and elsewhere.
import numpy as np
import gym
import time
from lake_envs import *
import matplotlib.pyplot as plt
from tqdm import *
from vi_and_pi import value_iteration
from vi_and_pi import policy_iteration
def initialize_P(nS, nA):
"""Initializes a uniformly random model of the environment with 0 rewards.
Parameters
----------
nS: int
Number of states
nA: int
Number of actions
Returns
-------
P: np.array of shape [nS x nA x nS x 4] where items are tuples representing transition information
P[state][action] is a list of (prob, next_state, reward, done) tuples.
"""
P = [[[(1.0/nS, i, 0, False) for i in range(nS)] for _ in range(nA)] for _ in range(nS)]
return P
def initialize_counts(nS, nA):
"""Initializes a counts array.
Parameters
----------
nS: int
Number of states
nA: int
Number of actions
Returns
-------
counts: np.array of shape [nS x nA x nS]
counts[state][action][next_state] is the number of times that doing "action" at state "state" transitioned to "next_state"
"""
counts = [[[0 for _ in range(nS)] for _ in range(nA)] for _ in range(nS)]
return counts
def initialize_rewards(nS, nA):
"""Initializes a rewards array. Values represent running averages.
Parameters
----------
nS: int
Number of states
nA: int
Number of actions
Returns
-------
rewards: array of shape [nS x nA x nS]
counts[state][action][next_state] is the running average of rewards of doing "action" at "state" transtioned to "next_state"
"""
rewards = [[[0 for _ in range (nS)] for _ in range(nA)] for _ in range(nS)]
return rewards
def counts_and_rewards_to_P(counts, rewards, terminal_state):
"""Converts counts and rewards arrays to a P array consistent with the Gym environment data structure for a model of the environment.
Use this function to convert your counts and rewards arrays to a P that you can use in value iteration.
Parameters
----------
counts: array of shape [nS x nA x nS]
counts[state][action][next_state] is the number of times that doing "action" at state "state" transitioned to "next_state"
rewards: array of shape [nS x nA x nS]
counts[state][action][next_state] is the running average of rewards of doing "action" at "state" transtioned to "next_state"
Returns
-------
P: np.array of shape [nS x nA x nS' x 4] where items are tuples representing transition information
P[state][action] is a list of (prob, next_state, reward, done) tuples.
"""
nS = len(counts)
nA = len(counts[0])
P = [[[] for _ in range(nA)] for _ in range(nS)]
for state in range(nS):
for action in range(nA):
if sum(counts[state][action]) != 0:
for next_state in range(nS):
if counts[state][action][next_state] != 0:
prob = float(counts[state][action][next_state]) / float(sum(counts[state][action]))
reward = rewards[state][action][next_state]
if next_state in terminal_state:
P[state][action].append((prob, next_state, reward, True))
else:
P[state][action].append((prob, next_state, reward, False))
else:
prob = 1.0 / float(nS)
for next_state in range(nS):
P[state][action].append((prob, next_state, 0, False))
#for action in range(nA):
#P[nS-2][2][nS-1] = (1.0, nS-1, 1, True)
return P
def update_mdp_model_with_history(counts, rewards, history):
"""Given a history of an entire episode, update the count and rewards arrays
Parameters
----------
counts: array of shape [nS x nA x nS]
counts[state][action][next_state] is the number of times that doing "action" at state "state" transitioned to "next_state"
rewards: array of shape [nS x nA x nS]
counts[state][action][next_state] is the running average of rewards of doing "action" at "state" transtioned to "next_state"
history:
a list of [state, action, reward, next_state, done]
"""
# HINT: For terminal states, we define that the probability of any action returning the state to itself is 1 (with zero reward)
# Make sure you record this information in your counts array by updating the counts for this accordingly for your
# value iteration to work.
############################
# YOUR IMPLEMENTATION HERE #
for item in history:
#print item
(state, action, reward, next_state, done) = item
#if not done:
# counts[state][action][next_state] += 1
# rewards[state][action][next_state] = float(rewards[state][action][next_state]+reward) / counts[state][action][next_state]
#else:
# counts[state][action][next_state] = 1
# rewards[state][action][next_state] = float(rewards[state][action][next_state]+reward) / counts[state][action][next_state]
counts[state][action][next_state] += 1
all_reward = float(rewards[state][action][next_state]*(counts[state][action][next_state]-1)+reward)
rewards[state][action][next_state] = all_reward / counts[state][action][next_state]
############################
return counts, rewards
def learn_with_mdp_model(env, method=None, num_episodes=5000, gamma = 0.95, e = 0.8, decay_rate = 0.99):
"""Build a model of the environment and use value iteration to learn a policy. In the next episode, play with the new
policy using epsilon-greedy exploration.
Your model of the environment should be based on updating counts and rewards arrays. The counts array counts the number
of times that "state" with "action" led to "next_state", and the rewards array is the running average of rewards for
going from at "state" with "action" leading to "next_state".
For a single episode, create a list called "history" with all the experience
from that episode, then update the "counts" and "rewards" arrays using the function "update_mdp_model_with_history".
You may then call the prewritten function "counts_and_rewards_to_P" to convert your counts and rewards arrays to
an environment data structure P consistent with the Gym environment's one. You may then call on value_iteration(P, nS, nA)
to get a policy.
Parameters
----------
env: gym.core.Environment
Environment to compute Q function for. Must have nS, nA, and P as
attributes.
num_episodes: int
Number of episodes of training.
gamma: float
Discount factor. Number in range [0, 1)
learning_rate: float
Learning rate. Number in range [0, 1)
e: float
Epsilon value used in the epsilon-greedy method.
decay_rate: float
Rate at which epsilon falls. Number in range [0, 1)
Returns
-------
policy: np.array
An array of shape [env.nS] representing the action to take at a given state.
"""
P = initialize_P(env.nS, env.nA)
counts = initialize_counts(env.nS, env.nA)
rewards = initialize_rewards(env.nS, env.nA)
############################
# YOUR IMPLEMENTATION HERE #
new_policy = np.zeros((env.nS)).astype(int)
terminal_state = []
for i in range(num_episodes):
done = False
state = env.reset()
his = []
while not done:
if np.random.rand() > e:
action = new_policy[state]
else:
action = np.random.randint(env.nA)
nextstate, reward, done, _ = env.step(action)
his.append([state, action, reward, nextstate, done])
state = nextstate
if state not in terminal_state:
terminal_state.append(state)
counts, rewards = update_mdp_model_with_history(counts, rewards, his)
P = counts_and_rewards_to_P(counts, rewards, terminal_state)
_, new_policy = method(P, env.nS, env.nA, gamma)
if i%10 == 0:
e *= decay_rate
############################
return new_policy
def render_single(env, policy):
"""Renders policy once on environment. Watch your agent play!
Parameters
----------
env: gym.core.Environment
Environment to play on. Must have nS, nA, and P as
attributes.
Policy: np.array of shape [env.nS]
The action to take at a given state
"""
episode_reward = 0
state = env.reset()
done = False
while not done:
#env.render()
#time.sleep(0.5) # Seconds between frames. Modify as you wish.
action = policy[state]
state, reward, done, _ = env.step(action)
episode_reward += reward
#print "Episode reward: %f" % episode_reward
return episode_reward
# Feel free to run your own debug code in main!
def main():
env = gym.make('Stochastic-4x4-FrozenLake-v0')
#render_single(env, policy)
#print policy
score1 = []
score2 = []
average_score1 = []
average_score2 = []
for i in tqdm(np.arange(1, 5000, 50)):
policy1 = learn_with_mdp_model(env, method=value_iteration, num_episodes=i+1)
policy2 = learn_with_mdp_model(env, method=policy_iteration, num_episodes=i+1)
episode_reward1 = render_single(env, policy1)
episode_reward2 = render_single(env, policy2)
score1.append(episode_reward1)
score2.append(episode_reward2)
for i in range(100):
average_score1[i] = np.mean(score1[:i+1])
average_score2[i] = np.mean(score2[:i+1])
plt.plot(np.arange(1, 5000, 50),np.array(average_score1))
plt.plot(np.arange(1, 5000, 50),np.array(average_score2))
plt.title('The running average score of the model-based learning agent')
plt.xlabel('traning episodes')
plt.ylabel('score')
plt.legend(['value-iteration', 'policy_iteration'], loc='upper right')
#plt.show()
plt.savefig('model-based.jpg')
if __name__ == '__main__':
main()
================================================
FILE: assignment1/model_free_learning.py
================================================
### Episode model free learning using Q-learning and SARSA
# Do not change the arguments and output types of any of the functions provided! You may debug in Main and elsewhere.
import numpy as np
import gym
import time
from lake_envs import *
import matplotlib.pyplot as plt
from tqdm import *
def learn_Q_QLearning(env, num_episodes=2000, gamma=0.95, lr=0.1, e=0.8, decay_rate=0.99):
"""Learn state-action values using the Q-learning algorithm with epsilon-greedy exploration strategy.
Update Q at the end of every episode.
Parameters
----------
env: gym.core.Environment
Environment to compute Q function for. Must have nS, nA, and P as
attributes.
num_episodes: int
Number of episodes of training.
gamma: float
Discount factor. Number in range [0, 1)
learning_rate: float
Learning rate. Number in range [0, 1)
e: float
Epsilon value used in the epsilon-greedy method.
decay_rate: float
Rate at which epsilon falls. Number in range [0, 1)
Returns
-------
np.array
An array of shape [env.nS x env.nA] representing state, action values
"""
############################
# YOUR IMPLEMENTATION HERE #
q_value = np.zeros([env.nS, env.nA])
for i in range(num_episodes):
done = False
state = env.reset()
while not done:
if np.random.rand() > e:
action = np.argmax(q_value[state])
else:
action = np.random.randint(env.nA)
nextstate, reward, done, _ = env.step(action)
q_value[state][action] = (1-lr)*q_value[state][action]+lr*(reward+gamma*np.max(q_value[nextstate]))
state = nextstate
if i%10 == 0:
e *= decay_rate
'''
print np.mean(q_value)
plt.plot(np.arange(num_episodes),np.array(score))
plt.title('The running average score of the Q-learning agent')
plt.xlabel('traning episodes')
plt.ylabel('score')
#plt.show()
plt.savefig('c.jpg')
'''
############################
return q_value
def learn_Q_SARSA(env, num_episodes=2000, gamma=0.95, lr=0.1, e=0.8, decay_rate=0.99):
"""Learn state-action values using the SARSA algorithm with epsilon-greedy exploration strategy
Update Q at the end of every episode.
Parameters
----------
env: gym.core.Environment
Environment to compute Q function for. Must have nS, nA, and P as
attributes.
num_episodes: int
Number of episodes of training.
gamma: float
Discount factor. Number in range [0, 1)
learning_rate: float
Learning rate. Number in range [0, 1)
e: float
Epsilon value used in the epsilon-greedy method.
decay_rate: float
Rate at which epsilon falls. Number in range [0, 1)
Returns
-------
np.array
An array of shape [env.nS x env.nA] representing state-action values
"""
############################
# YOUR IMPLEMENTATION HERE #
q_value = np.zeros([env.nS, env.nA])
for i in range(num_episodes):
done = False
state = env.reset()
if np.random.rand() > e:
action = np.argmax(q_value[state])
else:
action = np.random.randint(env.nA)
while not done:
nextstate, reward, done, _ = env.step(action)
if np.random.rand() > e:
nextaction = np.argmax(q_value[nextstate])
else:
nextaction = np.random.randint(env.nA)
q_value[state][action] = (1-lr)*q_value[state][action]+lr*(reward+gamma*q_value[nextstate][nextaction])
state = nextstate
action = nextaction
if i%10 == 0:
e *= decay_rate
############################
return q_value
def render_single_Q(env, Q):
"""Renders Q function once on environment. Watch your agent play!
Parameters
----------
env: gym.core.Environment
Environment to play Q function on. Must have nS, nA, and P as
attributes.
Q: np.array of shape [env.nS x env.nA]
state-action values.
"""
episode_reward = 0
state = env.reset()
done = False
while not done:
#env.render() #show frames
#time.sleep(0.5) # Seconds between frames. Modify as you wish.
action = np.argmax(Q[state])
state, reward, done, _ = env.step(action)
episode_reward += reward
#print "Episode reward: %f" % episode_reward
return episode_reward
# Feel free to run your own debug code in main!
def main():
env = gym.make('Stochastic-4x4-FrozenLake-v0')
score1 = []
score2 = []
average_score1 = []
average_score2 = []
for i in tqdm(range(4000)):
Q1 = learn_Q_QLearning(env, num_episodes=i+1)
Q2 = learn_Q_SARSA(env, num_episodes=i+1)
episode_reward1 = render_single_Q(env, Q1)
episode_reward2 = render_single_Q(env, Q2)
score1.append(episode_reward1)
score2.append(episode_reward2)
for i in range(4000):
average_score1.append(np.mean(score1[:i+1]))
average_score2.append(np.mean(score2[:i+1]))
plt.plot(np.arange(4000),np.array(average_score1))
plt.plot(np.arange(4000),np.array(average_score2))
plt.title('The running average score of the Q-learning agent')
plt.xlabel('traning episodes')
plt.ylabel('score')
plt.legend(['q-learning', 'sarsa'], loc='upper right')
#plt.show()
plt.savefig('model-free.jpg')
if __name__ == '__main__':
main()
================================================
FILE: assignment1/requirements.txt
================================================
matplotlib
numpy
================================================
FILE: assignment1/vi_and_pi.py
================================================
### MDP Value Iteration and Policy Iteratoin
# You might not need to use all parameters
import numpy as np
import gym
import time
from lake_envs import *
np.set_printoptions(precision=3)
def value_iteration(P, nS, nA, gamma=0.9, max_iteration=20, tol=1e-3):
"""
Learn value function and policy by using value iteration method for a given
gamma and environment.
Parameters:
----------
P: dictionary
It is from gym.core.Environment
P[state][action] is tuples with (probability, nextstate, reward, terminal)
nS: int
number of states
nA: int
number of actions
gamma: float
Discount factor. Number in range [0, 1)
max_iteration: int
The maximum number of iterations to run before stopping. Feel free to change it.
tol: float
Determines when value function has converged.
Returns:
----------
value function: np.ndarray
policy: np.ndarray
"""
V = np.zeros(nS)
policy = np.zeros(nS, dtype=int)
############################
# YOUR IMPLEMENTATION HERE #
idx = 1
new_V = V.copy()
#print P[14][2]
while idx<=max_iteration or np.sum(np.sqrt(np.square(new_V-V)))>tol:
idx += 1
V = new_V
for state in range(nS):
max_result = -10
max_idx = 0
for action in range(nA):
result = P[state][action]
temp = np.array(result)[:,2].mean()
#temp = result[0][2]
for num in range(len(result)):
(probability, nextstate, reward, terminal) = result[num]
temp += gamma*probability*V[nextstate]
if max_result < temp:
max_result = temp
max_idx = action
new_V[state] = max_result
policy[state] = max_idx
#print new_V
#print policy
############################
return V, policy
def policy_evaluation(P, nS, nA, policy, gamma=0.9, max_iteration=100, tol=1e-3):
"""Evaluate the value function from a given policy.
Parameters
----------
P: dictionary
It is from gym.core.Environment
P[state][action] is tuples with (probability, nextstate, reward, terminal)
nS: int
number of states
nA: int
number of actions
gamma: float
Discount factor. Number in range [0, 1)
policy: np.array
The policy to evaluate. Maps states to actions.
max_iteration: int
The maximum number of iterations to run before stopping. Feel free to change it.
tol: float
Determines when value function has converged.
Returns
-------
value function: np.ndarray
The value function from the given policy.
"""
############################
# YOUR IMPLEMENTATION HERE #
value_function = np.zeros(nS)
new_value_function = value_function.copy()
i = 0
while i<=max_iteration or np.sum(np.sqrt(np.square(new_value_function-value_function)))>tol:
i += 1
value_function = new_value_function.copy()
for state in range(nS):
result = P[state][policy[state]]
new_value_function[state] = np.array(result)[:,2].mean()
for num in range(len(result)):
(probability, nextstate, reward, terminal) = result[num]
new_value_function[state] += (gamma * probability * value_function[nextstate])
############################
return new_value_function
def policy_improvement(P, nS, nA, value_from_policy, policy, gamma=0.9):
"""Given the value function from policy improve the policy.
Parameters
----------
P: dictionary
It is from gym.core.Environment
P[state][action] is tuples with (probability, nextstate, reward, terminal)
nS: int
number of states
nA: int
number of actions
gamma: float
Discount factor. Number in range [0, 1)
value_from_policy: np.ndarray
The value calculated from the policy
policy: np.array
The previous policy.
Returns
-------
new policy: np.ndarray
An array of integers. Each integer is the optimal action to take
in that state according to the environment dynamics and the
given value function.
"""
############################
# YOUR IMPLEMENTATION HERE #
q_function = np.zeros([nS,nA])
for state in range(nS):
for action in range(nA):
result = P[state][action]
for num in range(len(result)):
(probability, nextstate, reward, terminal) = result[num]
q_function[state][action] = reward
q_function[state][action] += (gamma*probability*value_from_policy[nextstate])
new_policy = np.argmax(q_function, axis=1)
############################
return new_policy
def policy_iteration(P, nS, nA, gamma=0.9, max_iteration=200, tol=1e-3):
"""Runs policy iteration.
You should use the policy_evaluation and policy_improvement methods to
implement this method.
Parameters
----------
P: dictionary
It is from gym.core.Environment
P[state][action] is tuples with (probability, nextstate, reward, terminal)
nS: int
number of states
nA: int
number of actions
gamma: float
Discount factor. Number in range [0, 1)
max_iteration: int
The maximum number of iterations to run before stopping. Feel free to change it.
tol: float
Determines when value function has converged.
Returns:
----------
value function: np.ndarray
policy: np.ndarray
"""
V = np.zeros(nS)
policy = np.zeros(nS, dtype=int)
############################
# YOUR IMPLEMENTATION HERE #
i = 0
new_policy= policy.copy()
while i<=max_iteration or np.sum(np.sqrt(np.square(new_policy-policy)))>tol:
i += 1
policy = new_policy
V = policy_evaluation(P, nS, nA, policy)
new_policy = policy_improvement(P, nS, nA, V, policy)
############################
return V, policy
def example(env):
"""Show an example of gym
Parameters
----------
env: gym.core.Environment
Environment to play on. Must have nS, nA, and P as
attributes.
"""
env.seed(0);
from gym.spaces import prng; prng.seed(10) # for print the location
# Generate the episode
ob = env.reset()
for t in range(100):
env.render()
a = env.action_space.sample()
ob, rew, done, _ = env.step(a)
if done:
break
assert done
env.render();
def render_single(env, policy):
"""Renders policy once on environment. Watch your agent play!
Parameters
----------
env: gym.core.Environment
Environment to play on. Must have nS, nA, and P as
attributes.
Policy: np.array of shape [env.nS]
The action to take at a given state
"""
episode_reward = 0
ob = env.reset()
for t in range(100):
env.render()
#time.sleep(0.5) # Seconds between frames. Modify as you wish.
a = policy[ob]
ob, rew, done, _ = env.step(a)
episode_reward += rew
if done:
break
assert done
env.render();
print "Episode reward: %f" % episode_reward
# Feel free to run your own debug code in main!
# Play around with these hyperparameters.
if __name__ == "__main__":
env = gym.make("Stochastic-4x4-FrozenLake-v0")
print env.__doc__
#print "Here is an example of state, action, reward, and next state"
#example(env)
V_vi, p_vi = value_iteration(env.P, env.nS, env.nA, gamma=0.9, max_iteration=20, tol=1e-3)
#V_pi, p_pi = policy_iteration(env.P, env.nS, env.nA, gamma=0.9, max_iteration=20, tol=1e-3)
render_single(env, p_vi)
================================================
FILE: assignment2/.gitignore
================================================
/results
================================================
FILE: assignment2/Makefile
================================================
submit:
sh collect_submission.sh
clean:
rm -f assignment1.zip
rm -f *.pyc *.png *.npy utils/*.pyc
================================================
FILE: assignment2/README.md
================================================
# RL with Atari
## Install
First, install gym and atari environments. You may need to install other dependencies depending on your system.
```
pip install gym
```
and then install atari with one of the following commands
```
pip install "gym[atari]"
pip install gym[atari]
```
We also require you to use a version greater than 1 for Tensorflow.
## Environment
### Pong-v0
- We play against a decent AI player.
- One player wins if the ball pass through the other player and gets reward +1 else -1.
- Episode is over when one of the player reaches 21 wins
- final score is between -21 or +21 (lost all or won all)
```python
# action = int in [0, 6)
# state = (210, 160, 3) array
# reward = 0 during the game, 1 if we win, -1 else
```
We use a modified env where the dimension of the input is reduced to
```python
# state = (80, 80, 1)
```
with downsampling and greyscale.
## Training
Once done with implementing `q2_linear.py` (setup of the tensorflow necessary op) and `q3_nature` make sure you test your implementation by launching `python q2_linear.py` and `python q3_nature.py` that will run your code on the Test environment.
You can launch the training of DeepMind's DQN on pong with
```
python q5_train_atari_nature.py
```
The default config file should be sufficient to reach good performance after 5 million steps.
You can monitor your training with Tensorboard by doing, on Azure
```
tensorboard --logdir=results
```
and then connect to `ip-of-you-machine:6006`
**Credits**
Assignment code written by Guillaume Genthial and Shuhui Qu.
================================================
FILE: assignment2/collect_submission.sh
================================================
rm -f assignment2.zip
zip -r assignment2.zip . -x "*.pyc" "*.git*" "*weights/*" "*README.md" "*collect_submission.sh" "*events.out*" "*/monitor/*"
================================================
FILE: assignment2/configs/__init__.py
================================================
================================================
FILE: assignment2/configs/frozen_lake.py
================================================
class config():
# env config
render_train = False
render_test = False
env_name = "Pong-v0"
RGB = True
overwrite_render = True
# output config
output_path = "results/test/"
model_output = output_path + "model.weights/"
log_path = output_path + "log.txt"
plot_output = output_path + "scores.png"
training_path = "results/train/"
# model and training config
num_episodes_test = 20
grad_clip = True
clip_val = 10
saving_freq = 500
log_freq = 50
eval_freq = 50000
soft_epsilon = 0.05
# nature paper hyper params
nsteps_train = 2000*200
batch_size = 32
buffer_size = 50000
target_update_freq = 5000
gamma = 0.99
learning_freq = 1
state_history = 1
skip_frame = 1
lr = 0.1
eps_begin = 0.1
eps_end = 0.01
eps_nsteps = nsteps_train
learning_start = 5000
================================================
FILE: assignment2/configs/q2_linear.py
================================================
class config():
# env config
render_train = False
render_test = False
overwrite_render = True
record = False
high = 255.
# output config
output_path = "results/q2_linear/"
model_output = output_path + "model.weights/"
log_path = output_path + "log.txt"
plot_output = output_path + "scores.png"
# model and training config
num_episodes_test = 20
grad_clip = True
clip_val = 10
saving_freq = 5000
log_freq = 50
eval_freq = 1000
soft_epsilon = 0
# hyper params
nsteps_train = 10000
batch_size = 32
buffer_size = 1000
target_update_freq = 500
gamma = 0.99
learning_freq = 4
state_history = 4
lr_begin = 0.005
lr_end = 0.001
lr_nsteps = nsteps_train/2
eps_begin = 1
eps_end = 0.01
eps_nsteps = nsteps_train/2
learning_start = 200
================================================
FILE: assignment2/configs/q3_nature.py
================================================
class config():
# env config
render_train = False
render_test = False
overwrite_render = True
record = False
high = 255.
# output config
output_path = "results/q3_nature/"
model_output = output_path + "model.weights/"
log_path = output_path + "log.txt"
plot_output = output_path + "scores.png"
# model and training config
num_episodes_test = 20
grad_clip = True
clip_val = 10
saving_freq = 5000
log_freq = 50
eval_freq = 100
soft_epsilon = 0
# hyper params
nsteps_train = 1000
batch_size = 32
buffer_size = 500
target_update_freq = 500
gamma = 0.99
learning_freq = 4
state_history = 4
lr_begin = 0.00025
lr_end = 0.0001
lr_nsteps = nsteps_train/2
eps_begin = 1
eps_end = 0.01
eps_nsteps = nsteps_train/2
learning_start = 200
================================================
FILE: assignment2/configs/q4_train_atari_linear.py
================================================
class config():
# env config
render_train = False
render_test = False
env_name = "Pong-v0"
overwrite_render = True
record = True
high = 255.
# output config
output_path = "results/q4_train_atari_linear/"
model_output = output_path + "model.weights/"
log_path = output_path + "log.txt"
plot_output = output_path + "scores.png"
record_path = output_path + "monitor/"
# model and training config
num_episodes_test = 50
grad_clip = True
clip_val = 10
saving_freq = 250000
log_freq = 50
eval_freq = 250000
record_freq = 250000
soft_epsilon = 0.05
# nature paper hyper params
nsteps_train = 5000000
batch_size = 32
buffer_size = 1000000
target_update_freq = 10000
gamma = 0.99
learning_freq = 4
state_history = 4
skip_frame = 4
lr_begin = 0.00025
lr_end = 0.00005
lr_nsteps = nsteps_train/2
eps_begin = 1
eps_end = 0.1
eps_nsteps = 1000000
learning_start = 50000
================================================
FILE: assignment2/configs/q5_train_atari_nature.py
================================================
class config():
# env config
render_train = False
render_test = False
env_name = "Pong-v0"
overwrite_render = True
record = True
high = 255.
# output config
output_path = "results/q5_train_atari_nature/"
model_output = output_path + "model.weights/"
log_path = output_path + "log.txt"
plot_output = output_path + "scores.png"
record_path = output_path + "monitor/"
# model and training config
num_episodes_test = 50
grad_clip = True
clip_val = 10
saving_freq = 250000
log_freq = 50
eval_freq = 250000
record_freq = 250000
soft_epsilon = 0.05
# nature paper hyper params
nsteps_train = 5000000
batch_size = 32
buffer_size = 1000000
target_update_freq = 10000
gamma = 0.99
learning_freq = 4
state_history = 4
skip_frame = 4
lr_begin = 0.00025
lr_end = 0.00005
lr_nsteps = nsteps_train/2
eps_begin = 1
eps_end = 0.1
eps_nsteps = 1000000
learning_start = 50000
================================================
FILE: assignment2/configs/q6_bonus_question.py
================================================
class config():
# env config
render_train = False
render_test = False
env_name = "Pong-v0"
overwrite_render = True
record = True
high = 255.
# output config
output_path = "results/q6_bonus_question/"
model_output = output_path + "model.weights/"
log_path = output_path + "log.txt"
plot_output = output_path + "scores.png"
record_path = output_path + "monitor/"
# model and training config
num_episodes_test = 50
grad_clip = True
clip_val = 10
saving_freq = 250000
log_freq = 50
eval_freq = 250000
record_freq = 250000
soft_epsilon = 0.05
# nature paper hyper params
nsteps_train = 10000000
batch_size = 32
buffer_size = 1000000
target_update_freq = 10000
gamma = 0.99
learning_freq = 4
state_history = 4
skip_frame = 4
lr_begin = 0.00025
lr_end = 0.00005
lr_nsteps = nsteps_train/2
eps_begin = 1
eps_end = 0.1
eps_nsteps = 1000000
learning_start = 50000
================================================
FILE: assignment2/configs/test.py
================================================
class config():
# env config
render_train = True
render_test = False
env_name = "Pong-v0"
overwrite_render = True
record = True
high = 255.
# output config
output_path = "results/test/"
model_output = output_path + "model.weights/"
log_path = output_path + "log.txt"
plot_output = output_path + "scores.png"
record_path = output_path + "video/"
# model and training config
num_episodes_test = 10
grad_clip = True
clip_val = 10
saving_freq = 1000
log_freq = 50
eval_freq = 1000
record_freq = 1000
soft_epsilon = 0.05
# nature paper hyper params
nsteps_train = 10000
batch_size = 32
buffer_size = 1000
target_update_freq = 1000
gamma = 0.99
learning_freq = 4
state_history = 4
skip_frame = 4
lr = 0.0001
eps_begin = 1
eps_end = 0.1
eps_nsteps = 1000
learning_start = 500
================================================
FILE: assignment2/core/__init__.py
================================================
================================================
FILE: assignment2/core/deep_q_learning.py
================================================
import os
import numpy as np
import tensorflow as tf
import time
from q_learning import QN
class DQN(QN):
"""
Abstract class for Deep Q Learning
"""
def add_placeholders_op(self):
raise NotImplementedError
def get_q_values_op(self, scope, reuse=False):
"""
set Q values, of shape = (batch_size, num_actions)
"""
raise NotImplementedError
def add_update_target_op(self, q_scope, target_q_scope):
"""
Update_target_op will be called periodically
to copy Q network to target Q network
Args:
q_scope: name of the scope of variables for q
target_q_scope: name of the scope of variables for the target
network
"""
raise NotImplementedError
def add_loss_op(self, q, target_q):
"""
Set (Q_target - Q)^2
"""
raise NotImplementedError
def add_optimizer_op(self, scope):
"""
Set training op wrt to loss for variable in scope
"""
raise NotImplementedError
def process_state(self, state):
"""
Processing of state
State placeholders are tf.uint8 for fast transfer to GPU
Need to cast it to float32 for the rest of the tf graph.
Args:
state: node of tf graph of shape = (batch_size, height, width, nchannels)
of type tf.uint8.
if , values are between 0 and 255 -> 0 and 1
"""
state = tf.cast(state, tf.float32)
state /= self.config.high
return state
def build(self):
"""
Build model by adding all necessary variables
"""
# add placeholders
self.add_placeholders_op()
# compute Q values of state
s = self.process_state(self.s)
self.q = self.get_q_values_op(s, scope="q", reuse=False)
# compute Q values of next state
sp = self.process_state(self.sp)
self.target_q = self.get_q_values_op(sp, scope="target_q", reuse=False)
# add update operator for target network
self.add_update_target_op("q", "target_q")
# add square loss
self.add_loss_op(self.q, self.target_q)
# add optmizer for the main networks
self.add_optimizer_op("q")
def initialize(self):
"""
Assumes the graph has been constructed
Creates a tf Session and run initializer of variables
"""
# create tf session
self.sess = tf.Session()
# tensorboard stuff
self.add_summary()
# initiliaze all variables
init = tf.global_variables_initializer()
self.sess.run(init)
# synchronise q and target_q networks
self.sess.run(self.update_target_op)
# for saving networks weights
self.saver = tf.train.Saver()
def add_summary(self):
"""
Tensorboard stuff
"""
# extra placeholders to log stuff from python
self.avg_reward_placeholder = tf.placeholder(tf.float32, shape=(), name="avg_reward")
self.max_reward_placeholder = tf.placeholder(tf.float32, shape=(), name="max_reward")
self.std_reward_placeholder = tf.placeholder(tf.float32, shape=(), name="std_reward")
self.avg_q_placeholder = tf.placeholder(tf.float32, shape=(), name="avg_q")
self.max_q_placeholder = tf.placeholder(tf.float32, shape=(), name="max_q")
self.std_q_placeholder = tf.placeholder(tf.float32, shape=(), name="std_q")
self.eval_reward_placeholder = tf.placeholder(tf.float32, shape=(), name="eval_reward")
# add placeholders from the graph
tf.summary.scalar("loss", self.loss)
tf.summary.scalar("grads norm", self.grad_norm)
# extra summaries from python -> placeholders
tf.summary.scalar("Avg Reward", self.avg_reward_placeholder)
tf.summary.scalar("Max Reward", self.max_reward_placeholder)
tf.summary.scalar("Std Reward", self.std_reward_placeholder)
tf.summary.scalar("Avg Q", self.avg_q_placeholder)
tf.summary.scalar("Max Q", self.max_q_placeholder)
tf.summary.scalar("Std Q", self.std_q_placeholder)
tf.summary.scalar("Eval Reward", self.eval_reward_placeholder)
# logging
self.merged = tf.summary.merge_all()
self.file_writer = tf.summary.FileWriter(self.config.output_path,
self.sess.graph)
def save(self):
"""
Saves session
"""
if not os.path.exists(self.config.model_output):
os.makedirs(self.config.model_output)
self.saver.save(self.sess, self.config.model_output)
def get_best_action(self, state):
"""
Return best action
Args:
state: 4 consecutive observations from gym
Returns:
action: (int)
action_values: (np array) q values for all actions
"""
action_values = self.sess.run(self.q, feed_dict={self.s: [state]})[0]
return np.argmax(action_values), action_values
def update_step(self, t, replay_buffer, lr):
"""
Performs an update of parameters by sampling from replay_buffer
Args:
t: number of iteration (episode and move)
replay_buffer: ReplayBuffer instance .sample() gives batches
lr: (float) learning rate
Returns:
loss: (Q - Q_target)^2
"""
s_batch, a_batch, r_batch, sp_batch, done_mask_batch = replay_buffer.sample(
self.config.batch_size)
fd = {
# inputs
self.s: s_batch,
self.a: a_batch,
self.r: r_batch,
self.sp: sp_batch,
self.done_mask: done_mask_batch,
self.lr: lr,
# extra info
self.avg_reward_placeholder: self.avg_reward,
self.max_reward_placeholder: self.max_reward,
self.std_reward_placeholder: self.std_reward,
self.avg_q_placeholder: self.avg_q,
self.max_q_placeholder: self.max_q,
self.std_q_placeholder: self.std_q,
self.eval_reward_placeholder: self.eval_reward,
}
loss_eval, grad_norm_eval, summary, _ = self.sess.run([self.loss, self.grad_norm,
self.merged, self.train_op], feed_dict=fd)
# tensorboard stuff
self.file_writer.add_summary(summary, t)
return loss_eval, grad_norm_eval
def update_target_params(self):
"""
Update parametes of Q' with parameters of Q
"""
self.sess.run(self.update_target_op)
================================================
FILE: assignment2/core/q_learning.py
================================================
import os
import gym
import numpy as np
import logging
import time
import sys
from gym import wrappers
from collections import deque
from utils.general import get_logger, Progbar, export_plot
from utils.replay_buffer import ReplayBuffer
from utils.preprocess import greyscale
from utils.wrappers import PreproWrapper, MaxAndSkipEnv
class QN(object):
"""
Abstract Class for implementing a Q Network
"""
def __init__(self, env, config, logger=None):
"""
Initialize Q Network and env
Args:
config: class with hyperparameters
logger: logger instance from logging module
"""
# directory for training outputs
if not os.path.exists(config.output_path):
os.makedirs(config.output_path)
# store hyper params
self.config = config
self.logger = logger
if logger is None:
self.logger = get_logger(config.log_path)
self.env = env
# build model
self.build()
def build(self):
"""
Build model
"""
pass
@property
def policy(self):
"""
model.policy(state) = action
"""
return lambda state: self.get_action(state)
def save(self):
"""
Save model parameters
Args:
model_path: (string) directory
"""
pass
def initialize(self):
"""
Initialize variables if necessary
"""
pass
def get_best_action(self, state):
"""
Returns best action according to the network
Args:
state: observation from gym
Returns:
tuple: action, q values
"""
raise NotImplementedError
def get_action(self, state):
"""
Returns action with some epsilon strategy
Args:
state: observation from gym
"""
if np.random.random() < self.config.soft_epsilon:
return self.env.action_space.sample()
else:
return self.get_best_action(state)[0]
def update_target_params(self):
"""
Update params of Q' with params of Q
"""
raise NotImplementedError
def init_averages(self):
"""
Defines extra attributes for tensorboard
"""
self.avg_reward = -21.
self.max_reward = -21.
self.std_reward = 0
self.avg_q = 0
self.max_q = 0
self.std_q = 0
self.eval_reward = -21.
def update_averages(self, rewards, max_q_values, q_values, scores_eval):
"""
Update the averages
Args:
rewards: deque
max_q_values: deque
q_values: deque
scores_eval: list
"""
self.avg_reward = np.mean(rewards)
self.max_reward = np.max(rewards)
self.std_reward = np.sqrt(np.var(rewards) / len(rewards))
self.max_q = np.mean(max_q_values)
self.avg_q = np.mean(q_values)
self.std_q = np.sqrt(np.var(q_values) / len(q_values))
if len(scores_eval) > 0:
self.eval_reward = scores_eval[-1]
def train(self, exp_schedule, lr_schedule):
"""
Performs training of Q
Args:
exp_schedule: Exploration instance s.t.
exp_schedule.get_action(best_action) returns an action
lr_schedule: Schedule for learning rate
"""
# initialize replay buffer and variables
replay_buffer = ReplayBuffer(self.config.buffer_size, self.config.state_history)
rewards = deque(maxlen=self.config.num_episodes_test)
max_q_values = deque(maxlen=1000)
q_values = deque(maxlen=1000)
self.init_averages()
t = last_eval = last_record = 0 # time control of nb of steps
scores_eval = [] # list of scores computed at iteration time
scores_eval += [self.evaluate()]
prog = Progbar(target=self.config.nsteps_train)
# interact with environment
while t < self.config.nsteps_train:
total_reward = 0
state = self.env.reset()
while True:
t += 1
last_eval += 1
last_record += 1
if self.config.render_train: self.env.render()
# replay memory stuff
idx = replay_buffer.store_frame(state)
q_input = replay_buffer.encode_recent_observation()
# chose action according to current Q and exploration
best_action, q_values = self.get_best_action(q_input)
action = exp_schedule.get_action(best_action)
# store q values
max_q_values.append(max(q_values))
q_values += list(q_values)
# perform action in env
new_state, reward, done, info = self.env.step(action)
# store the transition
replay_buffer.store_effect(idx, action, reward, done)
state = new_state
# perform a training step
loss_eval, grad_eval = self.train_step(t, replay_buffer, lr_schedule.epsilon)
# logging stuff
if ((t > self.config.learning_start) and (t % self.config.log_freq == 0) and
(t % self.config.learning_freq == 0)):
self.update_averages(rewards, max_q_values, q_values, scores_eval)
exp_schedule.update(t)
lr_schedule.update(t)
if len(rewards) > 0:
prog.update(t + 1, exact=[("Loss", loss_eval), ("Avg R", self.avg_reward),
("Max R", np.max(rewards)), ("eps", exp_schedule.epsilon),
("Grads", grad_eval), ("Max Q", self.max_q),
("lr", lr_schedule.epsilon)])
elif (t < self.config.learning_start) and (t % self.config.log_freq == 0):
sys.stdout.write("\rPopulating the memory {}/{}...".format(t,
self.config.learning_start))
sys.stdout.flush()
# count reward
total_reward += reward
if done or t >= self.config.nsteps_train:
break
# updates to perform at the end of an episode
rewards.append(total_reward)
if (t > self.config.learning_start) and (last_eval > self.config.eval_freq):
# evaluate our policy
last_eval = 0
print("")
scores_eval += [self.evaluate()]
if (t > self.config.learning_start) and self.config.record and (last_record > self.config.record_freq):
self.logger.info("Recording...")
last_record =0
self.record()
# last words
self.logger.info("- Training done.")
self.save()
scores_eval += [self.evaluate()]
export_plot(scores_eval, "Scores", self.config.plot_output)
def train_step(self, t, replay_buffer, lr):
"""
Perform training step
Args:
t: (int) nths step
replay_buffer: buffer for sampling
lr: (float) learning rate
"""
loss_eval, grad_eval = 0, 0
# perform training step
if (t > self.config.learning_start and t % self.config.learning_freq == 0):
loss_eval, grad_eval = self.update_step(t, replay_buffer, lr)
# occasionaly update target network with q network
if t % self.config.target_update_freq == 0:
self.update_target_params()
# occasionaly save the weights
if (t % self.config.saving_freq == 0):
self.save()
return loss_eval, grad_eval
def evaluate(self, env=None, num_episodes=None):
"""
Evaluation with same procedure as the training
"""
# log our activity only if default call
if num_episodes is None:
self.logger.info("Evaluating...")
# arguments defaults
if num_episodes is None:
num_episodes = self.config.num_episodes_test
if env is None:
env = self.env
# replay memory to play
replay_buffer = ReplayBuffer(self.config.buffer_size, self.config.state_history)
rewards = []
for i in range(num_episodes):
total_reward = 0
state = env.reset()
while True:
if self.config.render_test: env.render()
# store last state in buffer
idx = replay_buffer.store_frame(state)
q_input = replay_buffer.encode_recent_observation()
action = self.get_action(q_input)
# perform action in env
new_state, reward, done, info = env.step(action)
# store in replay memory
replay_buffer.store_effect(idx, action, reward, done)
state = new_state
# count reward
total_reward += reward
if done:
break
# updates to perform at the end of an episode
rewards.append(total_reward)
avg_reward = np.mean(rewards)
sigma_reward = np.sqrt(np.var(rewards) / len(rewards))
if num_episodes > 1:
msg = "Average reward: {:04.2f} +/- {:04.2f}".format(avg_reward, sigma_reward)
self.logger.info(msg)
return avg_reward
def record(self):
"""
Re create an env and record a video for one episode
"""
env = gym.make(self.config.env_name)
env = gym.wrappers.Monitor(env, self.config.record_path, video_callable=lambda x: True, resume=True)
env = MaxAndSkipEnv(env, skip=self.config.skip_frame)
env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1),
overwrite_render=self.config.overwrite_render)
self.evaluate(env, 1)
def run(self, exp_schedule, lr_schedule):
"""
Apply procedures of training for a QN
Args:
exp_schedule: exploration strategy for epsilon
lr_schedule: schedule for learning rate
"""
# initialize
self.initialize()
# record one game at the beginning
if self.config.record:
self.record()
# model
self.train(exp_schedule, lr_schedule)
# record one game at the end
if self.config.record:
self.record()
================================================
FILE: assignment2/q1_schedule.py
================================================
import numpy as np
from utils.test_env import EnvTest
class LinearSchedule(object):
def __init__(self, eps_begin, eps_end, nsteps):
"""
Args:
eps_begin: initial exploration
eps_end: end exploration
nsteps: number of steps between the two values of eps
"""
self.epsilon = eps_begin
self.eps_begin = eps_begin
self.eps_end = eps_end
self.nsteps = nsteps
def update(self, t):
"""
Updates epsilon
Args:
t: (int) nth frames
"""
##############################################################
"""
TODO: modify self.epsilon such that
for t = 0, self.epsilon = self.eps_begin
for t = self.nsteps, self.epsilon = self.eps_end
linear decay between the two
self.epsilon should never go under self.eps_end
"""
##############################################################
################ YOUR CODE HERE - 3-4 lines ##################
value = np.linspace(self.eps_end, self.eps_begin, self.nsteps+1)
#if t > self.nsteps:
# self.epsilon = self.eps_end
#else:
# self.epsilon = value[t]
self.epsilon = value[t] if t <= self.nsteps else self.eps_end
##############################################################
######################## END YOUR CODE ############## ########
class LinearExploration(LinearSchedule):
def __init__(self, env, eps_begin, eps_end, nsteps):
"""
Args:
env: gym environment
eps_begin: initial exploration
eps_end: end exploration
nsteps: number of steps between the two values of eps
"""
self.env = env
super(LinearExploration, self).__init__(eps_begin, eps_end, nsteps)
def get_action(self, best_action):
"""
Returns a random action with prob epsilon, otherwise return the best_action
Args:
best_action: (int) best action according some policy
Returns:
an action
"""
##############################################################
"""
TODO: with probability self.epsilon, return a random action
else, return best_action
you can access the environment stored in self.env
and epsilon with self.epsilon
"""
##############################################################
################ YOUR CODE HERE - 4-5 lines ##################
temp = np.random.rand()
if temp < self.epsilon:
best_action = np.random.randint(self.env.action_space.n)
return best_action
##############################################################
######################## END YOUR CODE ############## ########
def test1():
env = EnvTest((5, 5, 1))
exp_strat = LinearExploration(env, 1, 0, 10)
found_diff = False
for i in range(10):
rnd_act = exp_strat.get_action(0)
if rnd_act != 0 and rnd_act is not None:
found_diff = True
assert found_diff, "Test 1 failed."
print("Test1: ok")
def test2():
env = EnvTest((5, 5, 1))
exp_strat = LinearExploration(env, 1, 0, 10)
exp_strat.update(5)
assert exp_strat.epsilon == 0.5, "Test 2 failed"
print("Test2: ok")
def test3():
env = EnvTest((5, 5, 1))
exp_strat = LinearExploration(env, 1, 0.5, 10)
exp_strat.update(20)
assert exp_strat.epsilon == 0.5, "Test 3 failed"
print("Test3: ok")
def your_test():
"""
Use this to implement your own tests
"""
pass
if __name__ == "__main__":
test1()
test2()
test3()
your_test()
================================================
FILE: assignment2/q2_linear.py
================================================
import tensorflow as tf
import tensorflow.contrib.layers as layers
from utils.general import get_logger
from utils.test_env import EnvTest
from core.deep_q_learning import DQN
from q1_schedule import LinearExploration, LinearSchedule
from configs.q2_linear import config
class Linear(DQN):
"""
Implement Fully Connected with Tensorflow
"""
def add_placeholders_op(self):
"""
Adds placeholders to the graph
These placeholders are used as inputs by the rest of the model building and will be fed
data during training. Note that when "None" is in a placeholder's shape, it's flexible
(so we can use different batch sizes without rebuilding the model
"""
# this information might be useful
# here, typically, a state shape is (80, 80, 1)
state_shape = list(self.env.observation_space.shape)
##############################################################
"""
TODO: add placeholders:
Remember that we stack 4 consecutive frames together, ending up with an input of shape
(80, 80, 4).
- self.s: batch of states, type = uint8
shape = (batch_size, img height, img width, nchannels x config.state_history)
- self.a: batch of actions, type = int32
shape = (batch_size)
- self.r: batch of rewards, type = float32
shape = (batch_size)
- self.sp: batch of next states, type = uint8
shape = (batch_size, img height, img width, nchannels x config.state_history)
- self.done_mask: batch of done, type = bool
shape = (batch_size)
note that this placeholder contains bool = True only if we are done in
the relevant transition
- self.lr: learning rate, type = float32
(Don't change the variable names!)
HINT: variables from config are accessible with self.config.variable_name
Also, you may want to use a dynamic dimension for the batch dimension.
Check the use of None for tensorflow placeholders.
you can also use the state_shape computed above.
"""
##############################################################
################YOUR CODE HERE (6-15 lines) ##################
img_height, img_width, nchannels = state_shape[0], state_shape[1], state_shape[2]
self.s = tf.placeholder(dtype=tf.uint8, shape=[None, img_height, img_width, nchannels*self.config.state_history],
name='state')
self.a = tf.placeholder(dtype=tf.int32, shape=[None], name='action')
self.r = tf.placeholder(dtype=tf.float32, shape=[None], name='reward')
self.sp = tf.placeholder(dtype=tf.uint8, shape=[None, img_height, img_width, nchannels*self.config.state_history],
name='next_state')
self.done_mask = tf.placeholder(dtype=tf.bool, shape=[None], name='done_mask')
self.lr = tf.placeholder(dtype=tf.float32, shape=(), name='lr')
##############################################################
######################## END YOUR CODE #######################
def get_q_values_op(self, state, scope, reuse=False):
"""
Returns Q values for all actions
Args:
state: (tf tensor)
shape = (batch_size, img height, img width, nchannels)
scope: (string) scope name, that specifies if target network or not
reuse: (bool) reuse of variables in the scope
Returns:
out: (tf tensor) of shape = (batch_size, num_actions)
"""
# this information might be useful
num_actions = self.env.action_space.n
out = state
##############################################################
"""
TODO: implement a fully connected with no hidden layer (linear
approximation) using tensorflow. In other words, if your state s
has a flattened shape of n, and you have m actions, the result of
your computation sould be equal to
W s where W is a matrix of shape m x n
HINT: you may find tensorflow.contrib.layers useful (imported)
make sure to understand the use of the scope param
you can use any other methods from tensorflow
you are not allowed to import extra packages (like keras,
lasagne, cafe, etc.)
"""
##############################################################
################ YOUR CODE HERE - 2-3 lines ##################
state_flatten = layers.flatten(state, scope=scope)
out = layers.fully_connected(state_flatten, num_actions, reuse=reuse,
scope=scope, activation_fn=None)
##############################################################
######################## END YOUR CODE #######################
return out
def add_update_target_op(self, q_scope, target_q_scope):
"""
update_target_op will be called periodically
to copy Q network weights to target Q network
Remember that in DQN, we maintain two identical Q networks with
2 different set of weights. In tensorflow, we distinguish them
with two different scopes. One for the target network, one for the
regular network. If you're not familiar with the scope mechanism
in tensorflow, read the docs
https://www.tensorflow.org/programmers_guide/variable_scope
Periodically, we need to update all the weights of the Q network
and assign them with the values from the regular network. Thus,
what we need to do is to build a tf op, that, when called, will
assign all variables in the target network scope with the values of
the corresponding variables of the regular network scope.
Args:
q_scope: (string) name of the scope of variables for q
target_q_scope: (string) name of the scope of variables
for the target network
"""
##############################################################
"""
TODO: add an operator self.update_target_op that assigns variables
from target_q_scope with the values of the corresponding var
in q_scope
HINT: you may find the following functions useful:
- tf.get_collection #list
- tf.assign #return tensor
- tf.group
(be sure that you set self.update_target_op)
"""
##############################################################
################### YOUR CODE HERE - 5-10 lines #############
q_collection = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=q_scope)
target_q_collection = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=target_q_scope)
op = [tf.assign(target_q_collection[i], q_collection[i]) for i in range(len(q_collection))]
self.update_target_op = tf.group(*op)
##############################################################
######################## END YOUR CODE #######################
def add_loss_op(self, q, target_q):
"""
Sets the loss of a batch, self.loss is a scalar
Args:
q: (tf tensor) shape = (batch_size, num_actions)
target_q: (tf tensor) shape = (batch_size, num_actions)
"""
# you may need this variable
num_actions = self.env.action_space.n
##############################################################
"""
TODO: The loss for an example is defined as:
Q_samp(s) = r if done
= r + gamma * max_a' Q_target(s', a')
loss = (Q_samp(s) - Q(s, a))^2
You need to compute the average of the loss over the minibatch
and store the resulting scalar into self.loss
HINT: - config variables are accessible through self.config
- you can access placeholders like self.a (for actions)
self.r (rewards) or self.done_mask for instance
- you may find the following functions useful
- tf.cast
- tf.reduce_max / reduce_sum
- tf.one_hot
- ...
(be sure that you set self.loss)
"""
##############################################################
##################### YOUR CODE HERE - 4-5 lines #############
#done = tf.cast(self.done_mask, tf.float32)
temp = self.r + self.config.gamma*tf.reduce_max(target_q, axis=1)
q_samp = tf.where(self.done_mask, self.r, temp)
action = tf.one_hot(self.a, num_actions)
q_new = tf.reduce_sum(tf.multiply(action,q), axis=1)
self.loss = tf.reduce_mean(tf.square(q_new - q_samp))
##############################################################
######################## END YOUR CODE #######################
def add_optimizer_op(self, scope):
"""
Set self.train_op and self.grad_norm
"""
##############################################################
"""
TODO: 1. get Adam Optimizer (remember that we defined self.lr in the placeholders
section)
2. compute grads wrt to variables in scope for self.loss
3. clip the grads by norm with self.config.clip_val if self.config.grad_clip
is True
4. apply the gradients and store the train op in self.train_op
(sess.run(train_op) must update the variables)
5. compute the global norm of the gradients and store this scalar
in self.grad_norm
HINT: you may find the following functinos useful
- tf.get_collection
- optimizer.compute_gradients
- tf.clip_by_norm
- optimizer.apply_gradients
- tf.global_norm
you can access config variable by writing self.config.variable_name
(be sure that you set self.train_op and self.grad_norm)
"""
##############################################################
#################### YOUR CODE HERE - 8-12 lines #############
optimizer = tf.train.AdamOptimizer(learning_rate=self.lr)
scope_variable = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope)
grads_and_vars = optimizer.compute_gradients(self.loss, scope_variable)
if self.config.grad_clip:
clipped_grads_and_vars = [(tf.clip_by_norm(item[0],self.config.clip_val),item[1]) for item in grads_and_vars]
self.train_op = optimizer.apply_gradients(clipped_grads_and_vars)
self.grad_norm = tf.global_norm([item[0] for item in grads_and_vars])
##############################################################
######################## END YOUR CODE #######################
if __name__ == '__main__':
env = EnvTest((5, 5, 1))
# exploration strategy
exp_schedule = LinearExploration(env, config.eps_begin,
config.eps_end, config.eps_nsteps)
# learning rate schedule
lr_schedule = LinearSchedule(config.lr_begin, config.lr_end,
config.lr_nsteps)
# train model
model = Linear(env, config)
model.run(exp_schedule, lr_schedule)
================================================
FILE: assignment2/q3_nature.py
================================================
import tensorflow as tf
import tensorflow.contrib.layers as layers
from utils.general import get_logger
from utils.test_env import EnvTest
from q1_schedule import LinearExploration, LinearSchedule
from q2_linear import Linear
from configs.q3_nature import config
class NatureQN(Linear):
"""
Implementing DeepMind's Nature paper. Here are the relevant urls.
https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
"""
def get_q_values_op(self, state, scope, reuse=False):
"""
Returns Q values for all actions
Args:
state: (tf tensor)
shape = (batch_size, img height, img width, nchannels)
scope: (string) scope name, that specifies if target network or not
reuse: (bool) reuse of variables in the scope
Returns:
out: (tf tensor) of shape = (batch_size, num_actions)
"""
# this information might be useful
num_actions = self.env.action_space.n
out = state
##############################################################
"""
TODO: implement the computation of Q values like in the paper
https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
you may find the section "model architecture" of the appendix of the
nature paper particulary useful.
store your result in out of shape = (batch_size, num_actions)
HINT: you may find tensorflow.contrib.layers useful (imported)
make sure to understand the use of the scope param
you can use any other methods from tensorflow
you are not allowed to import extra packages (like keras,
lasagne, cafe, etc.)
"""
##############################################################
################ YOUR CODE HERE - 10-15 lines ################
with tf.variable_scope(scope, reuse=reuse) as _:
out = layers.conv2d(out, num_outputs=32, kernel_size=8, stride=4)
out = layers.conv2d(out, num_outputs=64, kernel_size=4, stride=2)
out = layers.conv2d(out, num_outputs=64, kernel_size=3, stride=1)
out = layers.flatten(out)
out = layers.fully_connected(out, num_outputs=512)
out = layers.fully_connected(out, num_outputs=num_actions, activation_fn=None)
##############################################################
######################## END YOUR CODE #######################
return out
"""
Use deep Q network for test environment.
"""
if __name__ == '__main__':
env = EnvTest((80, 80, 1))
# exploration strategy
exp_schedule = LinearExploration(env, config.eps_begin,
config.eps_end, config.eps_nsteps)
# learning rate schedule
lr_schedule = LinearSchedule(config.lr_begin, config.lr_end,
config.lr_nsteps)
# train model
model = NatureQN(env, config)
model.run(exp_schedule, lr_schedule)
================================================
FILE: assignment2/q4_train_atari_linear.py
================================================
import gym
from utils.preprocess import greyscale
from utils.wrappers import PreproWrapper, MaxAndSkipEnv
from q1_schedule import LinearExploration, LinearSchedule
from q2_linear import Linear
from configs.q4_train_atari_linear import config
"""
Use linear approximation for the Atari game. Please report the final result.
Feel free to change the configurations (in the configs/ folder).
If so, please report your hyperparameters.
You'll find the results, log and video recordings of your agent every 250k under
the corresponding file in the results folder. A good way to monitor the progress
of the training is to use Tensorboard. The starter code writes summaries of different
variables.
To launch tensorboard, open a Terminal window and run
tensorboard --logdir=results/
Then, connect remotely to
address-ip-of-the-server:6006
6006 is the default port used by tensorboard.
"""
if __name__ == '__main__':
# make env
env = gym.make(config.env_name)
env = MaxAndSkipEnv(env, skip=config.skip_frame)
env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1),
overwrite_render=config.overwrite_render)
# exploration strategy
exp_schedule = LinearExploration(env, config.eps_begin,
config.eps_end, config.eps_nsteps)
# learning rate schedule
lr_schedule = LinearSchedule(config.lr_begin, config.lr_end,
config.lr_nsteps)
# train model
model = Linear(env, config)
model.run(exp_schedule, lr_schedule)
================================================
FILE: assignment2/q5_train_atari_nature.py
================================================
import gym
from utils.preprocess import greyscale
from utils.wrappers import PreproWrapper, MaxAndSkipEnv
from q1_schedule import LinearExploration, LinearSchedule
from q3_nature import NatureQN
from configs.q5_train_atari_nature import config
"""
Use deep Q network for the Atari game. Please report the final result.
Feel free to change the configurations (in the configs/ folder).
If so, please report your hyperparameters.
You'll find the results, log and video recordings of your agent every 250k under
the corresponding file in the results folder. A good way to monitor the progress
of the training is to use Tensorboard. The starter code writes summaries of different
variables.
To launch tensorboard, open a Terminal window and run
tensorboard --logdir=results/
Then, connect remotely to
address-ip-of-the-server:6006
6006 is the default port used by tensorboard.
"""
if __name__ == '__main__':
# make env
env = gym.make(config.env_name)
env = MaxAndSkipEnv(env, skip=config.skip_frame)
env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1),
overwrite_render=config.overwrite_render)
# exploration strategy
exp_schedule = LinearExploration(env, config.eps_begin,
config.eps_end, config.eps_nsteps)
# learning rate schedule
lr_schedule = LinearSchedule(config.lr_begin, config.lr_end,
config.lr_nsteps)
# train model
model = NatureQN(env, config)
model.run(exp_schedule, lr_schedule)
================================================
FILE: assignment2/q6_double_q_learning.py
================================================
import gym
from utils.preprocess import greyscale
from utils.wrappers import PreproWrapper, MaxAndSkipEnv
import tensorflow as tf
import tensorflow.contrib.layers as layers
from utils.general import get_logger
from utils.test_env import EnvTest
from q1_schedule import LinearExploration, LinearSchedule
from q2_linear import Linear
from q3_nature import NatureQN
from configs.q6_bonus_question import config
class MyDQN(NatureQN):
"""
Going beyond - implement your own Deep Q Network to find the perfect
balance between depth, complexity, number of parameters, etc.
You can change the way the q-values are computed, the exploration
strategy, or the learning rate schedule. You can also create your own
wrapper of environment and transform your input to something that you
think we'll help to solve the task. Ideally, your network would run faster
than DeepMind's and achieve similar performance!
You can also change the optimizer (by overriding the functions defined
in TFLinear), or even change the sampling strategy from the replay buffer.
If you prefer not to build on the current architecture, you're welcome to
write your own code.
You may also try more recent approaches, like double Q learning
(see https://arxiv.org/pdf/1509.06461.pdf) or dueling networks
(see https://arxiv.org/abs/1511.06581), but this would be for extra
extra bonus points.
"""
def add_loss_op(self, q, target_q):
"""
Sets the loss of a batch, self.loss is a scalar
Args:
q: (tf tensor) shape = (batch_size, num_actions)
target_q: (tf tensor) shape = (batch_size, num_actions)
"""
# you may need this variable
num_actions = self.env.action_space.n
##############################################################
"""
TODO: The loss for an example is defined as:
Q_samp(s) = r if done
= r + gamma * Q_target(s', max_a'Q(s',a'))
loss = (Q_samp(s) - Q(s, a))^2
You need to compute the average of the loss over the minibatch
and store the resulting scalar into self.loss
HINT: - config variables are accessible through self.config
- you can access placeholders like self.a (for actions)
self.r (rewards) or self.done_mask for instance
- you may find the following functions useful
- tf.cast
- tf.reduce_max / reduce_sum
- tf.one_hot
- ...
(be sure that you set self.loss)
"""
##############################################################
##################### YOUR CODE HERE - 4-5 lines #############
#done = tf.cast(self.done_mask, tf.float32)
idx = tf.arg_max(q, dimension=1)
idx_one_hot = tf.one_hot(idx, num_actions)
temp = self.r + self.config.gamma*tf.reduce_sum(tf.multiply(target_q, idx_one_hot), axis=1)
q_samp = tf.where(self.done_mask, self.r, temp)
action = tf.one_hot(self.a, num_actions)
q_new = tf.reduce_sum(tf.multiply(action,q), axis=1)
self.loss = tf.reduce_mean(tf.square(q_new - q_samp))
##############################################################
######################## END YOUR CODE #######################
"""
Use a different architecture for the Atari game. Please report the final result.
Feel free to change the configuration. If so, please report your hyperparameters.
"""
if __name__ == '__main__':
# make env
env = gym.make(config.env_name)
env = MaxAndSkipEnv(env, skip=config.skip_frame)
env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1),
overwrite_render=config.overwrite_render)
# exploration strategy
# you may want to modify this schedule
exp_schedule = LinearExploration(env, config.eps_begin,
config.eps_end, config.eps_nsteps)
# you may want to modify this schedule
# learning rate schedule
lr_schedule = LinearSchedule(config.lr_begin, config.lr_end,
config.lr_nsteps)
# train model
model = MyDQN(env, config)
model.run(exp_schedule, lr_schedule)
================================================
FILE: assignment2/q6_dueling.py
================================================
import gym
from utils.preprocess import greyscale
from utils.wrappers import PreproWrapper, MaxAndSkipEnv
import tensorflow as tf
import tensorflow.contrib.layers as layers
from utils.general import get_logger
from utils.test_env import EnvTest
from q1_schedule import LinearExploration, LinearSchedule
from q2_linear import Linear
from configs.q6_bonus_question import config
class MyDQN(Linear):
"""
Going beyond - implement your own Deep Q Network to find the perfect
balance between depth, complexity, number of parameters, etc.
You can change the way the q-values are computed, the exploration
strategy, or the learning rate schedule. You can also create your own
wrapper of environment and transform your input to something that you
think we'll help to solve the task. Ideally, your network would run faster
than DeepMind's and achieve similar performance!
You can also change the optimizer (by overriding the functions defined
in TFLinear), or even change the sampling strategy from the replay buffer.
If you prefer not to build on the current architecture, you're welcome to
write your own code.
You may also try more recent approaches, like double Q learning
(see https://arxiv.org/pdf/1509.06461.pdf) or dueling networks
(see https://arxiv.org/abs/1511.06581), but this would be for extra
extra bonus points.
"""
def get_q_values_op(self, state, scope, reuse=False):
"""
Returns Q values for all actions
Args:
state: (tf tensor)
shape = (batch_size, img height, img width, nchannels)
scope: (string) scope name, that specifies if target network or not
reuse: (bool) reuse of variables in the scope
Returns:
out: (tf tensor) of shape = (batch_size, num_actions)
"""
# this information might be useful
num_actions = self.env.action_space.n
out = state
##############################################################
"""
TODO: implement the computation of Q values like in the paper
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
HINT: you may find tensorflow.contrib.layers useful (imported)
make sure to understand the use of the scope param
you can use any other methods from tensorflow
you are not allowed to import extra packages (like keras,
lasagne, cafe, etc.)
L1: 32 8x8 filters with stride 4 + RELU
L2: 64 4x4 filters with stride 2 + RELU
L3: 64 3x3 fitlers with stride 1 + RELU
L4a: 512 unit Fully-Connected layer + RELU
L4b: 512 unit Fully-Connected layer + RELU
L5a: 1 unit FC (State Value)
L5b: #actions FC (Advantage Value)
L6: Aggregate V(s)+A(s,a)
"""
##############################################################
################ YOUR CODE HERE - 10-15 lines ################
with tf.variable_scope(scope, reuse=reuse) as _:
out = layers.conv2d(out, num_outputs=32, kernel_size=8, stride=4)
out = layers.conv2d(out, num_outputs=64, kernel_size=4, stride=2)
out = layers.conv2d(out, num_outputs=64, kernel_size=3, stride=1)
out = layers.flatten(out)
out = layers.fully_connected(out, num_outputs=512)
out1 = layers.fully_connected(out, num_outputs=1, activation_fn=None)
out2 = layers.fully_connected(out, num_outputs=num_actions, activation_fn=None)
out = out2 - tf.tile(tf.expand_dims(tf.reduce_mean(out2, axis=1),-1), [1,num_actions])
out = out + tf.tile(out1, [1,num_actions])
##############################################################
######################## END YOUR CODE #######################
return out
"""
Use a different architecture for the Atari game. Please report the final result.
Feel free to change the configuration. If so, please report your hyperparameters.
"""
if __name__ == '__main__':
# make env
env = gym.make(config.env_name)
env = MaxAndSkipEnv(env, skip=config.skip_frame)
env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1),
overwrite_render=config.overwrite_render)
# exploration strategy
# you may want to modify this schedule
exp_schedule = LinearExploration(env, config.eps_begin,
config.eps_end, config.eps_nsteps)
# you may want to modify this schedule
# learning rate schedule
lr_schedule = LinearSchedule(config.lr_begin, config.lr_end,
config.lr_nsteps)
# train model
model = MyDQN(env, config)
model.run(exp_schedule, lr_schedule)
================================================
FILE: assignment2/requirements.txt
================================================
matplotlib
numpy
six
================================================
FILE: assignment2/results/q2_linear/log.txt
================================================
2017-11-28 20:52:49,822:INFO: Evaluating...
2017-11-28 20:52:50,064:INFO: Average reward: -0.50 +/- 0.00
2017-11-28 20:52:50,983:INFO: Evaluating...
2017-11-28 20:52:51,013:INFO: Average reward: -0.50 +/- 0.00
2017-11-28 20:52:51,772:INFO: Evaluating...
2017-11-28 20:52:51,803:INFO: Average reward: -0.10 +/- 0.00
2017-11-28 20:52:52,561:INFO: Evaluating...
2017-11-28 20:52:52,592:INFO: Average reward: -0.10 +/- 0.00
2017-11-28 20:52:53,356:INFO: Evaluating...
2017-11-28 20:52:53,386:INFO: Average reward: -0.30 +/- 0.00
2017-11-28 20:52:54,208:INFO: Evaluating...
2017-11-28 20:52:54,240:INFO: Average reward: -0.10 +/- 0.00
2017-11-28 20:52:54,996:INFO: Evaluating...
2017-11-28 20:52:55,026:INFO: Average reward: -0.10 +/- 0.00
2017-11-28 20:52:55,779:INFO: Evaluating...
2017-11-28 20:52:55,809:INFO: Average reward: -0.50 +/- 0.00
2017-11-28 20:52:56,576:INFO: Evaluating...
2017-11-28 20:52:56,604:INFO: Average reward: -0.10 +/- 0.00
2017-11-28 20:52:57,366:INFO: Evaluating...
2017-11-28 20:52:57,394:INFO: Average reward: -0.10 +/- 0.00
2017-11-28 20:52:58,138:INFO: - Training done.
2017-11-28 20:52:58,161:INFO: Evaluating...
2017-11-28 20:52:58,194:INFO: Average reward: -0.10 +/- 0.00
2017-11-28 21:10:09,597:INFO: Evaluating...
2017-11-28 21:10:09,634:INFO: Average reward: -0.30 +/- 0.00
2017-11-28 21:10:10,317:INFO: Evaluating...
2017-11-28 21:10:10,347:INFO: Average reward: 0.50 +/- 0.00
2017-11-28 21:10:11,113:INFO: Evaluating...
2017-11-28 21:10:11,145:INFO: Average reward: 0.10 +/- 0.00
2017-11-28 21:10:11,894:INFO: Evaluating...
2017-11-28 21:10:11,925:INFO: Average reward: -0.10 +/- 0.00
2017-11-28 21:10:12,685:INFO: Evaluating...
2017-11-28 21:10:12,717:INFO: Average reward: 0.50 +/- 0.00
2017-11-28 21:10:13,506:INFO: Evaluating...
2017-11-28 21:10:13,539:INFO: Average reward: 1.90 +/- 0.00
2017-11-28 21:10:14,291:INFO: Evaluating...
2017-11-28 21:10:14,322:INFO: Average reward: 2.10 +/- 0.00
2017-11-28 21:10:15,084:INFO: Evaluating...
2017-11-28 21:10:15,114:INFO: Average reward: 2.00 +/- 0.00
2017-11-28 21:10:15,876:INFO: Evaluating...
2017-11-28 21:10:15,907:INFO: Average reward: 2.10 +/- 0.00
2017-11-28 21:10:16,665:INFO: Evaluating...
2017-11-28 21:10:16,695:INFO: Average reward: 2.10 +/- 0.00
2017-11-28 21:10:17,432:INFO: - Training done.
2017-11-28 21:10:17,453:INFO: Evaluating...
2017-11-28 21:10:17,486:INFO: Average reward: 2.10 +/- 0.00
================================================
FILE: assignment2/results/q2_linear/model.weights/checkpoint
================================================
model_checkpoint_path: "."
all_model_checkpoint_paths: "."
================================================
FILE: assignment2/results/q3_nature/log.txt
================================================
2017-11-28 21:36:35,366:INFO: Evaluating...
2017-11-28 21:36:35,752:INFO: Average reward: 0.00 +/- 0.00
2017-11-28 21:36:36,569:INFO: Evaluating...
2017-11-28 21:36:36,868:INFO: Average reward: -0.50 +/- 0.00
2017-11-28 21:36:40,918:INFO: Evaluating...
2017-11-28 21:36:41,207:INFO: Average reward: 0.00 +/- 0.00
2017-11-28 21:36:45,230:INFO: Evaluating...
2017-11-28 21:36:45,520:INFO: Average reward: 0.50 +/- 0.00
2017-11-28 21:36:49,710:INFO: Evaluating...
2017-11-28 21:36:50,002:INFO: Average reward: 2.00 +/- 0.00
2017-11-28 21:36:54,073:INFO: Evaluating...
2017-11-28 21:36:54,361:INFO: Average reward: 2.00 +/- 0.00
2017-11-28 21:36:58,412:INFO: Evaluating...
2017-11-28 21:36:58,698:INFO: Average reward: 2.00 +/- 0.00
2017-11-28 21:37:02,752:INFO: Evaluating...
2017-11-28 21:37:03,044:INFO: Average reward: 2.10 +/- 0.00
2017-11-28 21:37:07,233:INFO: Evaluating...
2017-11-28 21:37:07,513:INFO: Average reward: 2.10 +/- 0.00
2017-11-28 21:37:09,855:INFO: - Training done.
2017-11-28 21:37:09,959:INFO: Evaluating...
2017-11-28 21:37:10,247:INFO: Average reward: 2.10 +/- 0.00
================================================
FILE: assignment2/results/q3_nature/model.weights/checkpoint
================================================
model_checkpoint_path: "."
all_model_checkpoint_paths: "."
================================================
FILE: assignment2/results/q4_train_atari_linear/log.txt
================================================
2017-11-29 16:06:16,994:INFO: Making new env: Pong-v0
2017-11-29 16:06:17,179:INFO: Creating monitor directory results/q4_train_atari_linear/monitor/
2017-11-29 16:06:17,187:INFO: Starting new video recorder writing to /home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.5469.video000000.mp4
2017-11-29 16:06:18,628:INFO: Finished writing results. You can upload them to the scoreboard via gym.upload('/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor')
2017-11-29 16:06:18,629:INFO: Evaluating...
2017-11-29 16:07:00,357:INFO: Average reward: -20.98 +/- 0.02
2017-11-29 16:30:31,583:INFO: Evaluating...
2017-11-30 12:01:58,705:INFO: Making new env: Pong-v0
2017-11-30 12:01:58,917:INFO: Starting new video recorder writing to /home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.3758.video000000.mp4
2017-11-30 12:02:01,397:INFO: Finished writing results. You can upload them to the scoreboard via gym.upload('/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor')
2017-11-30 12:02:01,397:INFO: Evaluating...
2017-11-30 12:02:40,550:INFO: Average reward: -20.98 +/- 0.02
2017-11-30 14:37:22,473:INFO: Making new env: Pong-v0
2017-11-30 14:37:22,717:INFO: Starting new video recorder writing to /home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.2799.video000000.mp4
2017-11-30 14:37:26,391:INFO: Finished writing results. You can upload them to the scoreboard via gym.upload('/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor')
2017-11-30 14:37:26,392:INFO: Evaluating...
2017-11-30 14:38:24,987:INFO: Average reward: -20.90 +/- 0.06
2017-11-30 15:03:46,854:INFO: Evaluating...
================================================
FILE: assignment2/results/q4_train_atari_linear/model.weights/checkpoint
================================================
model_checkpoint_path: "."
all_model_checkpoint_paths: "."
================================================
FILE: assignment2/results/q4_train_atari_linear/monitor/openaigym.episode_batch.0.2799.stats.json
================================================
{"timestamps": [1512023846.375136], "initial_reset_timestamp": 1512023842.709429, "episode_types": ["t"], "episode_lengths": [1254], "episode_rewards": [-21.0]}
================================================
FILE: assignment2/results/q4_train_atari_linear/monitor/openaigym.episode_batch.0.3758.stats.json
================================================
{"timestamps": [1512014521.383348], "initial_reset_timestamp": 1512014518.909645, "episode_types": ["t"], "episode_lengths": [1005], "episode_rewards": [-21.0]}
================================================
FILE: assignment2/results/q4_train_atari_linear/monitor/openaigym.episode_batch.0.5469.stats.json
================================================
{"timestamps": [1511942778.615624], "initial_reset_timestamp": 1511942777.179417, "episode_types": ["t"], "episode_lengths": [1195], "episode_rewards": [-21.0]}
================================================
FILE: assignment2/results/q4_train_atari_linear/monitor/openaigym.manifest.0.2799.manifest.json
================================================
{"env_info": {"env_id": "Pong-v0", "gym_version": "0.9.3"}, "stats": "openaigym.episode_batch.0.2799.stats.json", "videos": [["openaigym.video.0.2799.video000000.mp4", "openaigym.video.0.2799.video000000.meta.json"]]}
================================================
FILE: assignment2/results/q4_train_atari_linear/monitor/openaigym.manifest.0.3758.manifest.json
================================================
{"env_info": {"env_id": "Pong-v0", "gym_version": "0.9.3"}, "stats": "openaigym.episode_batch.0.3758.stats.json", "videos": [["openaigym.video.0.3758.video000000.mp4", "openaigym.video.0.3758.video000000.meta.json"]]}
================================================
FILE: assignment2/results/q4_train_atari_linear/monitor/openaigym.manifest.0.5469.manifest.json
================================================
{"env_info": {"env_id": "Pong-v0", "gym_version": "0.9.3"}, "stats": "openaigym.episode_batch.0.5469.stats.json", "videos": [["openaigym.video.0.5469.video000000.mp4", "openaigym.video.0.5469.video000000.meta.json"]]}
================================================
FILE: assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.2799.video000000.meta.json
================================================
{"encoder_version": {"cmdline": ["avconv", "-nostats", "-loglevel", "error", "-y", "-r", "30", "-f", "rawvideo", "-s:v", "160x210", "-pix_fmt", "rgb24", "-i", "-", "-vcodec", "libx264", "-pix_fmt", "yuv420p", "/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.2799.video000000.mp4"], "version": "avconv version 9.18-6:9.18-0ubuntu0.14.04.1, Copyright (c) 2000-2014 the Libav developers\n built on Mar 16 2015 13:19:10 with gcc 4.8 (Ubuntu 4.8.2-19ubuntu1)\navconv 9.18-6:9.18-0ubuntu0.14.04.1\nlibavutil 52. 3. 0 / 52. 3. 0\nlibavcodec 54. 35. 0 / 54. 35. 0\nlibavformat 54. 20. 4 / 54. 20. 4\nlibavdevice 53. 2. 0 / 53. 2. 0\nlibavfilter 3. 3. 0 / 3. 3. 0\nlibavresample 1. 0. 1 / 1. 0. 1\nlibswscale 2. 1. 1 / 2. 1. 1\n", "backend": "avconv"}, "content_type": "video/mp4", "episode_id": 0}
================================================
FILE: assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.3758.video000000.meta.json
================================================
{"encoder_version": {"cmdline": ["avconv", "-nostats", "-loglevel", "error", "-y", "-r", "30", "-f", "rawvideo", "-s:v", "160x210", "-pix_fmt", "rgb24", "-i", "-", "-vcodec", "libx264", "-pix_fmt", "yuv420p", "/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.3758.video000000.mp4"], "version": "avconv version 9.18-6:9.18-0ubuntu0.14.04.1, Copyright (c) 2000-2014 the Libav developers\n built on Mar 16 2015 13:19:10 with gcc 4.8 (Ubuntu 4.8.2-19ubuntu1)\navconv 9.18-6:9.18-0ubuntu0.14.04.1\nlibavutil 52. 3. 0 / 52. 3. 0\nlibavcodec 54. 35. 0 / 54. 35. 0\nlibavformat 54. 20. 4 / 54. 20. 4\nlibavdevice 53. 2. 0 / 53. 2. 0\nlibavfilter 3. 3. 0 / 3. 3. 0\nlibavresample 1. 0. 1 / 1. 0. 1\nlibswscale 2. 1. 1 / 2. 1. 1\n", "backend": "avconv"}, "content_type": "video/mp4", "episode_id": 0}
================================================
FILE: assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.5469.video000000.meta.json
================================================
{"encoder_version": {"cmdline": ["avconv", "-nostats", "-loglevel", "error", "-y", "-r", "30", "-f", "rawvideo", "-s:v", "160x210", "-pix_fmt", "rgb24", "-i", "-", "-vcodec", "libx264", "-pix_fmt", "yuv420p", "/home/zengliang/CS234/assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.5469.video000000.mp4"], "version": "avconv version 9.18-6:9.18-0ubuntu0.14.04.1, Copyright (c) 2000-2014 the Libav developers\n built on Mar 16 2015 13:19:10 with gcc 4.8 (Ubuntu 4.8.2-19ubuntu1)\navconv 9.18-6:9.18-0ubuntu0.14.04.1\nlibavutil 52. 3. 0 / 52. 3. 0\nlibavcodec 54. 35. 0 / 54. 35. 0\nlibavformat 54. 20. 4 / 54. 20. 4\nlibavdevice 53. 2. 0 / 53. 2. 0\nlibavfilter 3. 3. 0 / 3. 3. 0\nlibavresample 1. 0. 1 / 1. 0. 1\nlibswscale 2. 1. 1 / 2. 1. 1\n", "backend": "avconv"}, "content_type": "video/mp4", "episode_id": 0}
================================================
FILE: assignment2/utils/__init__.py
================================================
================================================
FILE: assignment2/utils/general.py
================================================
import time
import sys
import logging
import numpy as np
from collections import deque
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
def export_plot(ys, ylabel, filename):
"""
Export a plot in filename
Args:
ys: (list) of float / int to plot
filename: (string) directory
"""
plt.figure()
plt.plot(range(len(ys)), ys)
plt.xlabel("Epoch")
plt.ylabel(ylabel)
plt.savefig(filename)
plt.close()
def get_logger(filename):
"""
Return a logger instance to a file
"""
logger = logging.getLogger('logger')
logger.setLevel(logging.DEBUG)
logging.basicConfig(format='%(message)s', level=logging.DEBUG)
handler = logging.FileHandler(filename)
handler.setLevel(logging.DEBUG)
handler.setFormatter(logging.Formatter('%(asctime)s:%(levelname)s: %(message)s'))
logging.getLogger().addHandler(handler)
return logger
class Progbar(object):
"""Progbar class copied from keras (https://github.com/fchollet/keras/)
Displays a progress bar.
Small edit : added strict arg to update
# Arguments
target: Total number of steps expected.
interval: Minimum visual progress update interval (in seconds).
"""
def __init__(self, target, width=30, verbose=1, discount=0.9):
self.width = width
self.target = target
self.sum_values = {}
self.exp_avg = {}
self.unique_values = []
self.start = time.time()
self.total_width = 0
self.seen_so_far = 0
self.verbose = verbose
self.discount = discount
def update(self, current, values=[], exact=[], strict=[], exp_avg=[]):
"""
Updates the progress bar.
# Arguments
current: Index of current step.
values: List of tuples (name, value_for_last_step).
The progress bar will display averages for these values.
exact: List of tuples (name, value_for_last_step).
The progress bar will display these values directly.
"""
for k, v in values:
if k not in self.sum_values:
self.sum_values[k] = [v * (current - self.seen_so_far), current - self.seen_so_far]
self.unique_values.append(k)
else:
self.sum_values[k][0] += v * (current - self.seen_so_far)
self.sum_values[k][1] += (current - self.seen_so_far)
for k, v in exact:
if k not in self.sum_values:
self.unique_values.append(k)
self.sum_values[k] = [v, 1]
for k, v in strict:
if k not in self.sum_values:
self.unique_values.append(k)
self.sum_values[k] = v
for k, v in exp_avg:
if k not in self.exp_avg:
self.exp_avg[k] = v
else:
self.exp_avg[k] *= self.discount
self.exp_avg[k] += (1-self.discount)*v
self.seen_so_far = current
now = time.time()
if self.verbose == 1:
prev_total_width = self.total_width
sys.stdout.write("\b" * prev_total_width)
sys.stdout.write("\r")
numdigits = int(np.floor(np.log10(self.target))) + 1
barstr = '%%%dd/%%%dd [' % (numdigits, numdigits)
bar = barstr % (current, self.target)
prog = float(current)/self.target
prog_width = int(self.width*prog)
if prog_width > 0:
bar += ('='*(prog_width-1))
if current < self.target:
bar += '>'
else:
bar += '='
bar += ('.'*(self.width-prog_width))
bar += ']'
sys.stdout.write(bar)
self.total_width = len(bar)
if current:
time_per_unit = (now - self.start) / current
else:
time_per_unit = 0
eta = time_per_unit*(self.target - current)
info = ''
if current < self.target:
info += ' - ETA: %ds' % eta
else:
info += ' - %ds' % (now - self.start)
for k in self.unique_values:
if type(self.sum_values[k]) is list:
info += ' - %s: %.4f' % (k, self.sum_values[k][0] / max(1, self.sum_values[k][1]))
else:
info += ' - %s: %s' % (k, self.sum_values[k])
for k, v in self.exp_avg.iteritems():
info += ' - %s: %.4f' % (k, v)
self.total_width += len(info)
if prev_total_width > self.total_width:
info += ((prev_total_width-self.total_width) * " ")
sys.stdout.write(info)
sys.stdout.flush()
if current >= self.target:
sys.stdout.write("\n")
if self.verbose == 2:
if current >= self.target:
info = '%ds' % (now - self.start)
for k in self.unique_values:
info += ' - %s: %.4f' % (k, self.sum_values[k][0] / max(1, self.sum_values[k][1]))
sys.stdout.write(info + "\n")
def add(self, n, values=[]):
self.update(self.seen_so_far+n, values)
================================================
FILE: assignment2/utils/preprocess.py
================================================
import numpy as np
def greyscale(state):
"""
Preprocess state (210, 160, 3) image into
a (80, 80, 1) image in grey scale
"""
state = np.reshape(state, [210, 160, 3]).astype(np.float32)
# grey scale
state = state[:, :, 0] * 0.299 + state[:, :, 1] * 0.587 + state[:, :, 2] * 0.114
# karpathy
state = state[35:195] # crop
state = state[::2,::2] # downsample by factor of 2
state = state[:, :, np.newaxis]
return state.astype(np.uint8)
def blackandwhite(state):
"""
Preprocess state (210, 160, 3) image into
a (80, 80, 1) image in grey scale
"""
# erase background
state[state==144] = 0
state[state==109] = 0
state[state!=0] = 1
# karpathy
state = state[35:195] # crop
state = state[::2,::2, 0] # downsample by factor of 2
state = state[:, :, np.newaxis]
return state.astype(np.uint8)
================================================
FILE: assignment2/utils/replay_buffer.py
================================================
import numpy as np
import random
def sample_n_unique(sampling_f, n):
"""Helper function. Given a function `sampling_f` that returns
comparable objects, sample n such unique objects.
"""
res = []
while len(res) < n:
candidate = sampling_f()
if candidate not in res:
res.append(candidate)
return res
class ReplayBuffer(object):
"""
Taken from Berkeley's Assignment
"""
def __init__(self, size, frame_history_len):
"""This is a memory efficient implementation of the replay buffer.
The sepecific memory optimizations use here are:
- only store each frame once rather than k times
even if every observation normally consists of k last frames
- store frames as np.uint8 (actually it is most time-performance
to cast them back to float32 on GPU to minimize memory transfer
time)
- store frame_t and frame_(t+1) in the same buffer.
For the tipical use case in Atari Deep RL buffer with 1M frames the total
memory footprint of this buffer is 10^6 * 84 * 84 bytes ~= 7 gigabytes
Warning! Assumes that returning frame of zeros at the beginning
of the episode, when there is less frames than `frame_history_len`,
is acceptable.
Parameters
----------
size: int
Max number of transitions to store in the buffer. When the buffer
overflows the old memories are dropped.
frame_history_len: int
Number of memories to be retried for each observation.
"""
self.size = size
self.frame_history_len = frame_history_len
self.next_idx = 0
self.num_in_buffer = 0
self.obs = None
self.action = None
self.reward = None
self.done = None
def can_sample(self, batch_size):
"""Returns true if `batch_size` different transitions can be sampled from the buffer."""
return batch_size + 1 <= self.num_in_buffer
def _encode_sample(self, idxes):
obs_batch = np.concatenate([self._encode_observation(idx)[None] for idx in idxes], 0)
act_batch = self.action[idxes]
rew_batch = self.reward[idxes]
next_obs_batch = np.concatenate([self._encode_observation(idx + 1)[None] for idx in idxes], 0)
done_mask = np.array([1.0 if self.done[idx] else 0.0 for idx in idxes], dtype=np.float32)
return obs_batch, act_batch, rew_batch, next_obs_batch, done_mask
def sample(self, batch_size):
"""Sample `batch_size` different transitions.
i-th sample transition is the following:
when observing `obs_batch[i]`, action `act_batch[i]` was taken,
after which reward `rew_batch[i]` was received and subsequent
observation next_obs_batch[i] was observed, unless the epsiode
was done which is represented by `done_mask[i]` which is equal
to 1 if episode has ended as a result of that action.
Parameters
----------
batch_size: int
How many transitions to sample.
Returns
-------
obs_batch: np.array
Array of shape
(batch_size, img_h, img_w, img_c * frame_history_len)
and dtype np.uint8
act_batch: np.array
Array of shape (batch_size,) and dtype np.int32
rew_batch: np.array
Array of shape (batch_size,) and dtype np.float32
next_obs_batch: np.array
Array of shape
(batch_size, img_h, img_w, img_c * frame_history_len)
and dtype np.uint8
done_mask: np.array
Array of shape (batch_size,) and dtype np.float32
"""
assert self.can_sample(batch_size)
idxes = sample_n_unique(lambda: random.randint(0, self.num_in_buffer - 2), batch_size)
return self._encode_sample(idxes)
def encode_recent_observation(self):
"""Return the most recent `frame_history_len` frames.
Returns
-------
observation: np.array
Array of shape (img_h, img_w, img_c * frame_history_len)
and dtype np.uint8, where observation[:, :, i*img_c:(i+1)*img_c]
encodes frame at time `t - frame_history_len + i`
"""
assert self.num_in_buffer > 0
return self._encode_observation((self.next_idx - 1) % self.size)
def _encode_observation(self, idx):
end_idx = idx + 1 # make noninclusive
start_idx = end_idx - self.frame_history_len
# this checks if we are using low-dimensional observations, such as RAM
# state, in which case we just directly return the latest RAM.
# if len(self.obs.shape) <= 2:
# return self.obs[end_idx-1]
# if there weren't enough frames ever in the buffer for context
if start_idx < 0 and self.num_in_buffer != self.size:
start_idx = 0
for idx in range(start_idx, end_idx - 1):
if self.done[idx % self.size]:
start_idx = idx + 1
missing_context = self.frame_history_len - (end_idx - start_idx)
# if zero padding is needed for missing context
# or we are on the boundry of the buffer
if start_idx < 0 or missing_context > 0:
frames = [np.zeros_like(self.obs[0]) for _ in range(missing_context)]
for idx in range(start_idx, end_idx):
frames.append(self.obs[idx % self.size])
return np.concatenate(frames, 2)
else:
# this optimization has potential to saves about 30% compute time \o/
img_h, img_w = self.obs.shape[1], self.obs.shape[2]
return self.obs[start_idx:end_idx].transpose(1, 2, 0, 3).reshape(img_h, img_w, -1)
def store_frame(self, frame):
"""Store a single frame in the buffer at the next available index, overwriting
old frames if necessary.
Parameters
----------
frame: np.array
Array of shape (img_h, img_w, img_c) and dtype np.uint8
the frame to be stored
Returns
-------
idx: int
Index at which the frame is stored. To be used for `store_effect` later.
"""
if self.obs is None:
self.obs = np.empty([self.size] + list(frame.shape), dtype=np.uint8)
self.action = np.empty([self.size], dtype=np.int32)
self.reward = np.empty([self.size], dtype=np.float32)
self.done = np.empty([self.size], dtype=np.bool)
self.obs[self.next_idx] = frame
ret = self.next_idx
self.next_idx = (self.next_idx + 1) % self.size
self.num_in_buffer = min(self.size, self.num_in_buffer + 1)
return ret
def store_effect(self, idx, action, reward, done):
"""Store effects of action taken after obeserving frame stored
at index idx. The reason `store_frame` and `store_effect` is broken
up into two functions is so that once can call `encode_recent_observation`
in between.
Paramters
---------
idx: int
Index in buffer of recently observed frame (returned by `store_frame`).
action: int
Action that was performed upon observing this frame.
reward: float
Reward that was received when the actions was performed.
done: bool
True if episode was finished after performing that action.
"""
self.action[idx] = action
self.reward[idx] = reward
self.done[idx] = done
================================================
FILE: assignment2/utils/test_env.py
================================================
import numpy as np
class ActionSpace(object):
def __init__(self, n):
self.n = n
def sample(self):
return np.random.randint(0, self.n)
class ObservationSpace(object):
def __init__(self, shape):
self.shape = shape
self.bad_state = np.random.randint(0, 50, shape, dtype=np.uint8)
self.normal_state = np.random.randint(100, 150, shape, dtype=np.uint8)
self.good_state = np.random.randint(200, 250, shape, dtype=np.uint8)
self.states = [self.bad_state, self.normal_state, self.good_state]
class EnvTest(object):
"""
Adapted from Igor Gitman, CMU / Karan Goel
"""
def __init__(self, shape=(84, 84, 3)):
#3 states
self.rewards = [-0.1, 0, 0.1]
self.cur_state = 0
self.num_iters = 0
self.was_in_second = False
self.action_space = ActionSpace(4)
self.observation_space = ObservationSpace(shape)
def reset(self):
self.cur_state = 0
self.num_iters = 0
self.was_in_second = False
return self.observation_space.states[self.cur_state]
def step(self, action):
assert(0 <= action <= 3)
self.num_iters += 1
if action < 3:
self.cur_state = action
reward = self.rewards[self.cur_state]
if self.was_in_second is True:
reward *= -10
if self.cur_state == 1:
self.was_in_second = True
else:
self.was_in_second = False
return self.observation_space.states[self.cur_state], reward, self.num_iters >= 5, {'ale.lives':0}
def render(self):
print(self.cur_state)
================================================
FILE: assignment2/utils/viewer.py
================================================
import pyglet
class SimpleImageViewer(object):
"""
Modified version of gym viewer to chose format (RBG or I)
see source here https://github.com/openai/gym/blob/master/gym/envs/classic_control/rendering.py
"""
def __init__(self, display=None):
self.window = None
self.isopen = False
self.display = display
def imshow(self, arr):
if self.window is None:
height, width, channels = arr.shape
self.window = pyglet.window.Window(width=width, height=height, display=self.display)
self.width = width
self.height = height
self.isopen = True
##########################
####### old version ######
# assert arr.shape == (self.height, self.width, I), "You passed in an image with the wrong number shape"
# image = pyglet.image.ImageData(self.width, self.height, 'RGB', arr.tobytes())
##########################
##########################
####### new version ######
nchannels = arr.shape[-1]
if nchannels == 1:
_format = "I"
elif nchannels == 3:
_format = "RGB"
else:
raise NotImplementedError
image = pyglet.image.ImageData(self.width, self.height, _format, arr.tobytes())
##########################
self.window.clear()
self.window.switch_to()
self.window.dispatch_events()
image.blit(0,0)
self.window.flip()
def close(self):
if self.isopen:
self.window.close()
self.isopen = False
def __del__(self):
self.close()
================================================
FILE: assignment2/utils/wrappers.py
================================================
import numpy as np
import gym
from gym import spaces
from viewer import SimpleImageViewer
from collections import deque
class MaxAndSkipEnv(gym.Wrapper):
"""
Wrapper from Berkeley's Assignment
Takes a max pool over the last n states
"""
def __init__(self, env=None, skip=4):
"""Return only every `skip`-th frame"""
super(MaxAndSkipEnv, self).__init__(env)
# most recent raw observations (for max pooling across time steps)
self._obs_buffer = deque(maxlen=2)
self._skip = skip
def _step(self, action):
total_reward = 0.0
done = None
for _ in range(self._skip):
obs, reward, done, info = self.env.step(action)
self._obs_buffer.append(obs)
total_reward += reward
if done:
break
max_frame = np.max(np.stack(self._obs_buffer), axis=0)
return max_frame, total_reward, done, info
def _reset(self):
"""Clear past frame buffer and init. to first obs. from inner env."""
self._obs_buffer.clear()
obs = self.env.reset()
self._obs_buffer.append(obs)
return obs
class PreproWrapper(gym.Wrapper):
"""
Wrapper for Pong to apply preprocessing
Stores the state into variable self.obs
"""
def __init__(self, env, prepro, shape, overwrite_render=True, high=255):
"""
Args:
env: (gym env)
prepro: (function) to apply to a state for preprocessing
shape: (list) shape of obs after prepro
overwrite_render: (bool) if True, render is overwriten to vizualise effect of prepro
grey_scale: (bool) if True, assume grey scale, else black and white
high: (int) max value of state after prepro
"""
super(PreproWrapper, self).__init__(env)
self.overwrite_render = overwrite_render
self.viewer = None
self.prepro = prepro
self.observation_space = spaces.Box(low=0, high=high, shape=shape)
self.high = high
def _step(self, action):
"""
Overwrites _step function from environment to apply preprocess
"""
obs, reward, done, info = self.env.step(action)
self.obs = self.prepro(obs)
return self.obs, reward, done, info
def _reset(self):
self.obs = self.prepro(self.env.reset())
return self.obs
def _render(self, mode='human', close=False):
"""
Overwrite _render function to vizualize preprocessing
"""
if self.overwrite_render:
if close:
if self.viewer is not None:
self.viewer.close()
self.viewer = None
return
img = self.obs
if mode == 'rgb_array':
return img
elif mode == 'human':
from gym.envs.classic_control import rendering
if self.viewer is None:
self.viewer = SimpleImageViewer()
self.viewer.imshow(img)
else:
super(PongWrapper, self)._render(mode, close)
================================================
FILE: assignment3/discrete_env.py
================================================
import numpy as np
from gym import Env, spaces
from gym.utils import seeding
def categorical_sample(prob_n, np_random):
"""
Sample from categorical distribution
Each row specifies class probabilities
"""
prob_n = np.asarray(prob_n)
csprob_n = np.cumsum(prob_n)
return (csprob_n > np_random.rand()).argmax()
class DiscreteEnv(Env):
"""
Has the following members
- nS: number of states
- nA: number of actions
- P: transitions (*)
- isd: initial state distribution (**)
(*) dictionary dict of dicts of lists, where
P[s][a] == [(probability, nextstate, reward, done), ...]
(**) list or array of length nS
"""
def __init__(self, nS, nA, P, isd):
self.P = P
self.isd = isd
self.lastaction=None # for rendering
self.nS = nS
self.nA = nA
self.action_space = spaces.Discrete(self.nA)
self.observation_space = spaces.Discrete(self.nS)
self._seed()
self._reset()
def _seed(self, seed=None):
self.np_random, seed = seeding.np_random(seed)
return [seed]
def _reset(self):
self.s = categorical_sample(self.isd, self.np_random)
self.lastaction=None
return self.s
def _step(self, a):
transitions = self.P[self.s][a]
i = categorical_sample([t[0] for t in transitions], self.np_random)
p, s, r, d= transitions[i]
self.s = s
self.lastaction=a
return (s, r, d, {"prob" : p})
================================================
FILE: assignment3/frozen_lake.py
================================================
import numpy as np
import sys
from six import StringIO, b
from gym import utils
import discrete_env
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3
MAPS = {
"4x4": [
"SHHH",
"FHHH",
"FHHH",
"FFFG"
]
}
class FrozenLakeEnv(discrete_env.DiscreteEnv):
"""
Winter is here. You and your friends were tossing around a frisbee at the park
when you made a wild throw that left the frisbee out in the middle of the lake.
The water is mostly frozen, but there are a few holes where the ice has melted.
If you step into one of those holes, you'll fall into the freezing water.
At this time, there's an international frisbee shortage, so it's absolutely imperative that
you navigate across the lake and retrieve the disc.
However, the ice is slippery, so you won't always move in the direction you intend.
The surface is described using a grid like the following
SHHH
FHHH
FHHH
FFFG
S : starting point, safe
F : frozen surface, safe
H : hole, you cannot move to these place
G : goal, where the frisbee is located
The episode ends when you reach the goal or fall in a hole or reach max steps
You receive a reward of 1 if you reach the goal, and zero otherwise.
"""
metadata = {'render.modes': ['human', 'ansi']}
def __init__(self, desc=None, map_name="4x4",is_slippery=False):
if desc is None and map_name is None:
raise ValueError('Must provide either desc or map_name')
elif desc is None:
desc = MAPS[map_name]
self.desc = desc = np.asarray(desc,dtype='c')
self.nrow, self.ncol = nrow, ncol = desc.shape
nA = 4
nS = nrow * ncol
isd = np.array(desc == b'S').astype('float64').ravel()
isd /= isd.sum()
P = {s : {a : [] for a in range(nA)} for s in range(nS)}
self.a_true = []
for s in range(nS):
a_true_table = np.arange(4)
np.random.shuffle(a_true_table)
self.a_true.append(a_true_table)
def to_s(row, col):
return row*ncol + col
def inc(row, col, a):
a_true_table = self.a_true[to_s(row, col)]
if a_true_table[a]==0: # left
col = max(col-1,0)
elif a_true_table[a]==1: # down
row = min(row+1,nrow-1)
elif a_true_table[a]==2: # right
col = min(col+1,ncol-1)
elif a_true_table[a]==3: # up
row = max(row-1,0)
return (row, col)
for row in range(nrow):
for col in range(ncol):
s = to_s(row, col)
if desc[row, col] == b"H":
continue
for a in range(4):
li = P[s][a]
letter = desc[row, col]
if letter in b'GH':
li.append((1.0, s, 0, True))
else:
if is_slippery:
for b in [(a-1)%4, a, (a+1)%4]:
newrow, newcol = inc(row, col, b)
newstate = to_s(newrow, newcol)
newletter = desc[newrow, newcol]
# if meet hole, stay at original place
if newletter == b'H':
li.append((1.0, s, 0.0, False))
continue
done = bytes(newletter) in b'GH'
rew = float(newletter == b'G')
li.append((0.8 if b==a else 0.1, newstate, rew, done))
else:
newrow, newcol = inc(row, col, a)
newstate = to_s(newrow, newcol)
newletter = desc[newrow, newcol]
# if meet hole, stay at original place
if newletter == b'H':
li.append((1.0, s, 0.0, False))
continue
done = bytes(newletter) in b'GH'
rew = float(newletter == b'G')
li.append((1.0, newstate, rew, done))
super(FrozenLakeEnv, self).__init__(nS, nA, P, isd)
def _render(self, mode='human', close=False):
if close:
return
outfile = StringIO() if mode == 'ansi' else sys.stdout
row, col = self.s // self.ncol, self.s % self.ncol
desc = self.desc.tolist()
desc = [[c.decode('utf-8') for c in line] for line in desc]
desc[row][col] = utils.colorize(desc[row][col], "red", highlight=True)
if self.lastaction is not None:
outfile.write(" ({})\n".format(["Left","Down","Right","Up"][self.lastaction]))
else:
outfile.write("\n")
outfile.write("\n".join(''.join(line) for line in desc)+"\n")
return outfile
================================================
FILE: assignment3/q1.py
================================================
import math
import gym
from frozen_lake import *
import numpy as np
import time
from utils import *
import matplotlib.pyplot as plt
from tqdm import *
def rmax(env, gamma, m, R_max, epsilon, num_episodes, max_step = 6):
"""Learn state-action values using the Rmax algorithm
Args:
----------
env: gym.core.Environment
Environment to compute Q function for. Must have nS, nA, and P as
attributes.
gamma: float
Discount factor. Number in range [0, 1)
m: int
Threshold of visitance
R_max: float
The estimated max reward that could be obtained in the game
epsilon:
accuracy paramter
num_episodes: int
Number of episodes of training.
max_step: Int
max number of steps in each episode
Returns
-------
np.array
An array of shape [env.nS x env.nA] representing state-action values
"""
Q = np.ones((env.nS, env.nA)) * R_max / (1 - gamma)
R = np.zeros((env.nS, env.nA))
nSA = np.zeros((env.nS, env.nA))
nSASP = np.zeros((env.nS, env.nA, env.nS))
########################################################
# YOUR CODE HERE #
########################################################
total_score = 0
average_score = np.zeros(num_episodes)
for time in range(num_episodes):
is_done = False
cur_state = env.reset()
for _ in range(max_step):
if is_done:
break
action = np.argmax(Q[cur_state])
(next_state, reward, is_done, _) = env.step(action)
total_score += reward
if nSA[cur_state][action] < m:
nSA[cur_state][action] += 1
R[cur_state][action] += reward
nSASP[cur_state][action][next_state] +=1
if nSA[cur_state][action] == m:
up_bound = int(np.ceil(np.log(1.0/(epsilon*(1.0-gamma)))/(1.0-gamma)))
for i in range(up_bound):
for s in range(env.nS):
for a in range(env.nA):
if nSA[s][a] >= m:
q_temp = R[s][a] / nSA[s][a]
for j in range(env.nS):
prob = nSASP[s][a][j] / nSA[s][a]
q_temp += gamma*prob*np.max(Q[j])
Q[s][a] = q_temp
cur_state = next_state
average_score[time] = total_score / (time+1)
########################################################
# END YOUR CODE #
########################################################
return (Q, average_score)
def main():
env = FrozenLakeEnv(is_slippery=False)
print env.__doc__
for m in tqdm(np.arange(1,20,2)):
(Q, average_score) = rmax(env, gamma = 0.99, m=m, R_max = 1, epsilon = 0.1, num_episodes = 1000)
render_single_Q(env, Q)
plt.plot(np.arange(1000),np.array(average_score))
plt.title('The running average score of the R-max learning agent')
plt.xlabel('traning episodes')
plt.ylabel('score')
plt.legend(['m = '+str(i) for i in np.arange(1,20,2)], loc='upper right')
#plt.show()
plt.savefig('r-max.jpg')
if __name__ == '__main__':
print "haha"
main()
================================================
FILE: assignment3/q2.py
================================================
import math
import gym
from frozen_lake import *
import numpy as np
import time
from utils import *
from tqdm import *
import matplotlib.pyplot as plt
def learn_Q_QLearning(env, num_episodes=10000, gamma = 0.99, lr = 0.1, e = 0.2, max_step=6):
"""Learn state-action values using the Q-learning algorithm with epsilon-greedy exploration strategy(no decay)
Feel free to reuse your assignment1's code
Parameters
----------
env: gym.core.Environment
Environment to compute Q function for. Must have nS, nA, and P as attributes.
num_episodes: int
Number of episodes of training.
gamma: float
Discount factor. Number in range [0, 1)
learning_rate: float
Learning rate. Number in range [0, 1)
e: float
Epsilon value used in the epsilon-greedy method.
max_step: Int
max number of steps in each episode
Returns
-------
np.array
An array of shape [env.nS x env.nA] representing state-action values
"""
Q = np.zeros((env.nS, env.nA))
########################################################
# YOUR CODE HERE #
########################################################
total_score = 0
average_score = np.zeros(num_episodes)
for i in range(num_episodes):
done = False
state = env.reset()
for _ in range(max_step):
if done:
break
if np.random.rand() > e:
action = np.argmax(Q[state])
else:
action = np.random.randint(env.nA)
nextstate, reward, done, _ = env.step(action)
Q[state][action] = (1-lr)*Q[state][action]+lr*(reward+gamma*np.max(Q[nextstate]))
state = nextstate
total_score += reward
average_score[i] = total_score / (i+1)
########################################################
# END YOUR CODE #
########################################################
return (Q, average_score)
def main():
env = FrozenLakeEnv(is_slippery=False)
for e in tqdm(np.linspace(0,1,11)):
(Q, average_score) = learn_Q_QLearning(env, num_episodes = 10000, gamma = 0.99, lr = 0.1, e = e)
render_single_Q(env, Q)
plt.plot(np.arange(10000), np.array(average_score))
plt.title('The running average score of the Q-learning agent')
plt.xlabel('traning episodes')
plt.ylabel('score')
plt.legend(['e = '+str(i) for i in np.linspace(0,1,11)], loc='upper right')
#plt.show()
plt.savefig('q-learning.jpg')
if __name__ == '__main__':
main()
================================================
FILE: assignment3/q3.py
================================================
import math
import gym
from frozen_lake import *
import numpy as np
import time
from utils import *
import matplotlib.pyplot as plt
from tqdm import *
def rmax(env, gamma, m, R_max, epsilon, num_episodes, max_step = 6, e = 0.7):
"""Learn state-action values using the Rmax algorithm
Args:
----------
env: gym.core.Environment
Environment to compute Q function for. Must have nS, nA, and P as
attributes.
gamma: float
Discount factor. Number in range [0, 1)
m: int
Threshold of visitance
R_max: float
The estimated max reward that could be obtained in the game
epsilon:
accuracy paramter
num_episodes: int
Number of episodes of training.
max_step: Int
max number of steps in each episode
Returns
-------
np.array
An array of shape [env.nS x env.nA] representing state-action values
"""
Q = np.ones((env.nS, env.nA)) * R_max / (1 - gamma)
R = np.zeros((env.nS, env.nA))
nSA = np.zeros((env.nS, env.nA))
nSASP = np.zeros((env.nS, env.nA, env.nS))
########################################################
# YOUR CODE HERE #
########################################################
total_score = 0
average_score = np.zeros(num_episodes)
for time in range(num_episodes):
is_done = False
cur_state = env.reset()
for _ in range(max_step):
if is_done:
break
if np.random.rand() > e:
action = np.argmax(Q[cur_state])
else:
action = np.random.randint(env.nA)
(next_state, reward, is_done, _) = env.step(action)
total_score += reward
if nSA[cur_state][action] < m:
nSA[cur_state][action] += 1
R[cur_state][action] += reward
nSASP[cur_state][action][next_state] +=1
if nSA[cur_state][action] == m:
up_bound = int(np.ceil(np.log(1.0/(epsilon*(1.0-gamma)))/(1.0-gamma)))
for i in range(up_bound):
for s in range(env.nS):
for a in range(env.nA):
if nSA[s][a] >= m:
q_temp = R[s][a] / nSA[s][a]
for j in range(env.nS):
prob = nSASP[s][a][j] / nSA[s][a]
q_temp += gamma*prob*np.max(Q[j])
Q[s][a] = q_temp
cur_state = next_state
average_score[time] = total_score / (time+1)
########################################################
# END YOUR CODE #
########################################################
return (Q, average_score)
def main():
env = FrozenLakeEnv(is_slippery=False)
print env.__doc__
(Q, average_score) = rmax(env, gamma = 0.99, m=1, R_max = 1, epsilon = 0.1, num_episodes = 1000)
render_single_Q(env, Q)
plt.plot(np.arange(1000),np.array(average_score))
plt.title('The running average score of the R-max with e-greedy learning agent')
plt.xlabel('traning episodes')
plt.ylabel('score')
#plt.show()
plt.savefig('r-max+e_greedy.jpg')
if __name__ == '__main__':
print "haha"
main()
================================================
FILE: assignment3/requirements.txt
================================================
matplotlib
numpy
six
================================================
FILE: assignment3/utils.py
================================================
import math
import gym
from frozen_lake import *
import numpy as np
import time
def render_single_Q(env, Q, max_step = 6):
"""Renders Q function once on environment.
Parameters
----------
env: gym.core.Environment
Environment to play Q function on. Must have nS, nA, and P as
attributes.
Q: np.array of shape [env.nS x env.nA]
Q function
"""
state = env.reset()
done = False
episode_reward = 0
count = 0
while not done:
env.render()
time.sleep(0.5) # Seconds between frames. Modify as you wish.
action = np.argmax(Q[state])
state, reward, done, _ = env.step(action)
episode_reward += reward
count += 1
if count >= max_step:
break
print "Episode reward: %d" % episode_reward
gitextract_onbbhi0g/
├── LICENSE
├── README.md
├── assignment1/
│ ├── Makefile
│ ├── collect_submission.sh
│ ├── lake_envs.py
│ ├── log
│ ├── model_based_learning.py
│ ├── model_free_learning.py
│ ├── requirements.txt
│ └── vi_and_pi.py
├── assignment2/
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── collect_submission.sh
│ ├── configs/
│ │ ├── __init__.py
│ │ ├── frozen_lake.py
│ │ ├── q2_linear.py
│ │ ├── q3_nature.py
│ │ ├── q4_train_atari_linear.py
│ │ ├── q5_train_atari_nature.py
│ │ ├── q6_bonus_question.py
│ │ └── test.py
│ ├── core/
│ │ ├── __init__.py
│ │ ├── deep_q_learning.py
│ │ └── q_learning.py
│ ├── q1_schedule.py
│ ├── q2_linear.py
│ ├── q3_nature.py
│ ├── q4_train_atari_linear.py
│ ├── q5_train_atari_nature.py
│ ├── q6_double_q_learning.py
│ ├── q6_dueling.py
│ ├── requirements.txt
│ ├── results/
│ │ ├── q2_linear/
│ │ │ ├── events.out.tfevents.1511874609.zengliang-PU551LD
│ │ │ ├── log.txt
│ │ │ └── model.weights/
│ │ │ ├── .data-00000-of-00001
│ │ │ ├── .index
│ │ │ ├── .meta
│ │ │ └── checkpoint
│ │ ├── q3_nature/
│ │ │ ├── events.out.tfevents.1511876195.zengliang-PU551LD
│ │ │ ├── log.txt
│ │ │ └── model.weights/
│ │ │ ├── .index
│ │ │ ├── .meta
│ │ │ └── checkpoint
│ │ └── q4_train_atari_linear/
│ │ ├── log.txt
│ │ ├── model.weights/
│ │ │ ├── .data-00000-of-00001
│ │ │ ├── .index
│ │ │ ├── .meta
│ │ │ └── checkpoint
│ │ └── monitor/
│ │ ├── openaigym.episode_batch.0.2799.stats.json
│ │ ├── openaigym.episode_batch.0.3758.stats.json
│ │ ├── openaigym.episode_batch.0.5469.stats.json
│ │ ├── openaigym.manifest.0.2799.manifest.json
│ │ ├── openaigym.manifest.0.3758.manifest.json
│ │ ├── openaigym.manifest.0.5469.manifest.json
│ │ ├── openaigym.video.0.2799.video000000.meta.json
│ │ ├── openaigym.video.0.3758.video000000.meta.json
│ │ └── openaigym.video.0.5469.video000000.meta.json
│ └── utils/
│ ├── __init__.py
│ ├── general.py
│ ├── preprocess.py
│ ├── replay_buffer.py
│ ├── test_env.py
│ ├── viewer.py
│ └── wrappers.py
└── assignment3/
├── discrete_env.py
├── frozen_lake.py
├── q1.py
├── q2.py
├── q3.py
├── requirements.txt
└── utils.py
SYMBOL INDEX (135 symbols across 29 files)
FILE: assignment1/model_based_learning.py
function initialize_P (line 15) | def initialize_P(nS, nA):
function initialize_counts (line 34) | def initialize_counts(nS, nA):
function initialize_rewards (line 53) | def initialize_rewards(nS, nA):
function counts_and_rewards_to_P (line 72) | def counts_and_rewards_to_P(counts, rewards, terminal_state):
function update_mdp_model_with_history (line 113) | def update_mdp_model_with_history(counts, rewards, history):
function learn_with_mdp_model (line 147) | def learn_with_mdp_model(env, method=None, num_episodes=5000, gamma = 0....
function render_single (line 216) | def render_single(env, policy):
function main (line 242) | def main():
FILE: assignment1/model_free_learning.py
function learn_Q_QLearning (line 12) | def learn_Q_QLearning(env, num_episodes=2000, gamma=0.95, lr=0.1, e=0.8,...
function learn_Q_SARSA (line 66) | def learn_Q_SARSA(env, num_episodes=2000, gamma=0.95, lr=0.1, e=0.8, dec...
function render_single_Q (line 116) | def render_single_Q(env, Q):
function main (line 142) | def main():
FILE: assignment1/vi_and_pi.py
function value_iteration (line 12) | def value_iteration(P, nS, nA, gamma=0.9, max_iteration=20, tol=1e-3):
function policy_evaluation (line 68) | def policy_evaluation(P, nS, nA, policy, gamma=0.9, max_iteration=100, t...
function policy_improvement (line 111) | def policy_improvement(P, nS, nA, value_from_policy, policy, gamma=0.9):
function policy_iteration (line 152) | def policy_iteration(P, nS, nA, gamma=0.9, max_iteration=200, tol=1e-3):
function example (line 194) | def example(env):
function render_single (line 215) | def render_single(env, policy):
FILE: assignment2/configs/frozen_lake.py
class config (line 1) | class config():
FILE: assignment2/configs/q2_linear.py
class config (line 1) | class config():
FILE: assignment2/configs/q3_nature.py
class config (line 1) | class config():
FILE: assignment2/configs/q4_train_atari_linear.py
class config (line 1) | class config():
FILE: assignment2/configs/q5_train_atari_nature.py
class config (line 1) | class config():
FILE: assignment2/configs/q6_bonus_question.py
class config (line 1) | class config():
FILE: assignment2/configs/test.py
class config (line 1) | class config():
FILE: assignment2/core/deep_q_learning.py
class DQN (line 9) | class DQN(QN):
method add_placeholders_op (line 13) | def add_placeholders_op(self):
method get_q_values_op (line 17) | def get_q_values_op(self, scope, reuse=False):
method add_update_target_op (line 24) | def add_update_target_op(self, q_scope, target_q_scope):
method add_loss_op (line 37) | def add_loss_op(self, q, target_q):
method add_optimizer_op (line 44) | def add_optimizer_op(self, scope):
method process_state (line 51) | def process_state(self, state):
method build (line 69) | def build(self):
method initialize (line 94) | def initialize(self):
method add_summary (line 116) | def add_summary(self):
method save (line 153) | def save(self):
method get_best_action (line 163) | def get_best_action(self, state):
method update_step (line 177) | def update_step(self, t, replay_buffer, lr):
method update_target_params (line 221) | def update_target_params(self):
FILE: assignment2/core/q_learning.py
class QN (line 16) | class QN(object):
method __init__ (line 20) | def __init__(self, env, config, logger=None):
method build (line 43) | def build(self):
method policy (line 51) | def policy(self):
method save (line 58) | def save(self):
method initialize (line 68) | def initialize(self):
method get_best_action (line 75) | def get_best_action(self, state):
method get_action (line 87) | def get_action(self, state):
method update_target_params (line 100) | def update_target_params(self):
method init_averages (line 107) | def init_averages(self):
method update_averages (line 122) | def update_averages(self, rewards, max_q_values, q_values, scores_eval):
method train (line 144) | def train(self, exp_schedule, lr_schedule):
method train_step (line 241) | def train_step(self, t, replay_buffer, lr):
method evaluate (line 267) | def evaluate(self, env=None, num_episodes=None):
method record (line 323) | def record(self):
method run (line 335) | def run(self, exp_schedule, lr_schedule):
FILE: assignment2/q1_schedule.py
class LinearSchedule (line 5) | class LinearSchedule(object):
method __init__ (line 6) | def __init__(self, eps_begin, eps_end, nsteps):
method update (line 19) | def update(self, t):
class LinearExploration (line 48) | class LinearExploration(LinearSchedule):
method __init__ (line 49) | def __init__(self, env, eps_begin, eps_end, nsteps):
method get_action (line 61) | def get_action(self, best_action):
function test1 (line 91) | def test1():
function test2 (line 104) | def test2():
function test3 (line 112) | def test3():
function your_test (line 120) | def your_test():
FILE: assignment2/q2_linear.py
class Linear (line 12) | class Linear(DQN):
method add_placeholders_op (line 16) | def add_placeholders_op(self):
method get_q_values_op (line 72) | def get_q_values_op(self, state, scope, reuse=False):
method add_update_target_op (line 117) | def add_update_target_op(self, q_scope, target_q_scope):
method add_loss_op (line 165) | def add_loss_op(self, q, target_q):
method add_optimizer_op (line 211) | def add_optimizer_op(self, scope):
FILE: assignment2/q3_nature.py
class NatureQN (line 13) | class NatureQN(Linear):
method get_q_values_op (line 19) | def get_q_values_op(self, state, scope, reuse=False):
FILE: assignment2/q6_double_q_learning.py
class MyDQN (line 17) | class MyDQN(NatureQN):
method add_loss_op (line 38) | def add_loss_op(self, q, target_q):
FILE: assignment2/q6_dueling.py
class MyDQN (line 16) | class MyDQN(Linear):
method get_q_values_op (line 37) | def get_q_values_op(self, state, scope, reuse=False):
FILE: assignment2/utils/general.py
function export_plot (line 11) | def export_plot(ys, ylabel, filename):
function get_logger (line 27) | def get_logger(filename):
class Progbar (line 41) | class Progbar(object):
method __init__ (line 51) | def __init__(self, target, width=30, verbose=1, discount=0.9):
method update (line 63) | def update(self, current, values=[], exact=[], strict=[], exp_avg=[]):
method add (line 156) | def add(self, n, values=[]):
FILE: assignment2/utils/preprocess.py
function greyscale (line 3) | def greyscale(state):
function blackandwhite (line 22) | def blackandwhite(state):
FILE: assignment2/utils/replay_buffer.py
function sample_n_unique (line 4) | def sample_n_unique(sampling_f, n):
class ReplayBuffer (line 15) | class ReplayBuffer(object):
method __init__ (line 19) | def __init__(self, size, frame_history_len):
method can_sample (line 56) | def can_sample(self, batch_size):
method _encode_sample (line 60) | def _encode_sample(self, idxes):
method sample (line 70) | def sample(self, batch_size):
method encode_recent_observation (line 107) | def encode_recent_observation(self):
method _encode_observation (line 120) | def _encode_observation(self, idx):
method store_frame (line 146) | def store_frame(self, frame):
method store_effect (line 174) | def store_effect(self, idx, action, reward, done):
FILE: assignment2/utils/test_env.py
class ActionSpace (line 3) | class ActionSpace(object):
method __init__ (line 4) | def __init__(self, n):
method sample (line 7) | def sample(self):
class ObservationSpace (line 11) | class ObservationSpace(object):
method __init__ (line 12) | def __init__(self, shape):
class EnvTest (line 20) | class EnvTest(object):
method __init__ (line 24) | def __init__(self, shape=(84, 84, 3)):
method reset (line 34) | def reset(self):
method step (line 41) | def step(self, action):
method render (line 56) | def render(self):
FILE: assignment2/utils/viewer.py
class SimpleImageViewer (line 4) | class SimpleImageViewer(object):
method __init__ (line 9) | def __init__(self, display=None):
method imshow (line 15) | def imshow(self, arr):
method close (line 48) | def close(self):
method __del__ (line 54) | def __del__(self):
FILE: assignment2/utils/wrappers.py
class MaxAndSkipEnv (line 8) | class MaxAndSkipEnv(gym.Wrapper):
method __init__ (line 13) | def __init__(self, env=None, skip=4):
method _step (line 20) | def _step(self, action):
method _reset (line 34) | def _reset(self):
class PreproWrapper (line 42) | class PreproWrapper(gym.Wrapper):
method __init__ (line 47) | def __init__(self, env, prepro, shape, overwrite_render=True, high=255):
method _step (line 65) | def _step(self, action):
method _reset (line 74) | def _reset(self):
method _render (line 79) | def _render(self, mode='human', close=False):
FILE: assignment3/discrete_env.py
function categorical_sample (line 6) | def categorical_sample(prob_n, np_random):
class DiscreteEnv (line 16) | class DiscreteEnv(Env):
method __init__ (line 31) | def __init__(self, nS, nA, P, isd):
method _seed (line 44) | def _seed(self, seed=None):
method _reset (line 48) | def _reset(self):
method _step (line 53) | def _step(self, a):
FILE: assignment3/frozen_lake.py
class FrozenLakeEnv (line 23) | class FrozenLakeEnv(discrete_env.DiscreteEnv):
method __init__ (line 50) | def __init__(self, desc=None, map_name="4x4",is_slippery=False):
method _render (line 128) | def _render(self, mode='human', close=False):
FILE: assignment3/q1.py
function rmax (line 11) | def rmax(env, gamma, m, R_max, epsilon, num_episodes, max_step = 6):
function main (line 79) | def main():
FILE: assignment3/q2.py
function learn_Q_QLearning (line 10) | def learn_Q_QLearning(env, num_episodes=10000, gamma = 0.99, lr = 0.1, e...
function main (line 63) | def main():
FILE: assignment3/q3.py
function rmax (line 11) | def rmax(env, gamma, m, R_max, epsilon, num_episodes, max_step = 6, e = ...
function main (line 82) | def main():
FILE: assignment3/utils.py
function render_single_Q (line 8) | def render_single_Q(env, Q, max_step = 6):
Condensed preview — 72 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (152K chars).
[
{
"path": "LICENSE",
"chars": 1067,
"preview": "MIT License\n\nCopyright (c) 2017 Liang Zeng\n\nPermission is hereby granted, free of charge, to any person obtaining a copy"
},
{
"path": "README.md",
"chars": 1998,
"preview": "## My Solution to Assignments of CS234\nThis is my solution to three assignments of CS234.<br>\n[CS234: Deep Reinforcement"
},
{
"path": "assignment1/Makefile",
"chars": 103,
"preview": "submit:\n\tsh collect_submission.sh\n\nclean:\n\trm -f assignment1.zip\n\trm -f *.pyc *.png *.npy utils/*.pyc\n\n"
},
{
"path": "assignment1/collect_submission.sh",
"chars": 58,
"preview": "rm -f assignment1.zip\nzip -r assignment1.zip *.py *.ipynb\n"
},
{
"path": "assignment1/lake_envs.py",
"chars": 691,
"preview": "# coding: utf-8\n\"\"\"Defines some frozen lake maps.\"\"\"\nfrom gym.envs.toy_text import frozen_lake, discrete\nfrom gym.envs.r"
},
{
"path": "assignment1/log",
"chars": 5213,
"preview": "\n Winter is here. You and your friends were tossing around a frisbee at the park\n when you made a wild throw that "
},
{
"path": "assignment1/model_based_learning.py",
"chars": 10334,
"preview": "### Episodic Model Based Learning using Maximum Likelihood Estimate of the Environment\n\n# Do not change the arguments an"
},
{
"path": "assignment1/model_free_learning.py",
"chars": 5603,
"preview": "### Episode model free learning using Q-learning and SARSA\n\n# Do not change the arguments and output types of any of the"
},
{
"path": "assignment1/requirements.txt",
"chars": 16,
"preview": "matplotlib\nnumpy"
},
{
"path": "assignment1/vi_and_pi.py",
"chars": 7896,
"preview": "### MDP Value Iteration and Policy Iteratoin\n# You might not need to use all parameters\n\nimport numpy as np\nimport gym\ni"
},
{
"path": "assignment2/.gitignore",
"chars": 8,
"preview": "/results"
},
{
"path": "assignment2/Makefile",
"chars": 110,
"preview": "submit:\r\n\tsh collect_submission.sh\r\n\r\nclean:\r\n\trm -f assignment1.zip\r\n\trm -f *.pyc *.png *.npy utils/*.pyc\r\n\r\n"
},
{
"path": "assignment2/README.md",
"chars": 1635,
"preview": "# RL with Atari\r\n\r\n## Install\r\n\r\nFirst, install gym and atari environments. You may need to install other dependencies d"
},
{
"path": "assignment2/collect_submission.sh",
"chars": 150,
"preview": "rm -f assignment2.zip \r\nzip -r assignment2.zip . -x \"*.pyc\" \"*.git*\" \"*weights/*\" \"*README.md\" \"*collect_submission.sh\" "
},
{
"path": "assignment2/configs/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "assignment2/configs/frozen_lake.py",
"chars": 1103,
"preview": "class config():\r\n # env config\r\n render_train = False\r\n render_test = False\r\n env_name = \"P"
},
{
"path": "assignment2/configs/q2_linear.py",
"chars": 1091,
"preview": "class config():\r\n # env config\r\n render_train = False\r\n render_test = False\r\n overwrite_render = Tr"
},
{
"path": "assignment2/configs/q3_nature.py",
"chars": 1095,
"preview": "class config():\r\n # env config\r\n render_train = False\r\n render_test = False\r\n overwrite_render = Tr"
},
{
"path": "assignment2/configs/q4_train_atari_linear.py",
"chars": 1266,
"preview": "class config():\r\n # env config\r\n render_train = False\r\n render_test = False\r\n env_name = \"P"
},
{
"path": "assignment2/configs/q5_train_atari_nature.py",
"chars": 1266,
"preview": "class config():\r\n # env config\r\n render_train = False\r\n render_test = False\r\n env_name = \"P"
},
{
"path": "assignment2/configs/q6_bonus_question.py",
"chars": 1267,
"preview": "class config():\r\n # env config\r\n render_train = False\r\n render_test = False\r\n env_name = \"P"
},
{
"path": "assignment2/configs/test.py",
"chars": 1155,
"preview": "class config():\r\n # env config\r\n render_train = True\r\n render_test = False\r\n env_name = \"Po"
},
{
"path": "assignment2/core/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "assignment2/core/deep_q_learning.py",
"chars": 7026,
"preview": "import os\r\nimport numpy as np\r\nimport tensorflow as tf\r\nimport time\r\n\r\nfrom q_learning import QN\r\n\r\n\r\nclass DQN(QN):\r\n "
},
{
"path": "assignment2/core/q_learning.py",
"chars": 11194,
"preview": "import os\r\nimport gym\r\nimport numpy as np\r\nimport logging\r\nimport time\r\nimport sys\r\nfrom gym import wrappers\r\nfrom colle"
},
{
"path": "assignment2/q1_schedule.py",
"chars": 3976,
"preview": "import numpy as np\r\nfrom utils.test_env import EnvTest\r\n\r\n\r\nclass LinearSchedule(object):\r\n def __init__(self, eps_be"
},
{
"path": "assignment2/q2_linear.py",
"chars": 12005,
"preview": "import tensorflow as tf\r\nimport tensorflow.contrib.layers as layers\r\n\r\nfrom utils.general import get_logger\r\nfrom utils."
},
{
"path": "assignment2/q3_nature.py",
"chars": 3293,
"preview": "import tensorflow as tf\r\nimport tensorflow.contrib.layers as layers\r\n\r\nfrom utils.general import get_logger\r\nfrom utils."
},
{
"path": "assignment2/q4_train_atari_linear.py",
"chars": 1550,
"preview": "import gym\r\nfrom utils.preprocess import greyscale\r\nfrom utils.wrappers import PreproWrapper, MaxAndSkipEnv\r\n\r\nfrom q1_s"
},
{
"path": "assignment2/q5_train_atari_nature.py",
"chars": 1548,
"preview": "import gym\r\nfrom utils.preprocess import greyscale\r\nfrom utils.wrappers import PreproWrapper, MaxAndSkipEnv\r\n\r\nfrom q1_s"
},
{
"path": "assignment2/q6_double_q_learning.py",
"chars": 4406,
"preview": "import gym\r\nfrom utils.preprocess import greyscale\r\nfrom utils.wrappers import PreproWrapper, MaxAndSkipEnv\r\n\r\nimport te"
},
{
"path": "assignment2/q6_dueling.py",
"chars": 4926,
"preview": "import gym\r\nfrom utils.preprocess import greyscale\r\nfrom utils.wrappers import PreproWrapper, MaxAndSkipEnv\r\n\r\nimport te"
},
{
"path": "assignment2/requirements.txt",
"chars": 22,
"preview": "matplotlib\r\nnumpy\r\nsix"
},
{
"path": "assignment2/results/q2_linear/log.txt",
"chars": 2395,
"preview": "2017-11-28 20:52:49,822:INFO: Evaluating...\n2017-11-28 20:52:50,064:INFO: Average reward: -0.50 +/- 0.00\n2017-11-28 20:5"
},
{
"path": "assignment2/results/q2_linear/model.weights/checkpoint",
"chars": 59,
"preview": "model_checkpoint_path: \".\"\nall_model_checkpoint_paths: \".\"\n"
},
{
"path": "assignment2/results/q3_nature/log.txt",
"chars": 1088,
"preview": "2017-11-28 21:36:35,366:INFO: Evaluating...\n2017-11-28 21:36:35,752:INFO: Average reward: 0.00 +/- 0.00\n2017-11-28 21:36"
},
{
"path": "assignment2/results/q3_nature/model.weights/checkpoint",
"chars": 59,
"preview": "model_checkpoint_path: \".\"\nall_model_checkpoint_paths: \".\"\n"
},
{
"path": "assignment2/results/q4_train_atari_linear/log.txt",
"chars": 1756,
"preview": "2017-11-29 16:06:16,994:INFO: Making new env: Pong-v0\n2017-11-29 16:06:17,179:INFO: Creating monitor directory results/q"
},
{
"path": "assignment2/results/q4_train_atari_linear/model.weights/checkpoint",
"chars": 59,
"preview": "model_checkpoint_path: \".\"\nall_model_checkpoint_paths: \".\"\n"
},
{
"path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.episode_batch.0.2799.stats.json",
"chars": 160,
"preview": "{\"timestamps\": [1512023846.375136], \"initial_reset_timestamp\": 1512023842.709429, \"episode_types\": [\"t\"], \"episode_lengt"
},
{
"path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.episode_batch.0.3758.stats.json",
"chars": 160,
"preview": "{\"timestamps\": [1512014521.383348], \"initial_reset_timestamp\": 1512014518.909645, \"episode_types\": [\"t\"], \"episode_lengt"
},
{
"path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.episode_batch.0.5469.stats.json",
"chars": 160,
"preview": "{\"timestamps\": [1511942778.615624], \"initial_reset_timestamp\": 1511942777.179417, \"episode_types\": [\"t\"], \"episode_lengt"
},
{
"path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.manifest.0.2799.manifest.json",
"chars": 217,
"preview": "{\"env_info\": {\"env_id\": \"Pong-v0\", \"gym_version\": \"0.9.3\"}, \"stats\": \"openaigym.episode_batch.0.2799.stats.json\", \"video"
},
{
"path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.manifest.0.3758.manifest.json",
"chars": 217,
"preview": "{\"env_info\": {\"env_id\": \"Pong-v0\", \"gym_version\": \"0.9.3\"}, \"stats\": \"openaigym.episode_batch.0.3758.stats.json\", \"video"
},
{
"path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.manifest.0.5469.manifest.json",
"chars": 217,
"preview": "{\"env_info\": {\"env_id\": \"Pong-v0\", \"gym_version\": \"0.9.3\"}, \"stats\": \"openaigym.episode_batch.0.5469.stats.json\", \"video"
},
{
"path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.2799.video000000.meta.json",
"chars": 864,
"preview": "{\"encoder_version\": {\"cmdline\": [\"avconv\", \"-nostats\", \"-loglevel\", \"error\", \"-y\", \"-r\", \"30\", \"-f\", \"rawvideo\", \"-s:v\","
},
{
"path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.3758.video000000.meta.json",
"chars": 864,
"preview": "{\"encoder_version\": {\"cmdline\": [\"avconv\", \"-nostats\", \"-loglevel\", \"error\", \"-y\", \"-r\", \"30\", \"-f\", \"rawvideo\", \"-s:v\","
},
{
"path": "assignment2/results/q4_train_atari_linear/monitor/openaigym.video.0.5469.video000000.meta.json",
"chars": 864,
"preview": "{\"encoder_version\": {\"cmdline\": [\"avconv\", \"-nostats\", \"-loglevel\", \"error\", \"-y\", \"-r\", \"30\", \"-f\", \"rawvideo\", \"-s:v\","
},
{
"path": "assignment2/utils/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "assignment2/utils/general.py",
"chars": 5451,
"preview": "import time\r\nimport sys\r\nimport logging\r\nimport numpy as np\r\nfrom collections import deque\r\nimport matplotlib\r\nmatplotli"
},
{
"path": "assignment2/utils/preprocess.py",
"chars": 929,
"preview": "import numpy as np\r\n\r\ndef greyscale(state):\r\n \"\"\"\r\n Preprocess state (210, 160, 3) image into\r\n a (80, 80, 1) i"
},
{
"path": "assignment2/utils/replay_buffer.py",
"chars": 7916,
"preview": "import numpy as np\r\nimport random\r\n\r\ndef sample_n_unique(sampling_f, n):\r\n \"\"\"Helper function. Given a function `samp"
},
{
"path": "assignment2/utils/test_env.py",
"chars": 1721,
"preview": "import numpy as np\r\n\r\nclass ActionSpace(object):\r\n def __init__(self, n):\r\n self.n = n\r\n\r\n def sample(self)"
},
{
"path": "assignment2/utils/viewer.py",
"chars": 1727,
"preview": "import pyglet\r\n\r\n\r\nclass SimpleImageViewer(object):\r\n \"\"\"\r\n Modified version of gym viewer to chose format (RBG or"
},
{
"path": "assignment2/utils/wrappers.py",
"chars": 3257,
"preview": "import numpy as np\r\nimport gym\r\nfrom gym import spaces\r\nfrom viewer import SimpleImageViewer\r\nfrom collections import de"
},
{
"path": "assignment3/discrete_env.py",
"chars": 1515,
"preview": "import numpy as np\n\nfrom gym import Env, spaces\nfrom gym.utils import seeding\n\ndef categorical_sample(prob_n, np_random)"
},
{
"path": "assignment3/frozen_lake.py",
"chars": 5108,
"preview": "import numpy as np\nimport sys\nfrom six import StringIO, b\n\nfrom gym import utils\nimport discrete_env\n\nLEFT = 0\nDOWN = 1\n"
},
{
"path": "assignment3/q1.py",
"chars": 3424,
"preview": "import math\nimport gym\nfrom frozen_lake import *\nimport numpy as np\nimport time\nfrom utils import *\nimport matplotlib.py"
},
{
"path": "assignment3/q2.py",
"chars": 2676,
"preview": "import math\nimport gym\nfrom frozen_lake import *\nimport numpy as np\nimport time\nfrom utils import *\nfrom tqdm import *\ni"
},
{
"path": "assignment3/q3.py",
"chars": 3438,
"preview": "import math\nimport gym\nfrom frozen_lake import *\nimport numpy as np\nimport time\nfrom utils import *\nimport matplotlib.py"
},
{
"path": "assignment3/requirements.txt",
"chars": 20,
"preview": "matplotlib\nnumpy\nsix"
},
{
"path": "assignment3/utils.py",
"chars": 739,
"preview": "import math\nimport gym\nfrom frozen_lake import *\nimport numpy as np\nimport time\n\n\ndef render_single_Q(env, Q, max_step ="
}
]
// ... and 10 more files (download for full content)
About this extraction
This page contains the full source code of the zlpure/CS234 GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 72 files (136.8 KB), approximately 38.3k tokens, and a symbol index with 135 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.