Repository: dennybritz/reinforcement-learning
Branch: master
Commit: 2b832284894a
Files: 64
Total size: 2.2 MB
Directory structure:
gitextract_nce2oqyi/
├── .gitignore
├── DP/
│   ├── Gamblers Problem Solution.ipynb
│   ├── Gamblers Problem.ipynb
│   ├── Policy Evaluation Solution.ipynb
│   ├── Policy Evaluation.ipynb
│   ├── Policy Iteration Solution.ipynb
│   ├── Policy Iteration.ipynb
│   ├── README.md
│   ├── Value Iteration Solution.ipynb
│   └── Value Iteration.ipynb
├── DQN/
│   ├── .gitignore
│   ├── Breakout Playground.ipynb
│   ├── Deep Q Learning Solution.ipynb
│   ├── Deep Q Learning.ipynb
│   ├── Double DQN Solution.ipynb
│   ├── README.md
│   └── dqn.py
├── FA/
│   ├── MountainCar Playground.ipynb
│   ├── Q-Learning with Value Function Approximation Solution.ipynb
│   ├── Q-Learning with Value Function Approximation.ipynb
│   └── README.md
├── Introduction/
│   └── README.md
├── LICENSE
├── MC/
│   ├── Blackjack Playground.ipynb
│   ├── MC Control with Epsilon-Greedy Policies Solution.ipynb
│   ├── MC Control with Epsilon-Greedy Policies.ipynb
│   ├── MC Prediction Solution.ipynb
│   ├── MC Prediction.ipynb
│   ├── Off-Policy MC Control with Weighted Importance Sampling Solution.ipynb
│   ├── Off-Policy MC Control with Weighted Importance Sampling.ipynb
│   └── README.md
├── MDP/
│   └── README.md
├── PolicyGradient/
│   ├── CliffWalk Actor Critic Solution.ipynb
│   ├── CliffWalk REINFORCE with Baseline Solution.ipynb
│   ├── Continuous MountainCar Actor Critic Solution.ipynb
│   ├── README.md
│   └── a3c/
│       ├── README.md
│       ├── estimator_test.py
│       ├── estimators.py
│       ├── policy_monitor.py
│       ├── policy_monitor_test.py
│       ├── train.py
│       ├── worker.py
│       └── worker_test.py
├── README.md
├── TD/
│   ├── Cliff Environment Playground.ipynb
│   ├── Q-Learning Solution.ipynb
│   ├── Q-Learning.ipynb
│   ├── README.md
│   ├── SARSA Solution.ipynb
│   ├── SARSA.ipynb
│   └── Windy Gridworld Playground.ipynb
├── __init__.py
└── lib/
    ├── __init__.py
    ├── atari/
    │   ├── __init__.py
    │   ├── helpers.py
    │   └── state_processor.py
    ├── envs/
    │   ├── __init__.py
    │   ├── blackjack.py
    │   ├── cliff_walking.py
    │   ├── discrete.py
    │   ├── gridworld.py
    │   └── windy_gridworld.py
    └── plotting.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
experiments/
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# IPython Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# dotenv
.env
# virtualenv
venv/
ENV/
# Spyder project settings
.spyderproject
# Rope project settings
.ropeproject
### IPythonNotebook ###
# Temporary data
.ipynb_checkpoints/
================================================
FILE: DP/Gamblers Problem Solution.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### This is Example 4.3. Gambler’s Problem from Sutton's book.\n",
"\n",
"A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. \n",
"If the coin comes up heads, he wins as many dollars as he has staked on that flip; \n",
"if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal of $100, \n",
"or loses by running out of money. \n",
"\n",
"On each flip, the gambler must decide what portion of his capital to stake, in integer numbers of dollars. \n",
"This problem can be formulated as an undiscounted, episodic, finite MDP. \n",
"\n",
"The state is the gambler’s capital, s ∈ {1, 2, . . . , 99}.\n",
"The actions are stakes, a ∈ {0, 1, . . . , min(s, 100 − s)}. \n",
"The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n",
"\n",
"The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let p_h denote the probability of the coin coming up heads. If p_h is known, then the entire problem is known and it can be solved, for instance, by value iteration.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import sys\n",
"import matplotlib.pyplot as plt\n",
"if \"../\" not in sys.path:\n",
"    sys.path.append(\"../\") "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"\n",
"### Exercise 4.9 (programming)\n",
"\n",
"Implement value iteration for the gambler’s problem and solve it for p_h = 0.25 and p_h = 0.55."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def value_iteration_for_gamblers(p_h, theta=0.0001, discount_factor=1.0):\n",
"    \"\"\"\n",
"    Args:\n",
"        p_h: Probability of the coin coming up heads.\n",
"        theta: Stop once the largest value change in a sweep falls below this threshold.\n",
"        discount_factor: Gamma discount factor (1.0 because the problem is undiscounted).\n",
"\n",
"    Returns:\n",
"        A tuple (policy, V) of the greedy stake for each capital level and the value function.\n",
"    \"\"\"\n",
"    # The reward is zero on all transitions except those on which the gambler reaches his goal,\n",
"    # when it is +1.\n",
"    rewards = np.zeros(101)\n",
"    rewards[100] = 1 \n",
"\n",
"    # We introduce two dummy states corresponding to termination with capital of 0 and 100\n",
"    V = np.zeros(101)\n",
"\n",
"    def one_step_lookahead(s, V, rewards):\n",
"        \"\"\"\n",
"        Helper function to calculate the value of every action in a given state.\n",
"\n",
"        Args:\n",
"            s: The gambler’s capital. Integer.\n",
"            V: The vector that contains values at each state. \n",
"            rewards: The reward vector.\n",
"\n",
"        Returns:\n",
"            A vector containing the expected value of each action. \n",
"            It has length 101 and is indexed by stake; entries for stakes that are not allowed stay 0.\n",
"        \"\"\"\n",
"        A = np.zeros(101)\n",
"        stakes = range(1, min(s, 100-s)+1) # Your minimum bet is 1, maximum bet is min(s, 100-s).\n",
"        for a in stakes:\n",
"            # rewards[s+a], rewards[s-a] are immediate rewards.\n",
"            # V[s+a], V[s-a] are values of the next states.\n",
"            # This is the core of the Bellman equation: The expected value of your action is \n",
"            # the sum of immediate rewards and the value of the next state.\n",
"            A[a] = p_h * (rewards[s+a] + V[s+a]*discount_factor) + (1-p_h) * (rewards[s-a] + V[s-a]*discount_factor)\n",
"        return A\n",
"\n",
"    while True:\n",
"        # Stopping condition\n",
"        delta = 0\n",
"        # Update each state...\n",
"        for s in range(1, 100):\n",
"            # Do a one-step lookahead to find the best action\n",
"            A = one_step_lookahead(s, V, rewards)\n",
"            # print(s,A,V) # if you want to debug.\n",
"            best_action_value = np.max(A)\n",
"            # Calculate delta across all states seen so far\n",
"            delta = max(delta, np.abs(best_action_value - V[s]))\n",
"            # Update the value function. Ref: Sutton book eq. 4.10. \n",
"            V[s] = best_action_value \n",
"        # Check if we can stop \n",
"        if delta < theta:\n",
"            break\n",
"\n",
"    # Create a deterministic policy using the optimal value function\n",
"    policy = np.zeros(100)\n",
"    for s in range(1, 100):\n",
"        # One step lookahead to find the best action for this state\n",
"        A = one_step_lookahead(s, V, rewards)\n",
"        best_action = np.argmax(A)\n",
"        # Always take the best action\n",
"        policy[s] = best_action\n",
"\n",
"    return policy, V"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimized Policy:\n",
"[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 12. 11. 15. 16. 17.\n",
" 18. 6. 20. 21. 3. 23. 24. 25. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.\n",
" 11. 12. 38. 11. 10. 9. 42. 7. 44. 5. 46. 47. 48. 49. 50. 1. 2. 3.\n",
" 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 11. 10. 9. 17. 7. 19. 5. 21.\n",
" 22. 23. 24. 25. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 12. 11.\n",
" 10. 9. 8. 7. 6. 5. 4. 3. 2. 1.]\n",
"\n",
"Optimized Value Function:\n",
"[0.00000000e+00 7.24792480e-05 2.89916992e-04 6.95257448e-04\n",
" 1.16010383e-03 1.76906586e-03 2.78102979e-03 4.03504074e-03\n",
" 4.66214120e-03 5.59997559e-03 7.08471239e-03 9.03964043e-03\n",
" 1.11241192e-02 1.56793594e-02 1.61464431e-02 1.69517994e-02\n",
" 1.86512806e-02 1.98249817e-02 2.24047303e-02 2.73845196e-02\n",
" 2.83388495e-02 3.04937363e-02 3.61633897e-02 3.84953022e-02\n",
" 4.44964767e-02 6.25000000e-02 6.27174377e-02 6.33700779e-02\n",
" 6.45857723e-02 6.59966059e-02 6.78135343e-02 7.08430894e-02\n",
" 7.46098323e-02 7.64884604e-02 7.93035477e-02 8.37541372e-02\n",
" 8.96225423e-02 9.58723575e-02 1.09538078e-01 1.10939329e-01\n",
" 1.13360151e-01 1.18457374e-01 1.21977661e-01 1.29716907e-01\n",
" 1.44653559e-01 1.47520113e-01 1.53983246e-01 1.70990169e-01\n",
" 1.77987434e-01 1.95990576e-01 2.50000000e-01 2.50217438e-01\n",
" 2.50870078e-01 2.52085772e-01 2.53496606e-01 2.55313534e-01\n",
" 2.58343089e-01 2.62109832e-01 2.63988460e-01 2.66803548e-01\n",
" 2.71254137e-01 2.77122542e-01 2.83372357e-01 2.97038078e-01\n",
" 2.98439329e-01 3.00860151e-01 3.05957374e-01 3.09477661e-01\n",
" 3.17216907e-01 3.32153559e-01 3.35020113e-01 3.41483246e-01\n",
" 3.58490169e-01 3.65487434e-01 3.83490576e-01 4.37500000e-01\n",
" 4.38152558e-01 4.40122454e-01 4.43757317e-01 4.47991345e-01\n",
" 4.53440603e-01 4.62529268e-01 4.73829497e-01 4.79468031e-01\n",
" 4.87912680e-01 5.01265085e-01 5.18867627e-01 5.37617932e-01\n",
" 5.78614419e-01 5.82817988e-01 5.90080452e-01 6.05372123e-01\n",
" 6.15934510e-01 6.39150720e-01 6.83960814e-01 6.92560339e-01\n",
" 7.11950883e-01 7.62970611e-01 7.83963162e-01 8.37972371e-01\n",
" 0.00000000e+00]\n",
"\n"
]
}
],
"source": [
"policy, v = value_iteration_for_gamblers(0.25)\n",
"\n",
"print(\"Optimized Policy:\")\n",
"print(policy)\n",
"print(\"\")\n",
"\n",
"print(\"Optimized Value Function:\")\n",
"print(v)\n",
"print(\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Show your results graphically, as in Figure 4.3.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEWCAYAAACJ0YulAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAIABJREFUeJzt3Xd8HNW5//HPI8mqlmRky7jjbmMb\nQhGmJKGH0OEmJIFAQkng5hJCCKSQhCSENNJuknshxaH3UPIjhksghNCbLReMC26Si1xlSbZlyerP\n748ZKWtZZW1rtCrf9+u1L+3Mnp15zs5qnznnTDF3R0REBCAp0QGIiEjPoaQgIiItlBRERKSFkoKI\niLRQUhARkRZKCiIi0kJJIUJmNsbMdplZchcs6z4z+3FXxNVquW5mE8PnfzSz70Wwjv80s99GsNxL\nzewfXb3crrS/283MDjezt6KIqaczszPM7OluWtd3zOyuA3j/GjM7PXx+vZnd3nXRJYaSQhcIvxi7\nwwTQ/Bjh7uvcfaC7N0a8/ivMrDFc704zW2hm5+7rctz9S+7+oy6OLRW4BfjlAS5nbJjAUprnufvD\n7n7Ggca4j3G0/AhEyd0XAdvN7Lwo12NmqWb2azMrCb8/xWb2m5jX96m+XbTz8lOg5cfVAteb2WIz\nqwpjfcLMDjvA9eDuP3X3L4br2es7to9mAZeZ2dADjSuRlBS6znlhAmh+bOzm9b/t7gOBQcDdwONm\nltfNMbTlAuADd9+Q6EB6oYeB/4x4Hd8GCoCZQDZwCrAg4nW2y8yOAXLd/Z2Y2b8DvgpcD+QBk4Gn\ngXO6P8L2uXsN8Hfg84mO5UAoKUSo9Z6Hmb1iZj8yszfNrNLM/mFmQ2LKP2Fmm81sh5m9ZmbT93Wd\n7t4E3ANkAOPD5V5tZqvMrNzMZpvZiHbi3WMvz8wuCFsdO81stZmdaWafMrN5rd53UwfN/bOAV1uV\nb7eeZpYR7rmuDV9/w8wygNfCItvDPdrjwxbSGzHvPcHM5obvm2tmJ8S81uFn3yq+IWb2rJltDz+z\n180sycweBMYAz4QxfLOz+rRabraZvWxm/xPu/aaZ2a/MbJ2ZbbGg+y4j5i2vAKeZWVoby7rYzApb\nzfuamc0On59tZkvDum4ws6+3FRNwDPD/3H2jB9a4+wPhMvapvmZ2DXAp8M2w/DPh/BFm9pSZlVrQ\nErm+nVig1ffFzCYBXwYucfd/uXutu1eHrcTbwzLnmNmC8Hu63sxujXl/8//gNWa20cw2mdlNMa/f\namYPhZNtfccmmNm/zKzMzLaZ2cNmNqiD+F+hhyWrfebuehzgA1gDnN7G/LGAAynh9CvAaoI9nYxw\n+vaY8lcR7K2lAb8FFsa8dh/w43bWfwXwRvg8hWCvqhLIBU4FtgFHhcv9X+C1mPc6MLH1Ogj2HHcA\nHyPYeRgJTA2XUQ4cGrOMBcAn24ltLvCpVvM6qued4ecyEkgGTgjL7fFZtlHvPKAC+Fz4GVwSTg+O\n57NvFd/PgD8CA8LHRwFrb1vHs92AwcCc2G0Ylp0dxp4NPAP8rNWydwKHtxFjZriNJ7X6rC8On28C\nPho+Pwg4qp263gKsA64FDmuuZ0ff7XjqGzOdBMwDvg+kEuyoFAEfbyeeJ4BvxEx/CVjbyf/fyWHs\nScDhwBbgwlb/g48CWWG50uY6AbcCD7X1/xrOm0jwP5AG5BMkjt+29/kQ/J+VJ/L36EAfCQ+gLzzC\nL8YuYHv4eDqcv8eXjOCH6JaY910LPN/OMgeF780Np/f4Z2tV9gqgIVz3NuCdmC/93cAvYsoOBOqB\nseF0e0nhT8Bv2lnfH4CfhM+nE/z4prVTdiVwZgefXUs9w3/q3cCH2ijX1j/sFfw7KXwOmNPqPW8D\nV+zHZ38b8Lfmz6WNbb3XDkAn2+0eYDF7/tgZ
UAVMiJl3PFDcankbgBPbWddDwPfD55MIkkRmOL2O\noOspp5PvbjLBnvibQC2wEbj8AOsbmxSOBda1es+3gXvbWd6LwJdipr8LvLOP/4+/bf7uxnxvpsa8\n/gvg7vD5rXSQFNpY9oXAgvY+n3A7NO5LvD3toe6jrnOhuw8KHxd2UG5zzPNqgh9pzCzZzG4Pu2l2\nEnzZANrs4mjDO+G6h7j7ce7+z3D+CGBtcyF33wWUEeyJd2Q0wZ51W+4HPmtmRvBj/Li717ZTtoJg\nrxLotJ5DgPQO1tuRPeoZWsue9Wzzs2/DL4FVwD/MrMjMbm5vpXFut3MIWid/jJmXT7C3Py/sptoO\nPB/Oj5VNkOzb8ghBiwjgswQ7I9Xh9CeBs4G1ZvaqmR3f1gLcvdHd73T3DxP8wP8EuMfMDj2A+sY6\nBBjRXMewnt8BDm6n/B7fF4Lv6vB2yjbHdGzYLVdqZjsIWhet41kf83wtwfelU2Y21MweC7vgdhIk\n4o7+J7MJWti9lpJCz/FZgkHZ0wn2mseG8+0Al7uR4B8zWJhZFkFXRmcDv+uBCW294MEgYB1Bt8pn\ngQc7WM4igi6bZh3VcxtQ0856O7uc7x71DI2h83ruvSL3Sne/yd3HA+cBN5rZae3EEc92+zPBD/5z\n4ecPQV13A9NjdiZyPThYIFhAMPaTCixvJ9R/AEPM7AiC5PBITB3muvsFwFCCQdnH46j3bne/k+CH\nedp+1rd1+fUErZ9BMY9sdz+7nTBaf19eAkaZWUEHoT9C0A032t1zCZJv6/+b0THPxxB8X1pr6zv2\ns3D+4e6eA1zWxrJjHQq818HrPZ6SQs+RTdB8LyPYg/xpFy33EeBKMzsiHLD8KfCuu6/p5H13h+87\nLRxkHWlmU2NefwC4A2hw9zfaXgQAzwEnxUy3W0//9yD5f4eDk8nhYF8aQT9wE+HgeTvrmWxmnzWz\nFDP7DMEP27Od1HMvZnaumU0MW0I7gcbwAUF/dWwM8W636wh+3J81s4ywrn8GfmPhIYzhZ/zxmPec\nDPyrvVaYuzcATxK0bPIIul6aDzO91Mxy3b0+pg5t1fUGMzvZggH+FDO7PKxT8xFI+1rf1uXnADvN\n7FvhOpLNbIYFRxm1ZY/vi7uvBH4PPBrGmWpm6RYMtDe34LIJ+vFrzGwmQeJq7XtmlhkOil8J/KWN\nMm19x7IJu4bNbCTwjXbibnYSwRFIvZaSQs/xAEGzdgOwlGBc4IC5+0vA94CnCAYfJwAXx/G+OQT/\nPL8haA6/yp574g8CM+i4lQDB4OlU+/cRT53V8+vA+wSDpuXAz4GksFvkJ8CbYTfEca3iLQPOBW4i\n+MH6JnCuu2/rrK5tmAT8k+DH4G3g9+7+Svjaz4Bbwhi+Hkd9muNz4BqCPee/mVk68C2Cbqp3wq6J\nfwJTYt52KXt2ObXlEYK99ifCJNHsc8CacLlfItjDbctu4NcEXWvbCMYXPunuRftZ37uBaWH5pz04\nR+c84AigOFzHXQStjL24+3xgh5kdGzP7eoIdkDsJutJWA/9B8N2CYHzoNjOrJBjQbqtV9CrBZ/0S\n8Ct33+ukx3a+Yz8kGDzeAfwf8Ne24gYIt+nZBN2rvVbzERUi+8SCQye3EhzVsrKTstcA09z9hm4J\nrg+w4MSsWe7e5lhAX2ZmZwDXdjI2F++yxhIkowGtkmaXM7OvEHRhfTPK9URNSUH2i5ndSLAnfmqi\nYxFpT3cmhb5if0/nln7MzNYQDLYd8J6ciPQsaimIiEgLDTSLiEiLXtd9NGTIEB87dmyiwxAR6VXm\nzZu3zd1bnxy5l16XFMaOHUthYWHnBUVEpIWZtT7jv03qPhIRkRZKCiIi0kJJQUREWigpiIhICyUF\nERFpoaQgIiItlBRERKSFkoKISA/X1OT89LllLCpp7yZ8XUdJQUSkh1uxtZJZrxWxcsuuyNelpCAi\n0sPNKS4H
YOa4vMjXpaQgItLDvVtczojcdEYdlBH5upQURER6MHdnTnE5M8flEdw2PFpKCiIiPdja\nsmpKK2s5phu6jkBJQUSkR2seTzi2LyQFMzvTzJab2Sozu7mN18eY2ctmtsDMFpnZ2VHGIyLS27xb\nXE5eVioT8gd2y/oiSwpmlgzcCZwFTAMuMbNprYrdAjzu7kcCFwO/jyoeEZHeaM6aMmaO7Z7xBIi2\npTATWOXuRe5eBzwGXNCqjAM54fNcYGOE8YiI9CqbduxmffnubjkUtVmUSWEksD5muiScF+tW4DIz\nKwGeA77S1oLM7BozKzSzwtLS0ihiFRHpcbrz/IRmUSaFtto63mr6EuA+dx8FnA08aGZ7xeTus9y9\nwN0L8vM7vcWoiEifMKe4nIFpKRw6PKfzwl0kyqRQAoyOmR7F3t1DXwAeB3D3t4F0YEiEMYmI9Bpz\nisspGHsQyUndM54A0SaFucAkMxtnZqkEA8mzW5VZB5wGYGaHEiQF9Q+JSL+3dWcNK7fu6tauI4gw\nKbh7A3Ad8AKwjOAooyVmdpuZnR8Wuwm42szeAx4FrnD31l1MIiL9zs+fX05KkvHx6cO6db0pUS7c\n3Z8jGECOnff9mOdLgQ9HGYOISG/z9uoynppfwrUnT+i28xOa6YxmEZEepLahke8+/T6j8zL4yqmT\nun39kbYURERk3/zp1SKKSqu478pjyEhN7vb1q6UgItJDbNy+mzteXsU5hw3n5ClDExKDkoKISA/x\n0rIt1DU0cdMZkxMWg5KCiEgP8XZRGSNy0xk3JCthMSgpiIj0AE1NzturyzhuwuBuu/hdW5QURER6\ngOVbKqmorueECYm9qIOSgohID/D26jIAjp8wOKFxKCmIiPQAb60u45DBmYwclJHQOJQUREQSrLHJ\nebe4jOPHJ7aVAEoKIiIJt2TjDiprGhLedQRKCiIiCdcynqCWgoiIvLW6jAn5WQzNSU90KEoKIiKJ\nVN/YxNw15Qk/FLWZkoKISALNW1tBdV1jjxhPACUFEZGEqWto4rZnljJkYBofmdQzWgq6dLaISILc\n+fIqlm7ayazPHU1O+oBEhwOopSAikhCLN+zgzpdX8R9HjuSMbr7lZkeUFEREulltQyNff+I98rJS\n+cF50xIdzh7UfSQi0o1Wba3kG08u4oPNldx9eQGDMlMTHdIelBRERLpBQ2MTs14v4rf/XElmajK/\nu/gITjv04ESHtRclBRGRbnDfW2v4xfPLOWvGMG67YAb52WmJDqlNSgoiIt3g+cWbmTEyhz9cdnSi\nQ+mQBppFRCJWUVXH/HUVnDq153UXtaakICISsddWltLkcMqU/ESH0iklBRGRiL38wVYGZ6XyoVGD\nEh1Kp5QUREQi1NjkvLqilJMm55OUZIkOp1NKCiIiEVq4fjsV1fWcMnVookOJi5KCiEiEXlm+leQk\n48RJPX88AZQUREQi9a8PtnL0mIPIzewZF7zrjJKCiEhEtuysYcnGnZw8tXe0EkBJQUQkMq8s3wrA\nqb1kPAGUFEREIlHf2MSs14oYn5/FlIOzEx1O3JQUREQi8NicdawureLmM6di1vMPRW2mpCAi0sV2\n1tTzm3+u5LjxeXxsWs+/tEUsJQURkS5258urqKiu45ZzpvWqVgIoKYiIdKn15dXc+8YaPnHkKGaM\nzE10OPtMSUFEpIts3L6bLz00j6Qk+MbHpyQ6nP0SaVIwszPNbLmZrTKzm9sp82kzW2pmS8zskSjj\nERGJytw15Zx/xxusLavm95cexbDc9ESHtF8iu8mOmSUDdwIfA0qAuWY2292XxpSZBHwb+LC7V5hZ\n7zmYV0Qk9Nf5JXzrqUWMOiiTx645molDe88hqK1Feee1mcAqdy8CMLPHgAuApTFlrgbudPcKAHff\nGmE8IiJd7tUVpXzjyUUcOy6PP1x2NLkZveNyFu2JsvtoJLA+ZroknBdrMj
DZzN40s3fM7My2FmRm\n15hZoZkVlpaWRhSuiMi+Wb65kusens+koQOZ9fmCXp8QINqk0NZxWN5qOgWYBJwMXALcZWZ73YXC\n3We5e4G7F+Tn955riIhI31VaWctV980lPTWZe644hoFpfeOW91EmhRJgdMz0KGBjG2X+5u717l4M\nLCdIEiIiPZK78/zizVz0x7coq6rl7ssLGDEoI9FhdZkok8JcYJKZjTOzVOBiYHarMk8DpwCY2RCC\n7qSiCGMSEdkvTU3Ou0VlfGbWO3zpoXkMSE7ivitncngvuMXmvtin9o6ZHQSMdvdFnZV19wYzuw54\nAUgG7nH3JWZ2G1Do7rPD184ws6VAI/ANdy/b51qIiERk6cadPDZ3HS8s2cyWnbUMzkrlxxfO4OJj\nRpOS3PdO9TL31t38rQqYvQKcT5BAFgKlwKvufmPk0bWhoKDACwsLE7FqEelnauobmfmTf1LX2MTJ\nk4dy5oxhnD7t4F45fmBm89y9oLNy8dQs1913mtkXgXvd/Qdm1mlLQUSkt3tr9TZ21jRw75XHcMqU\n/nEaVTxtnxQzGw58Gng24nhERHqMFxZvITsthRMmDE50KN0mnqRwG0Hf/2p3n2tm44GV0YYlIpJY\nDY1NvLhsC6dMHUpaSnKiw+k2nXYfufsTwBMx00XAJ6MMSkQk0QrXVlBeVcfHpw9LdCjdqtOWgplN\nNrOXzGxxOH24md0SfWgiIonz/OLNpKYkcfKU/nXCbDzdR38muGhdPUB4OOrFUQYlIpJI7s4/lmzm\nxElDyOqFRxodiHiSQqa7z2k1ryGKYEREeoL3N+xg444azuhnXUcQX1LYZmYTCK9bZGYXAZsijUpE\nJIFeWLKZ5CTj9EN71/2Vu0I87aIvA7OAqWa2ASgGLo00KhGRBGlqcv6+eDMzx+aRl5Wa6HC6XTxJ\nwd39dDPLApLcvdLMxkUdmIhIItz31hqKSqv4yqkTEx1KQsTTffQUgLtXuXtlOO/J6EISEUmMlVsq\nuf35Dzht6lAuPKL17V/6h3ZbCmY2FZgO5JrZJ2JeygF6581HRUTaUdfQxNceX8jAtBRu/+ThmLV1\nS5i+r6PuoynAucAg4LyY+ZUEt9EUEekz/uellSzesJM/fe5o8rPTEh1OwrSbFNz9b8DfzOx4d3+7\nG2MSEelW7xSV8ftXVnHR0aP63RnMrcUz0LzAzL5M0JXU0m3k7ldFFpWISDcp21XLVx9bwCGDs7j1\n/OmJDifh4hlofhAYBnwceJXgtpqVHb5DRKQXaGpybnriPSqq67njs0f2yvskdLV4ksJEd/8eUOXu\n9wPnAIdFG5aISPTueqOIV5aX8r1zDmX6iNxEh9MjxJMU6sO/281sBpALjI0sIhGRbvCvD7bwi+eX\nc9aMYVx23CGJDqfHiKetNCu8N/P3gNnAQOD7kUYlIhKhV1eU8qUH5zNtRA4/v6j/Hn7alnjup3BX\n+PRVYHy04YiIROutVdu45oFCJg4dyANXzSQnfUCiQ+pROk0KZjYI+DxBl1FLeXe/PrqwRES6VlOT\n8+jcdfz42WWMHZzFQ188lkGZ/e/aRp2Jp/voOeAd4H2gKdpwRES6XvG2Km5+ahHvFpdzwoTB/O7i\nI/vlxe7iEU9SSHf3GyOPRESki63aWsm9b67hyXklpKYk8fNPHsanC0ZrDKED8SSFB83sauBZoLZ5\npruXRxaViMgBWF26i1tnL+H1ldtITUniwiNGcNMZUzg4R5dt60w8SaEO+CXwXcIb7YR/NegsIj3O\nu0VlXPPgPJIMvn7GZC6ZOYbBA/vvtYz2VTxJ4UaCE9i2RR2MiMiB+NvCDXzjiUWMysvgvitmMmZw\nZqJD6nXiSQpLgOqoAxER2V+1DY38+h8rmPVaEceOy+NPnztaRxbtp3iSQiOw0MxeZs8xBR2SKiIJ\n98Hmndzw2EI+2FzJpceO4fvnTSMtJT
nRYfVa8SSFp8OHiEiPsWZbFQ++s5YH315LTsYA7rmigFOn\nHpzosHq9eM5ovr87AhER6UxVbQMvL9/KE4UlvLqilJQk47wPjeCWcw7VYHIX6eh2nI+7+6fN7H3+\nfdRRC3c/PNLIRESALTtreHVFKS8u3cJrK0qpbWji4Jw0vnb6ZC6ZOZqhOsy0S3XUUvhq+Pfc7ghE\nRPqnxiansqaeXbUNVNU2snHHbopKqygq3cW8tRV8sDm4fcvw3HQumTmGs2YMo2BsHslJOgEtCh3d\njnNT+PRad/9W7Gtm9nPgW3u/S0Rk31z0x7dYsG77XvNz0lOYMTKXb581lRMn5zN1WLbORO4G8Qw0\nf4y9E8BZbcwTEdknWytrWLBuO+cePpwTJ+WTlZbC0Jw0xg/JIi8rVUkgAToaU/gv4Fpggpktinkp\nG3gz6sBEpO+bW1wBwBc/Op4jRg9KcDQCHbcUHgH+DvwMuDlmfqWueyQiXWFOcRmZqclMH5GT6FAk\n1O7tON19h7uvAW4BNrv7WmAccFl4jwURkQPybnE5Rx9yEAOS47kzsHSHeLbEU0CjmU0E7iZIDI9E\nGpWI9Hk7qutZvqWSY8bmJToUiRFPUmhy9wbgE8Bv3f1rwPB4Fm5mZ5rZcjNbZWY3d1DuIjNzMyuI\nL2wR6e0K15bjDjPHKSn0JPEkhXozu4TglpzPhvM6vampmSUDdxIcqTQNuMTMprVRLhu4Hng33qBF\npPebU1xOanKSBph7mHiSwpXA8cBP3L3YzMYBD8XxvpnAKncvcvc64DHggjbK/Qj4BVATZ8wi0ge8\nW1zOh0bnkj5AF6/rSdpNCmaWA+DuS939end/NJwuJr4xhZHA+pjpknBe7DqOBEa7+7OISL9RVdvA\n4g07NJ7QA3XUUnil+YmZvdTqtXiumtrWWSct11AysyTgN8BNnS7I7BozKzSzwtLS0jhWLSI92YJ1\n22loco0n9EAdJYXYH/XWWy6e0wxLgNEx06OAjTHT2cAM4BUzWwMcB8xua7DZ3We5e4G7F+Tn58ex\nahHpyeYUl5FkcPQhByU6FGmlo6Tg7Txva7otc4FJZjbOzFKBi4HZLQsIzoMY4u5j3X0s8A5wvrsX\nxhe6iPRWc9aUM31ELtnpnR6zIt2sozOah5rZjQStgubnhNOd7q67e4OZXQe8ACQD97j7EjO7DSh0\n99kdL0FE+qLFG3ZQuKaCqz4yLtGhSBs6Sgp/Jujiaf0c4K54Fu7uzwHPtZr3/XbKnhzPMkWk96qq\nbeD6RxcweGAqXzppQqLDkTZ0dOnsH3ZnICLS9/3wmSUUl1Xx8BePJS8rNdHhSBt0wRER6RbPvLeR\nxwtLuPbkCZwwYUiiw5F2xHM/BRGR/bartoG7Xi/iT68WccToQdxw+uREhyQdUFIQkUi4Ow+9u47f\nvriCsqo6zj5sGD84b7quiNrDdZoUzOxg4KfACHc/K7x+0fHufnfk0YlIr/XQu+v43tOLOW58Hnef\ndaiucdRLxJOy7yM4rHREOL0CuCGqgESk93tv/XZ+9MxSTp06lEe+eJwSQi8ST1IY4u6PA00QnH8A\nNEYalYj0WhVVdVz78Hzys9P4709/iKQk3We5N4lnTKHKzAYTnsVsZscBOyKNSkR6pd11jdz4+EJK\nK2t58r+OZ1CmDjvtbeJJCjcSXJ5igpm9SXA280WRRiUivUp1XQMPv7OOP71WxLZdtfz4whkcPkpd\nRr1Rp0nB3eeb2UnAFIJLXCx39/rIIxORHsvdea9kB4VrylmwfjtvrdpGRXU9H5k4hK+efpQuid2L\nxXP00edbzTrKzHD3ByKKSUR6uF//YwV3vLwKgJGDMvjopHwuP+EQjj5EyaC3i6f76JiY5+nAacB8\nQElBpB96dM467nh5FZ86ehTf+PgUhuakJzok6ULxdB99JXbazHKBByOLSER6rJeXb+WWpxdz0uR8\nfv
aJw0jRiWh9zv6c0VwNTOrqQESkZ3v5g61c98h8phyczZ2XHqWE0EfFM6bwDP++qU4SMA14PMqg\nRKTnWFdWzW3PLuWfy7YwIT+Le688hoFpukJOXxXPlv1VzPMGYK27l0QUj4j0ABu27+aNlaW8tnIb\nLy7dQkqScfNZU7nqw+NITVELoS+LZ0zh1e4IREQSq7Kmnmfe28Rf5q7jvZLg/NSh2Wl88qiRXH/a\nJIbnZiQ4QukO7SYFM6uk7XsxG+DunhNZVCISGXdne3U9m3fWUFRaxbJNO1m6aSdvry5jd30jUw7O\n5jtnT+XkKUOZNHQgZrpMRX/S0Z3Xstt7TUR6n7qGJm564j3+sWQztQ1NLfOTk4yJ+QO58MiRfLpg\nFEeMHqRE0I/FPVpkZkMJzlMAwN3XRRKRiHS5hsYmvvrYAv6+eDOXzBzDxKEDGZ6bzpi8TCYOHUj6\ngOREhyg9RDxHH50P/Jrg0tlbgUOAZcD0aEMTka7Q1OR888lF/H3xZr537jS+8JFxiQ5JerB4DiP4\nEXAcsMLdxxGc0fxmpFGJSJfYsH03N/xlIX9dsIGbPjZZCUE6FU/3Ub27l5lZkpklufvLZvbzyCMT\nkf1WvK2KP7yyir/O3wDADadP4rpTJyY4KukN4kkK281sIPAa8LCZbSU4X0FEehB3p3BtBX9+rYgX\nl20hNTmJS48dwzUnTWDkIB1OKvGJJylcANQAXwMuBXKB26IMSkTit3lHDf/3/ib+tnADi0p2kJsx\ngGtPnsDlJ4xlaLYuVif7pqPzFO4AHnH3t2Jm3x99SCLSHndn884a3lu/nQXrt1O4poL56ypwh2nD\nc7jtgulcdPQoMlN1GQrZPx19c1YCvzaz4cBfgEfdfWH3hCUiABu37+aV5aW8uXobRaVVrC2rorou\nuEX6gGRj2ohcbjhtMud+aDgT8gcmOFrpCzo6ee13wO/M7BDgYuBeM0sHHgUec/cV3RSjSL/Q3Aoo\nXFNB4Zpy3i0u54PNlQCMyE1n6vAcjh8/mLFDMjlsZC7TRuSQlqLzC6RrmXtbV7Jop7DZkcA9wOHu\nnpBvY0FBgRcWFiZi1SL7zd2pqmuktLKW0spatlbWsHlH8Ni4Yzdry6pZW1bNrtrgGI7M1GSOGnMQ\nJ03O55Sp+UzI1+Um5MCY2Tx3L+isXDwnrw0AziRoLZwGvAr88IAjFOmDZr22mr/O30BDk9PY5NTU\nN1JV20BVXSONTXvvgKUPSGJEbgZjBmdyzNg8xg3J4qgxB3Ho8Gzdr0ASoqOB5o8BlwDnAHOAx4Br\n3L2qm2IT6VXcnbteLyZ9QDIzRuaQkpREakoSA9NSyEpLJid9APnZaS2P4TkZ5GSkqAUgPUpHLYXv\nAI8AX3f38m6KR6TXWltWzdb8d16TAAAQVklEQVTKWn584QwuO+6QRIcjsl86Gmg+pTsDEent5qwJ\n9p2OHZeX4EhE9p86LUW6yJzicg7KHMDEoTo0VHovJQWRLjKnuJxjxuZpjEB6NSUFkS6wacdu1pVX\nM1NdR9LLKSmIdIE5xc3jCYMTHInIgVFSEOkCc9eUk5WazKHDdRdb6d2UFES6wJzico4em6cTzqTX\ni/QbbGZnmtlyM1tlZje38fqNZrbUzBaZ2UvhdZZEepXyqjpWbNmlQ1GlT4gsKZhZMnAncBYwDbjE\nzKa1KrYAKHD3w4EngV9EFY9IVOaG5ydokFn6gihbCjOBVe5e5O51BJfJuCC2gLu/7O7V4eQ7wKgI\n4xGJxNziclJTkjh8VG6iQxE5YFEmhZHA+pjpknBee74A/L2tF8zsGjMrNLPC0tLSLgxR5MDU1Dfy\n4rItHDF6kC5jLX1ClEmhrTN42rxOt5ldBhQAv2zrdXef5e4F7l6Qn5/fhSGKHJif/N8y1pZV8+VT\nJiY6FJEuEeU9+0qA0THTo4CNrQuZ2enAd4GT3L02wnhEutQLSzbz
4Dtr+eJHxnHSZO2sSN8QZUth\nLjDJzMaZWSrB/RhmxxYIb9rzJ+B8d98aYSwiXWrTjt1866lFTB+RwzfOnJLocES6TGRJwd0bgOuA\nF4BlwOPuvsTMbjOz88NivwQGAk+Y2UIzm93O4kR6jPdLdnD1A4XUNTTxv5ccqbEE6VOi7D7C3Z8D\nnms17/sxz0+Pcv0iXWltWRW/+scKnnlvIwdlDuC/P30E4/N1RVTpWyJNCiJ9RVHpLs6/400am5yv\nnDqRq08cT076gESHJdLllBREOrG7rpH/emg+qSlJ/O3LH2Z0XmaiQxKJjJKCSAfcne8+/T4rtlby\nwFUzlRCkz9PVu0Q68Je56/nr/A3ccNpkPjpJh51K36eWgkgbVm2t5H//tYpn3tvIiZPz+cqpOjlN\n+gclBRGgsclZsaWS+esqeGPlNp5fspmMAclcfeJ4rjtlIklJusWm9A9KCtLvLVhXwRfuL6S8qg6A\nwVmpfOmkCVz90fHkZaUmODqR7qWkIP3a2rIqvnB/IQPTUvj+udM4cswgxuRlYqaWgfRPSgrSb1VU\n1XHFvXNpcue+K4/RiWgiKClIP7Vjdz1XP1DIhu27eeSLxyohiISUFKRfaWpynppfws+f/4Dyqjr+\n55IjKRirO6aJNFNSkH6hpKKaN1Zu4y+F61mwbjtHjRnEfVfOZMZI3S1NJJaSgvRZ68ureWJeCc++\nt5GibVUAjMhN51ef+hCfOHKkDjMVaYOSgvQZ5VV1LNu0k6Ubd/LaylLeWLUNgA9PGMKlxx3CRycN\nYdLQgTqySKQDSgrSa/39/U08Onc9m3fsZtOOGiprGlpeG3VQBtefOolPHzOakYMyEhilSO+ipCC9\n0rOLNnL9owsYk5fJlGHZHD9+MKMOyuTQ4TkcOjybwQPTEh2iSK+kpCC9zj+XbuGGxxZScEge9181\nk4xU3flMpKsoKUiv4e68sGQz1z+2kGkjcrj7igIlBJEupqQgPV5Tk/Pisi3c+fIqFpXsYOqwbB64\naibZuvOZSJdTUpAeq6a+kacXbOCuN4pZtXUXY/Iy+dknDuMTR40kLUUtBJEoKClIj7NqayWzF27k\nkTnr2LarjmnDc/jdxUdwzmHDSUnWfaFEoqSkIAm3o7qehSXbmb+2gheWbOaDzZWYwUmT87n6o+M5\nYcJgnVsg0k2UFKRbrS+v5vWV21ixpZI1ZVUUb6tibVk1AGZw1JiDuPW8aZx92HCG5qQnOFqR/kdJ\nQSJVtquWwrUVzCku59UVpazauguAzNRkDhmcxfQROXy6YDRHjB7E4aNyNXgskmBKCnJAGpuc8qo6\nNu+oYfPOGjZu383asmrWllWxunQXa8JWQGpKEseOy+OzM8dw8pR8xg3JUpeQSA+kpCDtWlSynScK\nS2hoaqK+0alraKKqtoGqugZ27m6gdFct5VV1NDb5Hu9LH5DE2MFZTB2Ww8Uzx3DM2IOYMTJXRwyJ\n9AJKCtKmrTtruOLeueyuayQ7PYWUJGNAShJZqSkMTEtheG46h4/KJT87jfzsNA7OSWd4bjrDctPJ\nH5imVoBIL6WkIHtpanJufPw9qusaePYrH2Hi0OxEhyQi3UQHfcteZr1exBurtvGD86YrIYj0M0oK\nsofCNeX86oXlnDVjGBcfMzrR4YhIN1P3kQAwb20Ff3p1NS8u28KI3Axu/8ThGhcQ6YeUFPohd6dw\nbQXvFpWxbHMlyzbupGhbFbkZA/jKKRO5/ISx5GbqfAGR/khJoR+pqW9k9nsbuffNNSzbtBOA0XkZ\nTB2Ww+ePP4RPFYwmK01fCZH+TL8AfUxTkzNvXQUlFdVUVNWzvbqO4rJqVm6ppGhbFXUNTUw5OJvb\nP3EY5xw+XGcQi8gelBT6AHdnbVk1f51fwlPzN7Bh++6W18xg5KAMJg0dyImT8zl5cj7H6wJzItIO\nJYVeorHJ2bKzhpKK3ZRW1rJt
Vy2bd9awdONOFm/YQVlVHWbw0Un53HzWVKaPyGFQZiq5GQNITlIC\nEJH4KCn0ADX1jbyyvJQF6yrYXd/I7rpGdtc3srOmgcqaeiqq6tiwfTf1jXteTiI5yZg0dCCnTB3K\nYSNz+di0gxkxKCNBtRCRvkBJIQGqahtYXbqL1aW7eHt1GX9fvJnKmgZSk5PISksmfUAyGQOSyc4Y\nQE56CiMHZXDmjOGMyctk1EEZ5GenMWRgGnlZqWoFiEiXijQpmNmZwO+AZOAud7+91etpwAPA0UAZ\n8Bl3XxNlTFFzd3bVNrC9up6tlTVs3F7Dph27WVdeTfG2KopKq9i0o6al/MC0FM6YfjAXHDGSD08Y\nrDuLiUhCRZYUzCwZuBP4GFACzDWz2e6+NKbYF4AKd59oZhcDPwc+E1VM8XB3ahuaqKlvpKa+iaq6\nBqprG9lV28CO3XWUV9VTUV1HaWUtWytrKK2sZefu4MqhVbUNVNY00NDqqqEAOekpjM8fyPHjBzM+\nP4uJQwcycehADhmcxQAlAhHpIaJsKcwEVrl7EYCZPQZcAMQmhQuAW8PnTwJ3mJm5+96/qgfo8bnr\nmfV6EU3u4NDkTkOT09DoNDQ1UdsQPOoamuJaXnZaCvk5aQzNTmPskEyyUlPISE0mN2MAgzIHMCgz\nlfyBaQwflM6IQRnk6NBPEekFokwKI4H1MdMlwLHtlXH3BjPbAQwGtsUWMrNrgGsAxowZs1/BDMoc\nwJSDs8EgyQwDUpKNlCQjJTmJtJQk0lKSSUtJIn1AMukDgr+ZqclkpaaQmZbMoIxU8rJSGZQ5gPQB\nujeAiPQ9USaFtkZAW7cA4imDu88CZgEUFBTsVyvijOnDOGP6sP15q4hIvxFlZ3YJEHuZzVHAxvbK\nmFkKkAuURxiTiIh0IMqkMBeYZGbjzCwVuBiY3arMbODy8PlFwL+iGE8QEZH4RNZ9FI4RXAe8QHBI\n6j3uvsTMbgMK3X02cDfwoJmtImghXBxVPCIi0rlIz1Nw9+eA51rN+37M8xrgU1HGICIi8dMB8iIi\n0kJJQUREWigpiIhICyUFERFpYb3tCFAzKwXW7ufbh9DqbOl+oj/Wuz/WGfpnvftjnWHf632Iu+d3\nVqjXJYUDYWaF7l6Q6Di6W3+sd3+sM/TPevfHOkN09Vb3kYiItFBSEBGRFv0tKcxKdAAJ0h/r3R/r\nDP2z3v2xzhBRvfvVmIKIiHSsv7UURESkA0oKIiLSot8kBTM708yWm9kqM7s50fFEwcxGm9nLZrbM\nzJaY2VfD+Xlm9qKZrQz/HpToWLuamSWb2QIzezacHmdm74Z1/kt4+fY+xcwGmdmTZvZBuM2P7yfb\n+mvh93uxmT1qZul9bXub2T1mttXMFsfMa3PbWuB/wt+2RWZ21IGsu18kBTNLBu4EzgKmAZeY2bTE\nRhWJBuAmdz8UOA74cljPm4GX3H0S8FI43dd8FVgWM/1z4DdhnSuALyQkqmj9Dnje3acCHyKof5/e\n1mY2ErgeKHD3GQSX5b+Yvre97wPObDWvvW17FjApfFwD/OFAVtwvkgIwE1jl7kXuXgc8BlyQ4Ji6\nnLtvcvf54fNKgh+JkQR1vT8sdj9wYWIijIaZjQLOAe4Kpw04FXgyLNIX65wDnEhwTxLcvc7dt9PH\nt3UoBcgI79aYCWyij21vd3+Nve9C2d62vQB4wAPvAIPMbPj+rru/JIWRwPqY6ZJwXp9lZmOBI4F3\ngYPdfRMEiQMYmrjIIvFb4JtAUzg9GNju7g3hdF/c3uOBUuDesNvsLjPLoo9va3ffAPwKWEeQDHYA\n8+j72xva37Zd+vvWX5KCtTGvzx6La2YDgaeAG9x9Z6LjiZKZnQtsdfd5sbPbKNrXtncKcBTwB3c/\nEqiij3UVtSXsR78AGAeMALIIuk9a62vbuyNd+n3vL0mhBBgdMz0K2JigWCJlZgMIEsLD7v7XcP
aW\n5uZk+HdrouKLwIeB881sDUG34KkELYdBYfcC9M3tXQKUuPu74fSTBEmiL29rgNOBYncvdfd64K/A\nCfT97Q3tb9su/X3rL0lhLjApPEIhlWBganaCY+pyYV/63cAyd//vmJdmA5eHzy8H/tbdsUXF3b/t\n7qPcfSzBdv2Xu18KvAxcFBbrU3UGcPfNwHozmxLOOg1YSh/e1qF1wHFmlhl+35vr3ae3d6i9bTsb\n+Hx4FNJxwI7mbqb90W/OaDazswn2IJOBe9z9JwkOqcuZ2UeA14H3+Xf/+ncIxhUeB8YQ/FN9yt1b\nD2L1emZ2MvB1dz/XzMYTtBzygAXAZe5em8j4upqZHUEwuJ4KFAFXEuzo9eltbWY/BD5DcLTdAuCL\nBH3ofWZ7m9mjwMkEl8feAvwAeJo2tm2YHO8gOFqpGrjS3Qv3e939JSmIiEjn+kv3kYiIxEFJQURE\nWigpiIhICyUFERFpoaQgIiItlBREQmY2zMweM7PVZrbUzJ4zs8n7sZy7mi+4aGbfifM9a8xsyL6u\nS6Sr6ZBUEVpO/HsLuN/d/xjOOwLIdvfXD2C5u9x9YBzl1hBc+XPb/q5LpCuopSASOAWob04IAO6+\nEFhgZi+Z2Xwze9/MLoDggoPhfQzuD69h/6SZZYavvWJmBWZ2O8HVPBea2cPha0+b2bzwfgDXJKCe\nIh1SUhAJzCC42mZrNcB/uPtRBInj12GrAmAKMMvdDwd2AtfGvtHdbwZ2u/sR4aU3AK5y96OBAuB6\nMxscQV1E9puSgkjHDPipmS0C/klwOYWDw9fWu/ub4fOHgI/Esbzrzew94B2Ci5hN6uJ4RQ5ISudF\nRPqFJfz7gmqxLgXygaPdvT7s+08PX2s9INfhAF14babTgePdvdrMXolZlkiPoJaCSOBfQJqZXd08\nw8yOAQ4huF9DvZmdEk43G2Nmx4fPLwHeaGO59eHlzAFygYowIUwluGWqSI+ipCACeHAY3n8AHwsP\nSV0C3Ao8BxSYWSFBq+GDmLctAy4Pu5byaPveuLOAReFA8/NASlj+RwRdSCI9ig5JFdkP4e1Onw1v\nHi/SZ6ilICIiLdRSEBGRFmopiIhICyUFERFpoaQgIiItlBRERKSFkoKIiLT4/4EmbUnRp+/0AAAA\nAElFTkSuQmCC\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x1d9de122198>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plotting Value Estimates vs State (Capital)\n",
"\n",
"# x axis values\n",
"x = range(100)\n",
"# corresponding y axis values\n",
"y = v[:100]\n",
" \n",
"# plotting the points \n",
"plt.plot(x, y)\n",
" \n",
"# naming the x axis\n",
"plt.xlabel('Capital')\n",
"# naming the y axis\n",
"plt.ylabel('Value Estimates')\n",
" \n",
"# giving a title to the graph\n",
"plt.title('Value Estimates vs State (Capital)')\n",
" \n",
"# function to show the plot\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEWCAYAAABrDZDcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAGoxJREFUeJzt3Xu8HGV9x/HP13AXQghJMJDEgA0X\naw2XIwWxlIu0SCmJBSkUMW3B9KJyEbWovFpQq9AqUK9tCmKK3CQg1xZJYyKlhUAihIsgCYgQE5MA\nCQEBTeDXP+Y5sBzO7pk9Z2f27M73/Xrta3dm5/KbmWR/53memedRRGBmZtX1pnYHYGZm7eVEYGZW\ncU4EZmYV50RgZlZxTgRmZhXnRGBmVnFOBDZsSTpB0q1DWH+BpJNbGVMT+x5S7H229bik97ZiWzXb\nPEjS8prpByUd1Mp9WOdwIrAhk/RnkhZJel7SSkn/Jek9Q91uRFwWEX9Qs5+Q9FtD3W6rSJqcYnq+\n5rUE3hh7gTF8R9Jv0r6fkTRX0u7NbicifjsiFhQQonUAJwIbEkkfBy4EvgjsAEwCvglMa2dcJRsV\nEVun19Q27P+fImJrYAKwGvhOG2KwDuZEYIMmaVvgc8BHIuLaiPhVRGyIiBsj4pNpmX0l3SFpXSot\nfF3SZjXbCEmnSHpM0lOS/lnSm9J3fy7p9vT5trTKkvTX759K2k7STZLWSFqbPk/IEfeOkl6UNLpm\n3l5p/5tK+i1JP5L0bJp31SDOzaux1xznX0tammL9hiSl794m6YeSnk77u0zSqGb3GREvAJcD70jb\n3VzShZJWpNeFkjavE++r1U+SRkj6jKRHJT0nabGkiSnmr/RZ70ZJpzUbqw0vTgQ2FPsDWwDfb7DM\ny8DpwJi0/KHA3/ZZ5v1AD7A3WUniL/tuJCIOTB+npr+8ryL793sJ8FayksiLwNcHCjoiVgB3AEfX\nzP4zYE5EbAA+D9wKbEf2V/bXBtpmTkcC7wKmAscCf5jmC/gSsCOwBzAROLvZjUvaGjgBuCfN+iyw\nH7Bn2ue+wFk5NvVx4HjgCGAk2fV4AZgNHF+TqMeQXc8rmo3VhhcnAhuK7YGnImJjvQUiYnFE3BkR\nGyPiceDfgN/vs9h5EfFMRDxBVs10fJ6dR8TTEXFNRLwQEc8B/9jPtuu5vHc/6S/z49I8gA1kyWXH\niHgpIm7vfxOveiqVeNZJ+kSD5c6NiHXpOOeT/UATEcsiYm5E/Doi1gDnN3EcAJ+QtA5YBmwN/Hma\nfwLwuYhYnbZ7DnBiju2dDJwVET+NzJJ0ru8CniX78YfsnC2IiFVNxGrDkBOBDcXTwBhJm9RbQNKu\nqcrml5LWk7UljOmz2JM1n39O9pfxgCRtJenfJP08bfs2YJSkETlWnwPsL2lH4EAggP9J332K7K/0\nu9LdNG8oofQxJiJGpdeXGyz3y5rPL5D9aCNpnKQrJf0iHcd3eeM5auTLad9viYijIuLRNH9HsvPZ\nK++5nQg8Wue72cAH0+cPApc2EacNU04ENhR3AC8B0xss8y3gYWBKRIwEPkP2I1trYs3nScCKnPs/\nA9gN+N207d7qo77bf4OIWEdW/XMsWbXQFZG64o2IX0bEhyNiR+CvgG8WfLfSl8gS0TvTcXyQHMeQ\nwwqykk2vvOf2SeBtdb77LjBN0lSyaqzrhhShDQtOBDZoEfEs8PfANyRNT3+hbyrpfZL+KS22DbAe\neD7d1vg3/Wzqk6nhdyJwKlCvcXYVsEvN9DZk7QLrUsPvPzR5CJcDHyJrK+itFkLSB2oandeS/Ui/\n3OS2m7EN8DzZcewEfLJF270COEvS2FSf//dkP+QDuQj4vKQpyrxT0vYAEbEcuJusJHBNRLzYolit\njZwIbEgi4nyyxsWzgDVkf01+lNf+UvwE2V/c
zwH/Tv8/8tcDi4F7gZuBi+vs7mxgdqqLP5asPWFL\n4CngTuCWJsO/AZgCrIqIJTXz3wUslPR8WubUiPhZk9tuxjlkDeXPkh3/tS3a7heARcB9wP3Aj9O8\ngZwPfI+sxLSe7HpsWfP9bOB3cLVQ15AHprF2khRk1UbL2h2L5SPpQLKSxeSIeKXd8djQuURgZrlJ\n2pSs+u4iJ4Hu4URgZrlI2gNYB4wnq5azLuGqITOzinOJwMys4uo+CDScjBkzJiZPntzuMMzMOsri\nxYufioixAy3XEYlg8uTJLFq0qN1hmJl1FEk/H3gpVw2ZmVWeE4GZWcU5EZiZVZwTgZlZxTkRmJlV\nnBOBmVnFFXr7qKTHyXqdfBnYGBE9qbvgq4DJwOPAsRGxtsg4zMysvjJKBAdHxJ4R0ZOmzwTmRcQU\nYF6aNjOzNmlH1dA0sv7MSe+NRrcyM7OCFZ0IArhV0mJJM9O8HSJiJUB6H9ffipJmSlokadGaNWsK\nDtNs8C6Y+wgXzH2k3WGYDVrRXUwcEBErJI0D5kp6OO+KETELmAXQ09PjLlLNzApSaIkgIlak99XA\n94F9gVWSxgOk99VFxmBmZo0VlggkvVnSNr2fgT8AHiAbA3ZGWmwG2Xi1ZmbWJkVWDe0AfF9S734u\nj4hbJN0NfE/SScATwAcKjMGs5WrbA04/bNc2RmLWGoUlgoh4DJjaz/yngUOL2q+ZmTXHTxabmVWc\nE4GZWcV1xAhlZu3m5wSsm7lEYGZWcU4EZmYV50RgZlZxbiMwq8PtAlYVLhGYmVWcE4GZWcU5EZiZ\nVZzbCMxquF3AqsglAjOzinMiMDOrOCcCM7OKcyIwM6s4JwIzs4pzIjAzqzgnAjOzinMiMDOrOD9Q\nZpXkAejNXuMSgZlZxTkRmJlVnBOBmVnFORGYmVWcE4GZWcU5EZiZVZwTgZlZxfk5Autqfl7AbGAu\nEZiZVZwTgZlZxTkRmJlVnBOBmVnFFZ4IJI2QdI+km9L0zpIWSloq6SpJmxUdg5mZ1VdGieBU4KGa\n6fOACyJiCrAWOKmEGMzMrI5CE4GkCcAfARelaQGHAHPSIrOB6UXGYGZmjRVdIrgQ+BTwSpreHlgX\nERvT9HJgp/5WlDRT0iJJi9asWVNwmGZm1VVYIpB0JLA6IhbXzu5n0ehv/YiYFRE9EdEzduzYQmI0\nM7Ninyw+ADhK0hHAFsBIshLCKEmbpFLBBGBFgTGYmdkACisRRMSnI2JCREwGjgN+GBEnAPOBY9Ji\nM4Dri4rBzMwG1o7nCP4O+LikZWRtBhe3IQYzM0tK6XQuIhYAC9Lnx4B9y9ivmZkNzE8Wm5lVnBOB\nmVnFORFYR7lg7iOvG2PAzIbOicDMrOKcCMzMKs6JwMys4hrePippC+BI4PeAHYEXgQeAmyPiweLD\nMzOzotVNBJLOBv6Y7P7/hcBqsq4idgXOTUnijIi4r/gwzcysKI1KBHdHxNl1vjtf0jhgUutDMjOz\nMtVNBBFxc+20pDdHxK9qvl9NVkowM7MONmBjsaR3S/oJaZQxSVMlfbPwyMzMrBR57hq6APhD4GmA\niFgCHFhkUGZmVp5ct49GxJN9Zr1cQCxmZtYGeXoffVLSu4GQtBlwCq8fjN7MzDpYnhLBXwMfIRtb\neDmwZ5o2M7MukKdE8EoaWexVknYmtRmYmVlny1MiuFHSyN4JSXsANxYXkpmZlSlPIvgiWTLYWtI+\nwBzgg8WGZWZmZRmwaigibpa0KXArsA0wPSKWFh6ZmZmVolFfQ18DombWSOAx4GOSiIhTig7OzMyK\n16hEsKjP9OIiAzEzs/Zo1NfQ7DIDMTOz9hiwjUDSFOBLwNvJuqEGICJ2KTAuMzMrSZ67hi4BvgVs\nBA4G/gO4tMigzMysPHkSwZYRMQ9QRPw8jVFwSLFhmZlZWfI8WfySpDcBSyV9FPgFMK7YsMzMrCx5\nSgSnAVuR
dTa3D9nDZB8qMigzMytPnkQwOSKej4jlEfEXEXE0HqLSzKxr5EkEn845z8zMOlCjJ4vf\nBxwB7CTpqzVfjSS7g8jMzLpAo8biFWRPFx/F658qfg44vcigzMysPI2eLF4CLJF0eURsAJC0HTAx\nItaWFaCZmRUrTxvBXEkjJY0GlgCXSDp/oJUkbSHpLklLJD0o6Zw0f2dJCyUtlXRVGv7SzMzaJE8i\n2DYi1gN/AlwSEfsA782x3q+BQyJiKtnwlodL2g84D7ggIqYAa4GTBhe6mZm1Qp5EsImk8cCxwE15\nNxyZ59PkpukVZE8lz0nzZwPT84drZmatlicRfA74AbAsIu6WtAuQa2AaSSMk3QusBuYCjwLrIqL3\nrqPlwE7Nh21mZq2SZ4Syq4Gra6YfA47Os/GIeBnYU9Io4PvAHv0t1t+6kmYCMwEmTfLza2ZmRalb\nIpB0Vmogrvf9IZKOzLOTiFgHLAD2A0ZJ6k1AE8huU+1vnVkR0RMRPWPHjs2zGzMzG4RGJYL7yQat\nfwn4MbCGbDyCKWSNv/9NNrB9vySNBTZExDpJW5I1MJ8HzAeOAa4EZgDXt+A4zMxskBo9R3A9cH0a\nmOYAYDywHvguMDMiXhxg2+OB2ZJGkJU8vhcRN0n6CXClpC8A9wAXt+A4zMxskPK0ESwlZ+Nwn/Xu\nA/bqZ/5jwL7Nbs/MzIqR564hMzPrYk4EZmYVN2AiaHTnkJmZdb48JYKFkq6WdIQkFR6RmZmVKk8i\n2BWYBZwILJP0RUm7FhuWmZmVZcBEkPoMmhsRxwMnk937f5ekH0nav/AIzcysUAPePippe7IB608E\nVgEfA24ge6jsamDnIgM0M7NiDZgIgDuAS4HpEbG8Zv4iSf9aTFhmZlaWPIlgt4jot2O4iDivxfGY\nmVnJ8jQW35p6DwWy4Sol/aDAmMzMrER5EsHY1HsoAGm84nHFhWRmZmXKkwhelvTqgACS3kqdMQTM\nzKzz5Gkj+Cxwu6QfpekDSQPGmJlZ58vT++gtkvYmG1RGwOkR8VThkZmZWSkajVC2e3rfG5hENpLY\nL4BJaZ6ZmXWBRiWCM4APA1/p57sADikkImurC+Y+8urn0w9zTyLWer3/xvzva/hoNELZh9P7weWF\nY2ZmZaubCCT9SaMVI+La1odjZmZla1Q19McNvgvAicDMrAs0qhr6izIDsfapbRcwK4LbBYa3PCOU\nbSvpfEmL0usrkrYtIzgzMytenieLvw08BxybXuuBS4oMyszMypPnyeK3RcTRNdPnSLq3qIDMzKxc\neUoEL0p6T++EpAOAF4sLyczMypSnRPA3wOzULiDgGbLhKq2DuYHYiuYG4s6Rp6+he4Gpkkam6fWF\nR2VmZqXJc9fQ9pK+CiwA5kv6lzSOsZmZdYE8bQRXAmuAo4Fj0uerigzKzMzKk6eNYHREfL5m+guS\nphcVkJl1LrcLdKY8JYL5ko6T9Kb0Oha4uejAzMysHHkSwV8BlwO/Tq8rgY9Lek6SG47NzDpcnruG\ntikjEDMza488bQTWwTzQjBXN7QKdL0/V0KBImihpvqSHJD0o6dQ0f7SkuZKWpvftiorBzMwGVlgi\nADYCZ0TEHmQD339E0tuBM4F5ETEFmJemzcysTRqNUDa60YoR8cwA368EVqbPz0l6CNgJmAYclBab\nTfag2t/ljtjMzFqqURvBYrKRyNTPdwHskncnkiYDewELgR1SkiAiVkoaV2edmcBMgEmTJuXdleE6\nWzNrTqMRynZuxQ4kbQ1cA5wWEeul/vJKv/ufBcwC6OnpiVbEYmZmb5TrrqHUoDsF2KJ3XkTclmO9\nTcmSwGU1g92vkjQ+lQbGA6ubD9vMzFolT6dzJwO3AT8AzknvZ+dYT8DFwEMRcX7NVzfwWjfWM4Dr\nmwvZzMxaKU+J4FTgXcCdEXGwpN3JEsJADgBOBO6vGdHsM8C5wPcknQQ8AX
yg+bDNrEx+HqW75UkE\nL0XES5KQtHlEPCxpt4FWiojb6b+hGeDQpqI0M7PC5EkEyyWNAq4D5kpaC6woNiwzMytLnr6G3p8+\nni1pPrAtcEuhUZmZWWny3jU0AtgB+Fma9Ray+n1rIz8vYGatMGAikPQx4B+AVcAraXYA7ywwLjMz\nK0neu4Z2i4iniw7GzMzKl6fTuSeBZ4sOxMzM2iNPieAxYIGkm8lGKAOgz0NiVhK3C1iR/LxANeVJ\nBE+k12bpZWZmXSTP7aN5niI2M7MO1Wg8ggsj4jRJN5LdJfQ6EXFUoZGZmVkpGpUILk3vXy4jEDMz\na49GiWANQET8qKRYrA43EJtZkRrdPnpd7wdJ15QQi5mZtUGjRFDbc2juYSnNzKyzNEoEUeezmZl1\nkUZtBFMlrScrGWyZPpOmIyJGFh5dhbldwMzK0mjw+hFlBmJmZu2Rp68hMzPrYk4EZmYV50RgZlZx\nTgRmZhXnRGBmVnFOBGZmFZdr8Hoz60weaMbycInAzKzinAjMzCrOicDMrOLcRtAGtf0IuU+hgfkc\nNae2XcAGVu98Venfm0sEZmYV50RgZlZxTgRmZhXnNoKS5Knndl34a3y+mpOnXcDPFLzG5+v1CisR\nSPq2pNWSHqiZN1rSXElL0/t2Re3fzMzyKbJq6DvA4X3mnQnMi4gpwLw0bWZmbVRYIoiI24Bn+sye\nBsxOn2cD04vav5mZ5VN2G8EOEbESICJWShpXb0FJM4GZAJMmTSopvNZyHXZzhnK+qniuW3G+Brt+\nJxrq8xXd/G9s2N41FBGzIqInInrGjh3b7nDMzLpW2YlglaTxAOl9dcn7NzOzPspOBDcAM9LnGcD1\nJe/fzMz6KPL20SuAO4DdJC2XdBJwLnCYpKXAYWnazMzaqLDG4og4vs5Xhxa1z3arYgPcUBXRANfN\nna75fDWnqGPrtobjYdtYbGZm5XAiMDOrOCcCM7OKc6dzQ9TN9atF6bb61aL5fDWn7P+T3XB9XCIw\nM6s4JwIzs4pzIjAzqzi3EQyC2wWaNxzqUTvpuvl8NWe4xDocrttguERgZlZxTgRmZhXnRGBmVnFu\nI8hpuNRB1jMc6yaHY0zDmc9Xc/x/snVcIjAzqzgnAjOzinMiMDOrOLcRNDDc6yCHo06qF+3Vzuvc\niecL2he3z1cxXCIwM6s4JwIzs4pzIjAzqzi3EfThdoHmeJzm5g33+uLhptvO13A8HpcIzMwqzonA\nzKzinAjMzCrObQRUp12gVcdZlfPVSsOxXng4q8r5Gi7H6RKBmVnFORGYmVWcE4GZWcU5EZiZVVxl\nG4vd4Nkcn6/mDZeGwE5R9fPVzuN3icDMrOKcCMzMKs6JwMys4irVRuB67ub4fDWv6vXczfL56l/Z\n56UtJQJJh0v6qaRlks5sRwxmZpYpPRFIGgF8A3gf8HbgeElvLzsOMzPLtKNEsC+wLCIei4jfAFcC\n09oQh5mZAYqIcncoHQMcHhEnp+kTgd+NiI/2WW4mMDNN7gb8dAi7HQM8NYT1O5GPuRp8zNUw2GN+\na0SMHWihdjQWq595b8hGETELmNWSHUqLIqKnFdvqFD7mavAxV0PRx9yOqqHlwMSa6QnAijbEYWZm\ntCcR3A1MkbSzpM2A44Ab2hCHmZnRhqqhiNgo6aPAD4ARwLcj4sGCd9uSKqYO42OuBh9zNRR6zKU3\nFpuZ2fDiLibMzCrOicDMrOK6PhFUoTsLSRMlzZf0kKQHJZ2a5o+WNFfS0vS+XbtjbSVJIyTdI+mm\nNL2zpIXpeK9KNyN0FUmjJM2R9HC63vtX4Dqfnv5dPyDpCklbdNu1lvRtSaslPVAzr9/rqsxX02/a\nfZL2Hur+uzoRVKg7i43AGRGxB7Af8JF0nGcC8yJiCjAvTXeTU4GHaqbPAy5Ix7sWOKktURXrX4Bb\nImJ3YCrZ8XftdZa0E3AK0BMR7yC7we
Q4uu9afwc4vM+8etf1fcCU9JoJfGuoO+/qREBFurOIiJUR\n8eP0+TmyH4edyI51dlpsNjC9PRG2nqQJwB8BF6VpAYcAc9IiXXW8AJJGAgcCFwNExG8iYh1dfJ2T\nTYAtJW0CbAWspMuudUTcBjzTZ3a96zoN+I/I3AmMkjR+KPvv9kSwE/BkzfTyNK9rSZoM7AUsBHaI\niJWQJQtgXPsia7kLgU8Br6Tp7YF1EbExTXfjtd4FWANckqrELpL0Zrr4OkfEL4AvA0+QJYBngcV0\n/7WG+te15b9r3Z4IcnVn0S0kbQ1cA5wWEevbHU9RJB0JrI6IxbWz+1m02671JsDewLciYi/gV3RR\nNVB/Ur34NGBnYEfgzWRVI31127VupOX/1rs9EVSmOwtJm5Ilgcsi4to0e1VvkTG9r25XfC12AHCU\npMfJqvsOISshjErVB9Cd13o5sDwiFqbpOWSJoVuvM8B7gZ9FxJqI2ABcC7yb7r/WUP+6tvx3rdsT\nQSW6s0j14xcDD0XE+TVf3QDMSJ9nANeXHVsRIuLTETEhIiaTXdMfRsQJwHzgmLRY1xxvr4j4JfCk\npN3SrEOBn9Cl1zl5AthP0lbp33nvMXf1tU7qXdcbgA+lu4f2A57trUIatIjo6hdwBPAI8Cjw2XbH\nU9AxvoesaHgfcG96HUFWbz4PWJreR7c71gKO/SDgpvR5F+AuYBlwNbB5u+Mr4Hj3BBala30dsF23\nX2fgHOBh4AHgUmDzbrvWwBVkbSAbyP7iP6nedSWrGvpG+k27n+yOqiHt311MmJlVXLdXDZmZ2QCc\nCMzMKs6JwMys4pwIzMwqzonAzKzinAis0iS9RdKVkh6V9BNJ/ylp10Fs56LeDg0lfSbnOo9LGtPs\nvsxazbePWmWlB5T+D5gdEf+a5u0JbBMR/zOE7T4fEVvnWO5xsnvAnxrsvsxawSUCq7KDgQ29SQAg\nIu4F7pE0T9KPJd0vaRpkHfqlcQBmp37g50jaKn23QFKPpHPJesq8V9Jl6bvrJC1OferPbMNxmjXk\nRGBV9g6yniz7egl4f0TsTZYsvpJKDwC7AbMi4p3AeuBva1eMiDOBFyNiz8i6vQD4y4jYB+gBTpG0\nfQHHYjZoTgRmbyTgi5LuA/6brIvfHdJ3T0bE/6bP3yXr3mMgp0haAtxJ1lnYlBbHazYkmwy8iFnX\nepDXOi6rdQIwFtgnIjakuvwt0nd9G9UaNrJJOoisB839I+IFSQtqtmU2LLhEYFX2Q2BzSR/unSHp\nXcBbycY72CDp4DTda5Kk/dPn44Hb+9nuhtQtOMC2wNqUBHYnG0rUbFhxIrDKiuyWufcDh6XbRx8E\nzgb+E+iRtIisdPBwzWoPATNStdFo+h8vdhZwX2osvgXYJC3/ebLqIbNhxbePmuWUhgG9KbJB1M26\nhksEZmYV5xKBmVnFuURgZlZxTgRmZhXnRGBmVnFOBGZmFedEYGZWcf8PGuEWwOrW2QgAAAAASUVO\nRK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x1d9e016fe48>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plotting Capital vs Final Policy\n",
"\n",
"# x axis values\n",
"x = range(100)\n",
"# corresponding y axis values\n",
"y = policy\n",
" \n",
"# plotting the bars\n",
"plt.bar(x, y, align='center', alpha=0.5)\n",
" \n",
"# naming the x axis\n",
"plt.xlabel('Capital')\n",
"# naming the y axis\n",
"plt.ylabel('Final policy (stake)')\n",
" \n",
"# giving a title to the graph\n",
"plt.title('Capital vs Final Policy')\n",
" \n",
"# function to show the plot\n",
"plt.show()\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: DP/Gamblers Problem.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### This is Example 4.3, the Gambler’s Problem, from Sutton and Barto's book.\n",
"\n",
"A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. \n",
"If the coin comes up heads, he wins as many dollars as he has staked on that flip; \n",
"if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal of $100, \n",
"or loses by running out of money. \n",
"\n",
"On each flip, the gambler must decide what portion of his capital to stake, in integer numbers of dollars. \n",
"This problem can be formulated as an undiscounted, episodic, finite MDP. \n",
"\n",
"The state is the gambler’s capital, s ∈ {1, 2, ..., 99}.\n",
"The actions are stakes, a ∈ {0, 1, ..., min(s, 100 − s)}. \n",
"The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.\n",
"\n",
"The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let p_h denote the probability of the coin coming up heads. If p_h is known, then the entire problem is known and it can be solved, for instance, by value iteration.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import sys\n",
"import matplotlib.pyplot as plt\n",
"if \"../\" not in sys.path:\n",
" sys.path.append(\"../\") "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"\n",
"### Exercise 4.9 (programming)\n",
"\n",
"Implement value iteration for the gambler’s problem and solve it for p_h = 0.25 and p_h = 0.55.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def value_iteration_for_gamblers(p_h, theta=0.0001, discount_factor=1.0):\n",
"    \"\"\"\n",
"    Args:\n",
"        p_h: Probability of the coin coming up heads\n",
"    \"\"\"\n",
"\n",
"    def one_step_lookahead(s, V, rewards):\n",
"        \"\"\"\n",
"        Helper function to calculate the value for all actions in a given state.\n",
"\n",
"        Args:\n",
"            s: The gambler’s capital. Integer.\n",
"            V: The vector that contains values at each state.\n",
"            rewards: The reward vector.\n",
"\n",
"        Returns:\n",
"            A vector containing the expected value of each action.\n",
"            Its length equals the number of actions.\n",
"        \"\"\"\n",
"\n",
"        # Implement!\n",
"\n",
"        return A\n",
"\n",
"    # Implement!\n",
"\n",
"    return policy, V"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"policy, v = value_iteration_for_gamblers(0.25)\n",
"\n",
"print(\"Optimized Policy:\")\n",
"print(policy)\n",
"print(\"\")\n",
"\n",
"print(\"Optimized Value Function:\")\n",
"print(v)\n",
"print(\"\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Plotting Final Policy (action stake) vs State (Capital)\n",
"\n",
"# Implement!"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Plotting Capital vs Final Policy\n",
"\n",
"# Implement!\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: DP/Policy Evaluation Solution.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from IPython.core.debugger import set_trace\n",
"import numpy as np\n",
"import pprint\n",
"import sys\n",
"if \"../\" not in sys.path:\n",
" sys.path.append(\"../\") \n",
"from lib.envs.gridworld import GridworldEnv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"pp = pprint.PrettyPrinter(indent=2)\n",
"env = GridworldEnv()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):\n",
"    \"\"\"\n",
"    Evaluate a policy given an environment and a full description of the environment's dynamics.\n",
"\n",
"    Args:\n",
"        policy: [S, A] shaped matrix representing the policy.\n",
"        env: OpenAI env. env.P represents the transition probabilities of the environment.\n",
"            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).\n",
"            env.nS is the number of states in the environment.\n",
"            env.nA is the number of actions in the environment.\n",
"        theta: We stop evaluation once our value function change is less than theta for all states.\n",
"        discount_factor: Gamma discount factor.\n",
"\n",
"    Returns:\n",
"        Vector of length env.nS representing the value function.\n",
"    \"\"\"\n",
"    # Start with an all-zero value function\n",
"    V = np.zeros(env.nS)\n",
"    while True:\n",
"        delta = 0\n",
"        # For each state, perform a \"full backup\"\n",
"        for s in range(env.nS):\n",
"            v = 0\n",
"            # Look at the possible next actions\n",
"            for a, action_prob in enumerate(policy[s]):\n",
"                # For each action, look at the possible next states...\n",
"                for prob, next_state, reward, done in env.P[s][a]:\n",
"                    # Calculate the expected value. Ref: Sutton book eq. 4.6.\n",
"                    v += action_prob * prob * (reward + discount_factor * V[next_state])\n",
"            # How much our value function changed (across any states)\n",
"            delta = max(delta, np.abs(v - V[s]))\n",
"            V[s] = v\n",
"        # Stop evaluating once our value function change is below a threshold\n",
"        if delta < theta:\n",
"            break\n",
"    return np.array(V)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"random_policy = np.ones([env.nS, env.nA]) / env.nA\n",
"v = policy_eval(random_policy, env)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Value Function:\n",
"[ 0. -13.99993529 -19.99990698 -21.99989761 -13.99993529\n",
" -17.9999206 -19.99991379 -19.99991477 -19.99990698 -19.99991379\n",
" -17.99992725 -13.99994569 -21.99989761 -19.99991477 -13.99994569\n",
" 0. ]\n",
"\n",
"Reshaped Grid Value Function:\n",
"[[ 0. -13.99993529 -19.99990698 -21.99989761]\n",
" [-13.99993529 -17.9999206 -19.99991379 -19.99991477]\n",
" [-19.99990698 -19.99991379 -17.99992725 -13.99994569]\n",
" [-21.99989761 -19.99991477 -13.99994569 0. ]]\n",
"\n"
]
}
],
"source": [
"print(\"Value Function:\")\n",
"print(v)\n",
"print(\"\")\n",
"\n",
"print(\"Reshaped Grid Value Function:\")\n",
"print(v.reshape(env.shape))\n",
"print(\"\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Test: Make sure the evaluated policy is what we expected\n",
"expected_v = np.array([0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22, -20, -14, 0])\n",
"np.testing.assert_array_almost_equal(v, expected_v, decimal=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: DP/Policy Evaluation.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import sys\n",
"if \"../\" not in sys.path:\n",
" sys.path.append(\"../\") \n",
"from lib.envs.gridworld import GridworldEnv"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"env = GridworldEnv()"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):\n",
"    \"\"\"\n",
"    Evaluate a policy given an environment and a full description of the environment's dynamics.\n",
"\n",
"    Args:\n",
"        policy: [S, A] shaped matrix representing the policy.\n",
"        env: OpenAI env. env.P represents the transition probabilities of the environment.\n",
"            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).\n",
"            env.nS is the number of states in the environment.\n",
"            env.nA is the number of actions in the environment.\n",
"        theta: We stop evaluation once our value function change is less than theta for all states.\n",
"        discount_factor: Gamma discount factor.\n",
"\n",
"    Returns:\n",
"        Vector of length env.nS representing the value function.\n",
"    \"\"\"\n",
"    # Start with an all-zero value function\n",
"    V = np.zeros(env.nS)\n",
"    while True:\n",
"        # TODO: Implement!\n",
"        break\n",
"    return np.array(V)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"random_policy = np.ones([env.nS, env.nA]) / env.nA\n",
"v = policy_eval(random_policy, env)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"ename": "AssertionError",
"evalue": "\nArrays are not almost equal to 2 decimals\n\n(mismatch 87.5%)\n x: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n 0., 0., 0.])\n y: array([ 0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22,\n -20, -14, 0])",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-22-235f39fb115c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Test: Make sure the evaluated policy is what we expected\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mexpected_v\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m14\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m20\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m22\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m14\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m18\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m20\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m20\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m20\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m20\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m18\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m14\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m22\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m20\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m14\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtesting\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0massert_array_almost_equal\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexpected_v\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdecimal\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/numpy/testing/utils.py\u001b[0m in \u001b[0;36massert_array_almost_equal\u001b[0;34m(x, y, decimal, err_msg, verbose)\u001b[0m\n\u001b[1;32m 914\u001b[0m assert_array_compare(compare, x, y, err_msg=err_msg, verbose=verbose,\n\u001b[1;32m 915\u001b[0m \u001b[0mheader\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Arrays are not almost equal to %d decimals'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mdecimal\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 916\u001b[0;31m precision=decimal)\n\u001b[0m\u001b[1;32m 917\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 918\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/numpy/testing/utils.py\u001b[0m in \u001b[0;36massert_array_compare\u001b[0;34m(comparison, x, y, err_msg, verbose, header, precision)\u001b[0m\n\u001b[1;32m 735\u001b[0m names=('x', 'y'), precision=precision)\n\u001b[1;32m 736\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mcond\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 737\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mAssertionError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 738\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 739\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtraceback\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mAssertionError\u001b[0m: \nArrays are not almost equal to 2 decimals\n\n(mismatch 87.5%)\n x: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n 0., 0., 0.])\n y: array([ 0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22,\n -20, -14, 0])"
]
}
],
"source": [
"# Test: Make sure the evaluated policy is what we expected\n",
"expected_v = np.array([0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22, -20, -14, 0])\n",
"np.testing.assert_array_almost_equal(v, expected_v, decimal=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: DP/Policy Iteration Solution.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pprint\n",
"import sys\n",
"if \"../\" not in sys.path:\n",
" sys.path.append(\"../\") \n",
"from lib.envs.gridworld import GridworldEnv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"pp = pprint.PrettyPrinter(indent=2)\n",
"env = GridworldEnv()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Taken from Policy Evaluation Exercise!\n",
"\n",
"def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):\n",
"    \"\"\"\n",
"    Evaluate a policy given an environment and a full description of the environment's dynamics.\n",
"\n",
"    Args:\n",
"        policy: [S, A] shaped matrix representing the policy.\n",
"        env: OpenAI env. env.P represents the transition probabilities of the environment.\n",
"            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).\n",
"            env.nS is the number of states in the environment.\n",
"            env.nA is the number of actions in the environment.\n",
"        theta: We stop evaluation once our value function change is less than theta for all states.\n",
"        discount_factor: Gamma discount factor.\n",
"\n",
"    Returns:\n",
"        Vector of length env.nS representing the value function.\n",
"    \"\"\"\n",
"    # Start with an all-zero value function\n",
"    V = np.zeros(env.nS)\n",
"    while True:\n",
"        delta = 0\n",
"        # For each state, perform a \"full backup\"\n",
"        for s in range(env.nS):\n",
"            v = 0\n",
"            # Look at the possible next actions\n",
"            for a, action_prob in enumerate(policy[s]):\n",
"                # For each action, look at the possible next states...\n",
"                for prob, next_state, reward, done in env.P[s][a]:\n",
"                    # Calculate the expected value\n",
"                    v += action_prob * prob * (reward + discount_factor * V[next_state])\n",
"            # How much our value function changed (across any states)\n",
"            delta = max(delta, np.abs(v - V[s]))\n",
"            V[s] = v\n",
"        # Stop evaluating once our value function change is below a threshold\n",
"        if delta < theta:\n",
"            break\n",
"    return np.array(V)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):\n",
"    \"\"\"\n",
"    Policy Improvement Algorithm. Iteratively evaluates and improves a policy\n",
"    until an optimal policy is found.\n",
"\n",
"    Args:\n",
"        env: The OpenAI environment.\n",
"        policy_eval_fn: Policy Evaluation function that takes 3 arguments:\n",
"            policy, env, discount_factor.\n",
"        discount_factor: gamma discount factor.\n",
"\n",
"    Returns:\n",
"        A tuple (policy, V).\n",
"        policy is the optimal policy, a matrix of shape [S, A] where each state s\n",
"        contains a valid probability distribution over actions.\n",
"        V is the value function for the optimal policy.\n",
"\n",
"    \"\"\"\n",
"\n",
"    def one_step_lookahead(state, V):\n",
"        \"\"\"\n",
"        Helper function to calculate the value for all actions in a given state.\n",
"\n",
"        Args:\n",
"            state: The state to consider (int)\n",
"            V: The value to use as an estimator, Vector of length env.nS\n",
"\n",
"        Returns:\n",
"            A vector of length env.nA containing the expected value of each action.\n",
"        \"\"\"\n",
"        A = np.zeros(env.nA)\n",
"        for a in range(env.nA):\n",
"            for prob, next_state, reward, done in env.P[state][a]:\n",
"                A[a] += prob * (reward + discount_factor * V[next_state])\n",
"        return A\n",
"\n",
"    # Start with a random policy\n",
"    policy = np.ones([env.nS, env.nA]) / env.nA\n",
"\n",
"    while True:\n",
"        # Evaluate the current policy\n",
"        V = policy_eval_fn(policy, env, discount_factor)\n",
"\n",
"        # Will be set to false if we make any changes to the policy\n",
"        policy_stable = True\n",
"\n",
"        # For each state...\n",
"        for s in range(env.nS):\n",
"            # The best action we would take under the current policy\n",
"            chosen_a = np.argmax(policy[s])\n",
"\n",
"            # Find the best action by one-step lookahead\n",
"            # Ties are resolved arbitrarily\n",
"            action_values = one_step_lookahead(s, V)\n",
"            best_a = np.argmax(action_values)\n",
"\n",
"            # Greedily update the policy\n",
"            if chosen_a != best_a:\n",
"                policy_stable = False\n",
"            policy[s] = np.eye(env.nA)[best_a]\n",
"\n",
"        # If the policy is stable we've found an optimal policy. Return it\n",
"        if policy_stable:\n",
"            return policy, V"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Policy Probability Distribution:\n",
"[[1. 0. 0. 0.]\n",
" [0. 0. 0. 1.]\n",
" [0. 0. 0. 1.]\n",
" [0. 0. 1. 0.]\n",
" [1. 0. 0. 0.]\n",
" [1. 0. 0. 0.]\n",
" [1. 0. 0. 0.]\n",
" [0. 0. 1. 0.]\n",
" [1. 0. 0. 0.]\n",
" [1. 0. 0. 0.]\n",
" [0. 1. 0. 0.]\n",
" [0. 0. 1. 0.]\n",
" [1. 0. 0. 0.]\n",
" [0. 1. 0. 0.]\n",
" [0. 1. 0. 0.]\n",
" [1. 0. 0. 0.]]\n",
"\n",
"Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):\n",
"[[0 3 3 2]\n",
" [0 0 0 2]\n",
" [0 0 1 2]\n",
" [0 1 1 0]]\n",
"\n",
"Value Function:\n",
"[ 0. -1. -2. -3. -1. -2. -3. -2. -2. -3. -2. -1. -3. -2. -1. 0.]\n",
"\n",
"Reshaped Grid Value Function:\n",
"[[ 0. -1. -2. -3.]\n",
" [-1. -2. -3. -2.]\n",
" [-2. -3. -2. -1.]\n",
" [-3. -2. -1. 0.]]\n",
"\n"
]
}
],
"source": [
"policy, v = policy_improvement(env)\n",
"print(\"Policy Probability Distribution:\")\n",
"print(policy)\n",
"print(\"\")\n",
"\n",
"print(\"Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):\")\n",
"print(np.reshape(np.argmax(policy, axis=1), env.shape))\n",
"print(\"\")\n",
"\n",
"print(\"Value Function:\")\n",
"print(v)\n",
"print(\"\")\n",
"\n",
"print(\"Reshaped Grid Value Function:\")\n",
"print(v.reshape(env.shape))\n",
"print(\"\")\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Test the value function\n",
"expected_v = np.array([ 0, -1, -2, -3, -1, -2, -3, -2, -2, -3, -2, -1, -3, -2, -1, 0])\n",
"np.testing.assert_array_almost_equal(v, expected_v, decimal=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: DP/Policy Iteration.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pprint\n",
"import sys\n",
"if \"../\" not in sys.path:\n",
" sys.path.append(\"../\") \n",
"from lib.envs.gridworld import GridworldEnv"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pp = pprint.PrettyPrinter(indent=2)\n",
"env = GridworldEnv()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Taken from Policy Evaluation Exercise!\n",
"\n",
"def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):\n",
"    \"\"\"\n",
"    Evaluate a policy given an environment and a full description of the environment's dynamics.\n",
"\n",
"    Args:\n",
"        policy: [S, A] shaped matrix representing the policy.\n",
"        env: OpenAI env. env.P represents the transition probabilities of the environment.\n",
"            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).\n",
"            env.nS is the number of states in the environment.\n",
"            env.nA is the number of actions in the environment.\n",
"        theta: We stop evaluation once our value function change is less than theta for all states.\n",
"        discount_factor: Gamma discount factor.\n",
"\n",
"    Returns:\n",
"        Vector of length env.nS representing the value function.\n",
"    \"\"\"\n",
"    # Start with an all-zero value function\n",
"    V = np.zeros(env.nS)\n",
"    while True:\n",
"        delta = 0\n",
"        # For each state, perform a \"full backup\"\n",
"        for s in range(env.nS):\n",
"            v = 0\n",
"            # Look at the possible next actions\n",
"            for a, action_prob in enumerate(policy[s]):\n",
"                # For each action, look at the possible next states...\n",
"                for prob, next_state, reward, done in env.P[s][a]:\n",
"                    # Calculate the expected value\n",
"                    v += action_prob * prob * (reward + discount_factor * V[next_state])\n",
"            # How much our value function changed (across any states)\n",
"            delta = max(delta, np.abs(v - V[s]))\n",
"            V[s] = v\n",
"        # Stop evaluating once our value function change is below a threshold\n",
"        if delta < theta:\n",
"            break\n",
"    return np.array(V)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):\n",
"    \"\"\"\n",
"    Policy Improvement Algorithm. Iteratively evaluates and improves a policy\n",
"    until an optimal policy is found.\n",
"\n",
"    Args:\n",
"        env: The OpenAI environment.\n",
"        policy_eval_fn: Policy Evaluation function that takes 3 arguments:\n",
"            policy, env, discount_factor.\n",
"        discount_factor: gamma discount factor.\n",
"\n",
"    Returns:\n",
"        A tuple (policy, V).\n",
"        policy is the optimal policy, a matrix of shape [S, A] where each state s\n",
"        contains a valid probability distribution over actions.\n",
"        V is the value function for the optimal policy.\n",
"\n",
"    \"\"\"\n",
"    # Start with a random policy\n",
"    policy = np.ones([env.nS, env.nA]) / env.nA\n",
"\n",
"    while True:\n",
"        # Implement this!\n",
"        break\n",
"\n",
"    return policy, np.zeros(env.nS)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Policy Probability Distribution:\n",
"[[ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]\n",
" [ 0.25 0.25 0.25 0.25]]\n",
"\n",
"Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):\n",
"[[0 0 0 0]\n",
" [0 0 0 0]\n",
" [0 0 0 0]\n",
" [0 0 0 0]]\n",
"\n",
"Value Function:\n",
"[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
"\n",
"Reshaped Grid Value Function:\n",
"[[ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]]\n",
"\n"
]
}
],
"source": [
"policy, v = policy_improvement(env)\n",
"print(\"Policy Probability Distribution:\")\n",
"print(policy)\n",
"print(\"\")\n",
"\n",
"print(\"Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):\")\n",
"print(np.reshape(np.argmax(policy, axis=1), env.shape))\n",
"print(\"\")\n",
"\n",
"print(\"Value Function:\")\n",
"print(v)\n",
"print(\"\")\n",
"\n",
"print(\"Reshaped Grid Value Function:\")\n",
"print(v.reshape(env.shape))\n",
"print(\"\")\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"ename": "AssertionError",
"evalue": "\nArrays are not almost equal to 2 decimals\n\n(mismatch 87.5%)\n x: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n 0., 0., 0.])\n y: array([ 0, -1, -2, -3, -1, -2, -3, -2, -2, -3, -2, -1, -3, -2, -1, 0])",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-15-55581f8eb5c9>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Test the value function\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mexpected_v\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtesting\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0massert_array_almost_equal\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexpected_v\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdecimal\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/numpy/testing/utils.py\u001b[0m in \u001b[0;36massert_array_almost_equal\u001b[0;34m(x, y, decimal, err_msg, verbose)\u001b[0m\n\u001b[1;32m 914\u001b[0m assert_array_compare(compare, x, y, err_msg=err_msg, verbose=verbose,\n\u001b[1;32m 915\u001b[0m \u001b[0mheader\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Arrays are not almost equal to %d decimals'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mdecimal\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 916\u001b[0;31m precision=decimal)\n\u001b[0m\u001b[1;32m 917\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 918\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/numpy/testing/utils.py\u001b[0m in \u001b[0;36massert_array_compare\u001b[0;34m(comparison, x, y, err_msg, verbose, header, precision)\u001b[0m\n\u001b[1;32m 735\u001b[0m names=('x', 'y'), precision=precision)\n\u001b[1;32m 736\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mcond\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 737\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mAssertionError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 738\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 739\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtraceback\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mAssertionError\u001b[0m: \nArrays are not almost equal to 2 decimals\n\n(mismatch 87.5%)\n x: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n 0., 0., 0.])\n y: array([ 0, -1, -2, -3, -1, -2, -3, -2, -2, -3, -2, -1, -3, -2, -1, 0])"
]
}
],
"source": [
"# Test the value function\n",
"expected_v = np.array([ 0, -1, -2, -3, -1, -2, -3, -2, -2, -3, -2, -1, -3, -2, -1, 0])\n",
"np.testing.assert_array_almost_equal(v, expected_v, decimal=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: DP/README.md
================================================
## Model-Based RL: Policy and Value Iteration using Dynamic Programming
### Learning Goals
- Understand the difference between Policy Evaluation and Policy Improvement and how these processes interact
- Understand the Policy Iteration Algorithm
- Understand the Value Iteration Algorithm
- Understand the Limitations of Dynamic Programming Approaches
### Summary
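- Dynamic Programming (DP) methods assume that we have a perfect model of the environment's Markov Decision Process (MDP). That's usually not the case in practice, but DP is still worth studying: most other RL methods can be viewed as attempts to achieve the same effect with less computation and without a perfect model.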
- Policy Evaluation: Calculates the state-value function `V(s)` for a given policy. In DP this is done using a "full backup". At each state, we look ahead one step at each possible action and next state. We can only do this because we have a perfect model of the environment.
- Full backups are basically the Bellman equations turned into updates.
- Policy Improvement: Given the correct state-value function for a policy we can act greedily with respect to it (i.e. pick the best action at each state). Then we are guaranteed to improve the policy or keep it fixed if it's already optimal.
- Policy Iteration: Iteratively perform Policy Evaluation and Policy Improvement until we reach the optimal policy.
- Value Iteration: Instead of running Policy Evaluation until it converges to the "correct" `V(s)`, we do only a single sweep of evaluation and improve the policy immediately. In practice, this often converges faster than Policy Iteration.
- Generalized Policy Iteration: The process of iteratively doing policy evaluation and improvement. We can pick different algorithms for each of these steps but the basic idea stays the same.
- DP methods bootstrap: They update estimates based on other estimates (one step ahead).
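
The evaluation/improvement loop summarized above can be sketched in a few lines. This is a minimal illustration, not the notebooks' solution: it assumes a hypothetical 1x4 corridor (state 0 is terminal, every other move costs -1) instead of the repo's `GridworldEnv`, and the helper names `build_transitions` and `q_values` are invented for this example. Only the transition format `P[s][a] = [(prob, next_state, reward, done), ...]` mirrors `env.P` from the exercises.

```python
# Toy 1x4 corridor (hypothetical, NOT the repo's GridworldEnv).
# State 0 is terminal; every move from another state costs -1.
# Actions: 0 = left, 1 = right.
N_STATES, N_ACTIONS = 4, 2


def build_transitions():
    """Return P with P[s][a] = [(prob, next_state, reward, done)], like env.P."""
    P = {}
    for s in range(N_STATES):
        P[s] = {}
        for a in range(N_ACTIONS):
            if s == 0:
                # Terminal state: stay put, no further reward.
                P[s][a] = [(1.0, 0, 0.0, True)]
            else:
                step = -1 if a == 0 else 1
                ns = min(max(s + step, 0), N_STATES - 1)
                P[s][a] = [(1.0, ns, -1.0, ns == 0)]
    return P


def q_values(P, V, s, gamma):
    """Full backup: one-step lookahead over every action and next state."""
    return [sum(p * (r + gamma * V[ns]) for p, ns, r, _ in P[s][a])
            for a in range(N_ACTIONS)]


def policy_eval(P, policy, gamma=1.0, theta=1e-8):
    """Sweep all states until the largest change in V falls below theta."""
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            q = q_values(P, V, s, gamma)
            v = sum(policy[s][a] * q[a] for a in range(N_ACTIONS))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V


def policy_iteration(P, gamma=1.0):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    # Start with a uniform random policy.
    policy = [[1.0 / N_ACTIONS] * N_ACTIONS for _ in range(N_STATES)]
    while True:
        V = policy_eval(P, policy, gamma)
        new_policy = []
        for s in range(N_STATES):
            q = q_values(P, V, s, gamma)
            best = max(range(N_ACTIONS), key=lambda a: q[a])
            new_policy.append([1.0 if a == best else 0.0
                               for a in range(N_ACTIONS)])
        if new_policy == policy:
            return policy, V
        policy = new_policy


policy, V = policy_iteration(build_transitions())
print(policy)
print(V)
```

In this sketch the stopping test compares whole policy tables rather than per-state argmaxes, so the initial uniform policy is never mistaken for a stable greedy one. Note that with `gamma=1.0` evaluation only terminates for policies that eventually reach the terminal state, which holds here because the loop starts from the uniform random policy.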
### Lectures & Readings
**Required:**
- David Silver's RL Course Lecture 3 - Planning by Dynamic Programming ([video](https://www.youtube.com/watch?v=Nd1-UUMVfz4), [slides](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf))
**Optional:**
- [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/RLbook2018.pdf) - Chapter 4: Dynamic Programming
### Exercises
- Implement Policy Evaluation in Python (Gridworld)
- [Exercise](Policy%20Evaluation.ipynb)
- [Solution](Policy%20Evaluation%20Solution.ipynb)
- Implement Policy Iteration in Python (Gridworld)
- [Exercise](Policy%20Iteration.ipynb)
- [Solution](Policy%20Iteration%20Solution.ipynb)
- Implement Value Iteration in Python (Gridworld)
- [Exercise](Value%20Iteration.ipynb)
- [Solution](Value%20Iteration%20Solution.ipynb)
- Implement Gambler's Problem
- [Exercise](Gamblers%20Problem.ipynb)
- [Solution](Gamblers%20Problem%20Solution.ipynb)
================================================
FILE: DP/Value Iteration Solution.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pprint\n",
"import sys\n",
"if \"../\" not in sys.path:\n",
" sys.path.append(\"../\") \n",
"from lib.envs.gridworld import GridworldEnv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"pp = pprint.PrettyPrinter(indent=2)\n",
"env = GridworldEnv()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def value_iteration(env, theta=0.0001, discount_factor=1.0):\n",
" \"\"\"\n",
" Value Iteration Algorithm.\n",
" \n",
" Args:\n",
" env: OpenAI env. env.P represents the transition probabilities of the environment.\n",
" env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).\n",
" env.nS is a number of states in the environment. \n",
" env.nA is a number of actions in the environment.\n",
" theta: We stop evaluation once our value function change is less than theta for all states.\n",
" discount_factor: Gamma discount factor.\n",
" \n",
" Returns:\n",
" A tuple (policy, V) of the optimal policy and the optimal value function.\n",
" \"\"\"\n",
" \n",
" def one_step_lookahead(state, V):\n",
" \"\"\"\n",
" Helper function to calculate the value for all action in a given state.\n",
" \n",
" Args:\n",
" state: The state to consider (int)\n",
" V: The value to use as an estimator, Vector of length env.nS\n",
" \n",
" Returns:\n",
" A vector of length env.nA containing the expected value of each action.\n",
" \"\"\"\n",
" A = np.zeros(env.nA)\n",
" for a in range(env.nA):\n",
" for prob, next_state, reward, done in env.P[state][a]:\n",
" A[a] += prob * (reward + discount_factor * V[next_state])\n",
" return A\n",
" \n",
" V = np.zeros(env.nS)\n",
" while True:\n",
" # Stopping condition\n",
" delta = 0\n",
" # Update each state...\n",
" for s in range(env.nS):\n",
" # Do a one-step lookahead to find the best action\n",
" A = one_step_lookahead(s, V)\n",
" best_action_value = np.max(A)\n",
" # Calculate delta across all states seen so far\n",
" delta = max(delta, np.abs(best_action_value - V[s]))\n",
" # Update the value function. Ref: Sutton book eq. 4.10. \n",
" V[s] = best_action_value \n",
" # Check if we can stop \n",
" if delta < theta:\n",
" break\n",
" \n",
" # Create a deterministic policy using the optimal value function\n",
" policy = np.zeros([env.nS, env.nA])\n",
" for s in range(env.nS):\n",
" # One step lookahead to find the best action for this state\n",
" A = one_step_lookahead(s, V)\n",
" best_action = np.argmax(A)\n",
" # Always take the best action\n",
" policy[s, best_action] = 1.0\n",
" \n",
" return policy, V"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Policy Probability Distribution:\n",
"[[1. 0. 0. 0.]\n",
" [0. 0. 0. 1.]\n",
" [0. 0. 0. 1.]\n",
" [0. 0. 1. 0.]\n",
" [1. 0. 0. 0.]\n",
" [1. 0. 0. 0.]\n",
" [1. 0. 0. 0.]\n",
" [0. 0. 1. 0.]\n",
" [1. 0. 0. 0.]\n",
" [1. 0. 0. 0.]\n",
" [0. 1. 0. 0.]\n",
" [0. 0. 1. 0.]\n",
" [1. 0. 0. 0.]\n",
" [0. 1. 0. 0.]\n",
" [0. 1. 0. 0.]\n",
" [1. 0. 0. 0.]]\n",
"\n",
"Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):\n",
"[[0 3 3 2]\n",
" [0 0 0 2]\n",
" [0 0 1 2]\n",
" [0 1 1 0]]\n",
"\n",
"Value Function:\n",
"[ 0. -1. -2. -3. -1. -2. -3. -2. -2. -3. -2. -1. -3. -2. -1. 0.]\n",
"\n",
"Reshaped Grid Value Function:\n",
"[[ 0. -1. -2. -3.]\n",
" [-1. -2. -3. -2.]\n",
" [-2. -3. -2. -1.]\n",
" [-3. -2. -1. 0.]]\n",
"\n"
]
}
],
"source": [
"policy, v = value_iteration(env)\n",
"\n",
"print(\"Policy Probability Distribution:\")\n",
"print(policy)\n",
"print(\"\")\n",
"\n",
"print(\"Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):\")\n",
"print(np.reshape(np.argmax(policy, axis=1), env.shape))\n",
"print(\"\")\n",
"\n",
"print(\"Value Function:\")\n",
"print(v)\n",
"print(\"\")\n",
"\n",
"print(\"Reshaped Grid Value Function:\")\n",
"print(v.reshape(env.shape))\n",
"print(\"\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Test the value function\n",
"expected_v = np.array([ 0, -1, -2, -3, -1, -2, -3, -2, -2, -3, -2, -1, -3, -2, -1, 0])\n",
"np.testing.assert_array_almost_equal(v, expected_v, decimal=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: DP/Value Iteration.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pprint\n",
"import sys\n",
"if \"../\" not in sys.path:\n",
" sys.path.append(\"../\") \n",
"from lib.envs.gridworld import GridworldEnv"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pp = pprint.PrettyPrinter(indent=2)\n",
"env = GridworldEnv()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def value_iteration(env, theta=0.0001, discount_factor=1.0):\n",
" \"\"\"\n",
" Value Iteration Algorithm.\n",
" \n",
" Args:\n",
" env: OpenAI env. env.P represents the transition probabilities of the environment.\n",
" env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).\n",
" env.nS is a number of states in the environment. \n",
" env.nA is a number of actions in the environment.\n",
" theta: We stop evaluation once our value function change is less than theta for all states.\n",
" discount_factor: Gamma discount factor.\n",
" \n",
" Returns:\n",
" A tuple (policy, V) of the optimal policy and the optimal value function. \n",
" \"\"\"\n",
" \n",
"\n",
" V = np.zeros(env.nS)\n",
" policy = np.zeros([env.nS, env.nA])\n",
" \n",
" # Implement!\n",
" return policy, V"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Policy Probability Distribution:\n",
"[[ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]]\n",
"\n",
"Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):\n",
"[[0 0 0 0]\n",
" [0 0 0 0]\n",
" [0 0 0 0]\n",
" [0 0 0 0]]\n",
"\n",
"Value Function:\n",
"[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
"\n",
"Reshaped Grid Value Function:\n",
"[[ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0.]]\n",
"\n"
]
}
],
"source": [
"policy, v = value_iteration(env)\n",
"\n",
"print(\"Policy Probability Distribution:\")\n",
"print(policy)\n",
"print(\"\")\n",
"\n",
"print(\"Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):\")\n",
"print(np.reshape(np.argmax(policy, axis=1), env.shape))\n",
"print(\"\")\n",
"\n",
"print(\"Value Function:\")\n",
"print(v)\n",
"print(\"\")\n",
"\n",
"print(\"Reshaped Grid Value Function:\")\n",
"print(v.reshape(env.shape))\n",
"print(\"\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"ename": "AssertionError",
"evalue": "\nArrays are not almost equal to 2 decimals\n\n(mismatch 87.5%)\n x: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n 0., 0., 0.])\n y: array([ 0, -1, -2, -3, -1, -2, -3, -2, -2, -3, -2, -1, -3, -2, -1, 0])",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-7-55581f8eb5c9>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Test the value function\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mexpected_v\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtesting\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0massert_array_almost_equal\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexpected_v\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdecimal\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/numpy/testing/utils.py\u001b[0m in \u001b[0;36massert_array_almost_equal\u001b[0;34m(x, y, decimal, err_msg, verbose)\u001b[0m\n\u001b[1;32m 914\u001b[0m assert_array_compare(compare, x, y, err_msg=err_msg, verbose=verbose,\n\u001b[1;32m 915\u001b[0m \u001b[0mheader\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Arrays are not almost equal to %d decimals'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mdecimal\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 916\u001b[0;31m precision=decimal)\n\u001b[0m\u001b[1;32m 917\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 918\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/dennybritz/venvs/tf/lib/python3.5/site-packages/numpy/testing/utils.py\u001b[0m in \u001b[0;36massert_array_compare\u001b[0;34m(comparison, x, y, err_msg, verbose, header, precision)\u001b[0m\n\u001b[1;32m 735\u001b[0m names=('x', 'y'), precision=precision)\n\u001b[1;32m 736\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mcond\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 737\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mAssertionError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 738\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 739\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtraceback\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mAssertionError\u001b[0m: \nArrays are not almost equal to 2 decimals\n\n(mismatch 87.5%)\n x: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n 0., 0., 0.])\n y: array([ 0, -1, -2, -3, -1, -2, -3, -2, -2, -3, -2, -1, -3, -2, -1, 0])"
]
}
],
"source": [
"# Test the value function\n",
"expected_v = np.array([ 0, -1, -2, -3, -1, -2, -3, -2, -2, -3, -2, -1, -3, -2, -1, 0])\n",
"np.testing.assert_array_almost_equal(v, expected_v, decimal=2)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: DQN/.gitignore
================================================
experiments/
================================================
FILE: DQN/Breakout Playground.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import gym\n",
"import numpy as np\n",
"from matplotlib import pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[2016-11-16 23:36:18,386] Making new env: Breakout-v0\n"
]
}
],
"source": [
"env = gym.envs.make(\"Breakout-v0\")"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Action space size: 6\n",
"['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']\n",
"Observation space shape: (210, 160, 3)\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAM8AAAEACAYAAAAUSCKKAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEGNJREFUeJzt3X+Q1PV9x/Hn+w4OQRQPBo4ACug10QxtL4xzaqgTSiie\ndhpjZzSO06k/pjN2mkw7mWkDJH84/UvpTJrWSa1jYq1p1ahNE+k0OS/4g5o/QB1EMALeqigncpBB\nIILA3e27f3y/h3vn7t3u+7vL7p2vx8wOu5/v9/v+vm/Z136/+73vd8/cHRGpXFO9GxCZqBQekSCF\nRyRI4REJUnhEghQekaCahcfMusxst5m9YWZra7UekXqxWvyex8yagDeALwP7gZeAm919d9VXJlIn\ntdrydAK97v6Ouw8APwaur9G6ROqiVuFZCOwreNyXjolMGrUKjxUZ03lAMqlMqVHdPuCigseLSD77\nnGFmCpNMCO5ebGNQsy3PS0C7mS02sxbgZmBjjdYlUhc12fK4+5CZfQPoIQnog+6+qxbrEqmXmhyq\nLmvF2m2TCaLUblutPvNMWB0dHSxbtmzE2IEDB5g/f37JZZ5//nn6+vrOPL700ku5/PLLx11XYd2t\nW7fS29t7ZtrixYu5+uqrK+r91VdfZefOnRUtM57zzz+fzs7OMX/+0fbt28fmzZur2kcxK1asYOnS\npWce53I5tmzZUvP1DlN4RlmwYAHLly+vaJlXXnllRHjmzp1bcY0333xzRHhaW1srrnHw4MGqh2fa\ntGlcd911FS0zderUsxKeJUuWjHiOBgYGFJ5GcujQIV588cUzj82MNWvWMGVK+U/dsWPHeOGFF0aM\nrVy5knPPPbfsGqdOnWLTpk0jxq666ipmz55ddo1q2bx5Mx9++GHJ6QcPHjyL3dSPwjOOw4cP88wz\nz5x53NTUxKpVqyoKz/Hjx0fUAOjs7KwoPAMDA5+ocdlll9UlPFu2bKG/v/+sr7fRKDxSsY6ODo4d\nO1Zy+uHDh9mzZ89Z7Kg+FB6p2DXXXDPm9J07dyo8IqdPn2b37rFPhm9ra6O1tfUsddQ4FB4Z09Gj\nR3nggQfGnOeGG26o+LD6ZKDwjKOlpYW5c+eeedzU1IRZ0d+ZldTc3DyixvBYJZqamj5RY+rUqRXV\niGhubh73oMT06dNr3kcjUnjGsXTpUtavX5+pxrx58zLXmDFjRuYaEbNnz67LeicChWeUfD7P4OBg\nRcuMPsUpUiOfz1e9RrVU2sfQ0FBN+hht9HN0ttY7TOe2jdLU1DRi12ys52d4nsHBwRHzRWoMDQ2N\nePGbGc3NzZlqVMvw77TG62V4ej6fPysv5MLnZ7ivWqy31LltCo/IOBryxNC1a/WlOtLYNmzYUHJa\nXcPT1tZWz9WLZKIvPRQJUnhEghQekSCFRyRI4REJUnhEghQekSCFRyRI4REJUnhEghQekSCFRyRI\n4REJUnhEghr2MuytW7fy3HPP1bsNmeRWrVpFZ2dnaNmGDc+JEyc+Nd95LPVz4sSJ8LLabRMJUnhE\nghQekSCFRyRI4REJUnhEghQekSCFRyRI4REJUnhEghQekSCFRyRI4REJynRWtZntBY4CeWDA3TvN\nrBV4HFgM7AVucvejGfsUaThZtzx5YKW7f8Hdhy+KWAdscvfPAc8C+oOWMillDY8VqXE98HB6/2Hg\nqxnXIdKQsobHgafN7CUz+4t0rM3d+wHc/QAwt+TSIhNY1itJv+juB8xsLtBjZntIAiUy6WUKT7pl\nwd0PmdnPgE6g38za3L3fzOYDJa+l7u7uPnO/vb2d9vb2LO2IZJbL5cjlcmXNGw6Pmc0Amtz9QzM7\nF1gD/D2wEbgN2ADcCjxVqkZXV1d09SI1MfpNvKenp+S8WbY8bcBP0z8JPwV4xN17zOxl4AkzuwN4\nF7gxwzpEGlY4PO7+NtBRZPwwsDpLUyITgc4w
EAlSeESCGvZLD+dMm8bnZ82qdxsyyc2ZNi28bMOG\np2vBAv7yqqvq3YZMcns/8xn2B5fVbptIkMIjEqTwiAQpPCJBCo9IUMMebfOZA+QXHq93GzLJ+XkD\n4WUbNjxMycP0oXp3IZNdc/wKGu22iQQpPCJBCo9IkMIjEtSwBwyGmvOcnBo/EiJSjsHmfHjZhg3P\nQHOeE9MVHqmtwSnxI7rabRMJUnhEghQekSCFRySoYQ8YYI6bvnxUaivLK6xhw3OyNc8HC3W0TWrr\n1Ik8nIwt27DhKfr3F0SqLMuWRy9PkSCFRyRI4REJUnhEghr2gMH7fg6H8631bkMmuTmcQ/R7aRs2\nPEdoIcd59W5DJrkmpobDo902kSCFRyRI4REJUnhEghr2gIF/NJP8Rwvr3YZMcs7M5FSwgIYNT/6d\n32PwjSX1bkMmufxn98KS2F/o0W6bSJDCIxKk8IgEKTwiQQqPSFDDHm07sP9pXt7yUr3bkElu9nmd\nXLJkWWjZhg3P6VO/4diR1+rdhkxyp09dHF5Wu20iQeOGx8weNLN+M9tRMNZqZj1mtsfMnjazWQXT\n7jWzXjPbbmYdtWpcpN7K2fI8BFwzamwdsMndPwc8C6wHMLNrgUvc/XeAO4H7q9irSEMZNzzu/ivg\ng1HD1wMPp/cfTh8Pj/8oXW4rMMvM2qrTqkhjiX7mmefu/QDufgCYl44vBPYVzPdeOiYy6VT7gEGx\n81P1nbkyKUUPVfebWZu795vZfOBgOt4HXFgw3yKg5Cmr3d3dZ+63t7fT3t4ebEekOnK5HLlcrqx5\nyw2PMXKrshG4DdiQ/vtUwfjXgcfN7ErgyPDuXTFdXV1lrl7k7Bj9Jt7T01Ny3nHDY2aPAiuBOWb2\nLnAXcA/wpJndAbwL3Ajg7j83s+vMLAccB26P/xgijW3c8Lj7LSUmrS4x/zcydSQyQegMA5EghUck\nSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEg\nhUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIU\nHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpGgccNjZg+aWb+Z\n7SgYu8vM+sxsW3rrKpi23sx6zWyXma2pVeMi9VbOluch4Joi4//o7svTWzeAmV0G3ARcBlwL3Gdm\nVrVuRRrIuOFx918BHxSZVCwU1wM/dvdBd98L9AKdmToUaVBZPvN83cy2m9kPzWxWOrYQ2Fcwz3vp\nmMikEw3PfcAl7t4BHAC+m44X2xp5cB0iDW1KZCF3P1Tw8AfA/6T3+4ALC6YtAvaXqtPd3X3mfnt7\nO+3t7ZF2RKoml8uRy+XKmrfc8BgFWxUzm+/uB9KHfwq8lt7fCDxiZt8j2V1rB14sVbSrq6vUJJG6\nGP0m3tPTU3LeccNjZo8CK4E5ZvYucBfwh2bWAeSBvcCdAO7+upk9AbwODAB/5e7abZNJadzwuPst\nRYYfGmP+u4G7szQlMhHoDAORIIVHJEjhEQlSeESCFB6RIIVHJEjhEQlSeESCFB6RIIVHJEjhEQlS\neESCFB6RIIVHJEjhEQlSeESCFB6RIIVHJEjhEQlSeESCFB6RIIVHJEjhEQlSeESCFB6RIIVHJEjh\nEQlSeESCFB6RIIVHJEjhEQlSeESCFB6RIIVHJEjhEQlSeESCyv1T8jXxTstAyWkfNA+dxU6kmNaW\nFr40b16mGieGhuh5//0qdVR95x05Qtu+faFl6xqenTNOlZz2/tTBs9iJFLNg+nTWLVuWqcb7H33U\n0OGZ09/PJbt2hZbVbptIkMIjElTX3TZpbKfzed45fjxTjUMnT1apm8aj8EhJvb/9LV974YV6t9Gw\nFB75VBty
53Q+H1pW4ZFPtft7e/lhLhda1tx97BnMFgE/AuYDQ8AP3P1eM2sFHgcWA3uBm9z9aLrM\nvcC1wHHgNnffXqSut8yaWXK9Q6dOM3TydORnEqkqd7dSE8a8kYSmI70/E9gDXApsAL6Vjq8F7knv\nXwv8b3r/CmBLibqum24T4VYyG+OFp8iL/mfAamA30FYQsF3p/fuBrxXMv2t4PoVHt4l4K5WFin7P\nY2ZLgA5gC0kg+kmqHwCGz+NYCBSe7/BeOiYyqZQdHjObCfwX8Dfu/iFJKovOWmSs1LwiE1ZZ4TGz\nKSTB+Q93fyod7jeztnT6fOBgOt4HXFiw+CJgf3XaFWkc5W55/g143d3/uWBsI3Bbev824KmC8T8H\nMLMrgSPDu3cik0k5h6pXAP8H7OTjD1HfBl4EniDZyrwL3OjuR9Jlvg90kRyqvt3dtxWpq105mRBK\nHaoeNzy1ovDIRFEqPDqrWiRI4REJUnhEghQekSCFRyRI4REJUnhEgur2ex6RiU5bHpEghUckqC7h\nMbMuM9ttZm+Y2dpgjUVm9qyZvW5mO83sr9PxVjPrMbM9Zva0mc0K1G4ys21mtjF9vMTMtqQ1H0vP\nMq+05iwze9LMdpnZr83siqy9mtk3zew1M9thZo+YWUukVzN70Mz6zWxHwVjJ3szsXjPrNbPtZtZR\nQc1/SH/+7Wb2EzM7v2Da+rTmLjNbU27Ngml/a2Z5M5tdSZ+ZVHoladYbSWBzJN99MBXYDlwaqFPR\n5eEV1v4m8J/AxvTx4yQnvgL8K3BnoOa/k5wkC8kXr8zK0iuwAHgLaCno8dZIr8AfkFzkuKNgLOtl\n9sVqrgaa0vv3AHen9z8PvJI+L0vS14eVUzMdXwR0A28DsyvpM9NruZZBKfGkXgn8ouDxOmBtFeqW\nujx8d4V1FgG/BFYWhOdQwX/6lUB3hTXPA94sMh7uNQ3PO0Br+qLbCPwRyXVVFfdK8ma2Y4zeKrrM\nvljNUdO+SnJ92CdeA8AvgCvKrQk8CfzuqPCU3Wf0Vo/dttGXafeR8TLtcS4Pn1thue8Bf0d69auZ\nzQE+cPfhL/fqI3nhVuJi4Ddm9lC6O/iAmc3I0qu77we+S3I5yHvAUWAbyfVTWXodNs9re5n9HcDP\ns9Y0sz8B9rn7zlGTav51APUIT1Uv067g8vByav0x0O/JV2UN92l8sudK1zEFWA78i7svJ7nOaV3G\nXi8Arid5J14AnEuyqzJatX8Xkfn/z8y+Awy4+2NZaprZdOA7wF3FJkdqVqIe4ekDLip4HL5Mu8LL\nw8uxAviKmb0FPAasAv4JmGVmw89VpN8+knfHl9PHPyEJU5ZeVwNvufthdx8Cfgp8EbggY6/DanKZ\nvZndClwH3FIwHK15CclnpFfN7O10uW1mNi9rn+WoR3heAtrNbLGZtQA3k+yvR4x3efitfHx5+Ljc\n/dvufpG7X5z29ay7/xnwHHBjpGZatx/YZ2afTYe+DPw6S68ku2tXmtk5ZmYFNaO9jt7CVuMy+xE1\nzawL+BbwFXcv/ONMG4Gb06OFS4F2kiuVx6zp7q+5+3x3v9jdl5IE5gvufrDCPmOq+QGqgg+7XSRH\nx3qBdcEaK0i+wXQ7yZGabWnd2cCmtP4vgQuC9b/ExwcMlgJbgTdIjmZNDdT7fZI3ju3Af5McbcvU\nK8nuyi5gB/AwydHLinsFHiV5Vz5FEsrbSQ5EFO0N+D7JEbFXgeUV1OwlOcixLb3dVzD/+rTmLmBN\nuTVHTX+L9IBBuX1muen0HJEgnWEgEqTwiAQpPCJBCo9IkMIjEqTwiAQpPCJBCo9I0P8DEdhXRvCY\nGIIAAAAASUVORK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x109103630>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAM8AAAEACAYAAAAUSCKKAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEGNJREFUeJzt3X+Q1PV9x/Hn+w4OQRQPBo4ACug10QxtL4xzaqgTSiie\ndhpjZzSO06k/pjN2mkw7mWkDJH84/UvpTJrWSa1jYq1p1ahNE+k0OS/4g5o/QB1EMALeqigncpBB\nIILA3e27f3y/h3vn7t3u+7vL7p2vx8wOu5/v9/v+vm/Z136/+73vd8/cHRGpXFO9GxCZqBQekSCF\nRyRI4REJUnhEghQekaCahcfMusxst5m9YWZra7UekXqxWvyex8yagDeALwP7gZeAm919d9VXJlIn\ntdrydAK97v6Ouw8APwaur9G6ROqiVuFZCOwreNyXjolMGrUKjxUZ03lAMqlMqVHdPuCigseLSD77\nnGFmCpNMCO5ebGNQsy3PS0C7mS02sxbgZmBjjdYlUhc12fK4+5CZfQPoIQnog+6+qxbrEqmXmhyq\nLmvF2m2TCaLUblutPvNMWB0dHSxbtmzE2IEDB5g/f37JZZ5//nn6+vrOPL700ku5/PLLx11XYd2t\nW7fS29t7ZtrixYu5+uqrK+r91VdfZefOnRUtM57zzz+fzs7OMX/+0fbt28fmzZur2kcxK1asYOnS\npWce53I5tmzZUvP1DlN4RlmwYAHLly+vaJlXXnllRHjmzp1bcY0333xzRHhaW1srrnHw4MGqh2fa\ntGlcd911FS0zderUsxKeJUuWjHiOBgYGFJ5GcujQIV588cUzj82MNWvWMGVK+U/dsWPHeOGFF0aM\nrVy5knPPPbfsGqdOnWLTpk0jxq666ipmz55ddo1q2bx5Mx9++GHJ6QcPHjyL3dSPwjOOw4cP88wz\nz5x53NTUxKpVqyoKz/Hjx0fUAOjs7KwoPAMDA5+ocdlll9UlPFu2bKG/v/+sr7fRKDxSsY6ODo4d\nO1Zy+uHDh9mzZ89Z7Kg+FB6p2DXXXDPm9J07dyo8IqdPn2b37rFPhm9ra6O1tfUsddQ4FB4Z09Gj\nR3nggQfGnOeGG26o+LD6ZKDwjKOlpYW5c+eeedzU1IRZ0d+ZldTc3DyixvBYJZqamj5RY+rUqRXV\niGhubh73oMT06dNr3kcjUnjGsXTpUtavX5+pxrx58zLXmDFjRuYaEbNnz67LeicChWeUfD7P4OBg\nRcuMPsUpUiOfz1e9RrVU2sfQ0FBN+hht9HN0ttY7TOe2jdLU1DRi12ys52d4nsHBwRHzRWoMDQ2N\nePGbGc3NzZlqVMvw77TG62V4ej6fPysv5MLnZ7ivWqy31LltCo/IOBryxNC1a/WlOtLYNmzYUHJa\nXcPT1tZWz9WLZKIvPRQJUnhEghQekSCFRyRI4REJUnhEghQekSCFRyRI4REJUnhEghQekSCFRyRI\n4REJUnhEghr2MuytW7fy3HPP1bsNmeRWrVpFZ2dnaNmGDc+JEyc+Nd95LPVz4sSJ8LLabRMJUnhE\nghQekSCFRyRI4REJUnhEghQekSCFRyRI4REJUnhEghQekSCFRyRI4REJynRWtZntBY4CeWDA3TvN\nrBV4HFgM7AVucvejGfsUaThZtzx5YKW7f8Hdhy+KWAdscvfPAc8C+oOWMillDY8VqXE98HB6/2Hg\nqxnXIdKQsobHgafN7CUz+4t0rM3d+wHc/QAwt+TSIhNY1itJv+juB8xsLtBjZntIAiUy6WUKT7pl\nwd0PmdnPgE6g38za3L3fzOYDJa+l7u7uPnO/vb2d9vb2LO2IZJbL5cjlcmXNGw6Pmc0Amtz9QzM7\nF1gD/D2wEbgN2ADcCjxVqkZXV1d09SI1MfpNvKenp+S8WbY8bcBP0z8JPwV4xN17zOxl4AkzuwN4\nF7gxwzpEGlY4PO7+NtBRZPwwsDpLUyITgc4w
EAlSeESCGvZLD+dMm8bnZ82qdxsyyc2ZNi28bMOG\np2vBAv7yqqvq3YZMcns/8xn2B5fVbptIkMIjEqTwiAQpPCJBCo9IUMMebfOZA+QXHq93GzLJ+XkD\n4WUbNjxMycP0oXp3IZNdc/wKGu22iQQpPCJBCo9IkMIjEtSwBwyGmvOcnBo/EiJSjsHmfHjZhg3P\nQHOeE9MVHqmtwSnxI7rabRMJUnhEghQekSCFRySoYQ8YYI6bvnxUaivLK6xhw3OyNc8HC3W0TWrr\n1Ik8nIwt27DhKfr3F0SqLMuWRy9PkSCFRyRI4REJUnhEghr2gMH7fg6H8631bkMmuTmcQ/R7aRs2\nPEdoIcd59W5DJrkmpobDo902kSCFRyRI4REJUnhEghr2gIF/NJP8Rwvr3YZMcs7M5FSwgIYNT/6d\n32PwjSX1bkMmufxn98KS2F/o0W6bSJDCIxKk8IgEKTwiQQqPSFDDHm07sP9pXt7yUr3bkElu9nmd\nXLJkWWjZhg3P6VO/4diR1+rdhkxyp09dHF5Wu20iQeOGx8weNLN+M9tRMNZqZj1mtsfMnjazWQXT\n7jWzXjPbbmYdtWpcpN7K2fI8BFwzamwdsMndPwc8C6wHMLNrgUvc/XeAO4H7q9irSEMZNzzu/ivg\ng1HD1wMPp/cfTh8Pj/8oXW4rMMvM2qrTqkhjiX7mmefu/QDufgCYl44vBPYVzPdeOiYy6VT7gEGx\n81P1nbkyKUUPVfebWZu795vZfOBgOt4HXFgw3yKg5Cmr3d3dZ+63t7fT3t4ebEekOnK5HLlcrqx5\nyw2PMXKrshG4DdiQ/vtUwfjXgcfN7ErgyPDuXTFdXV1lrl7k7Bj9Jt7T01Ny3nHDY2aPAiuBOWb2\nLnAXcA/wpJndAbwL3Ajg7j83s+vMLAccB26P/xgijW3c8Lj7LSUmrS4x/zcydSQyQegMA5EghUck\nSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEg\nhUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIU\nHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpEghUckSOERCVJ4RIIUHpGgccNjZg+aWb+Z\n7SgYu8vM+sxsW3rrKpi23sx6zWyXma2pVeMi9VbOluch4Joi4//o7svTWzeAmV0G3ARcBlwL3Gdm\nVrVuRRrIuOFx918BHxSZVCwU1wM/dvdBd98L9AKdmToUaVBZPvN83cy2m9kPzWxWOrYQ2Fcwz3vp\nmMikEw3PfcAl7t4BHAC+m44X2xp5cB0iDW1KZCF3P1Tw8AfA/6T3+4ALC6YtAvaXqtPd3X3mfnt7\nO+3t7ZF2RKoml8uRy+XKmrfc8BgFWxUzm+/uB9KHfwq8lt7fCDxiZt8j2V1rB14sVbSrq6vUJJG6\nGP0m3tPTU3LeccNjZo8CK4E5ZvYucBfwh2bWAeSBvcCdAO7+upk9AbwODAB/5e7abZNJadzwuPst\nRYYfGmP+u4G7szQlMhHoDAORIIVHJEjhEQlSeESCFB6RIIVHJEjhEQlSeESCFB6RIIVHJEjhEQlS\neESCFB6RIIVHJEjhEQlSeESCFB6RIIVHJEjhEQlSeESCFB6RIIVHJEjhEQlSeESCFB6RIIVHJEjh\nEQlSeESCFB6RIIVHJEjhEQlSeESCFB6RIIVHJEjhEQlSeESCyv1T8jXxTstAyWkfNA+dxU6kmNaW\nFr40b16mGieGhuh5//0qdVR95x05Qtu+faFl6xqenTNOlZz2/tTBs9iJFLNg+nTWLVuWqcb7H33U\n0OGZ09/PJbt2hZbVbptIkMIjElTX3TZpbKfzed45fjxTjUMnT1apm8aj8EhJvb/9LV974YV6t9Gw\nFB75VBty
53Q+H1pW4ZFPtft7e/lhLhda1tx97BnMFgE/AuYDQ8AP3P1eM2sFHgcWA3uBm9z9aLrM\nvcC1wHHgNnffXqSut8yaWXK9Q6dOM3TydORnEqkqd7dSE8a8kYSmI70/E9gDXApsAL6Vjq8F7knv\nXwv8b3r/CmBLibqum24T4VYyG+OFp8iL/mfAamA30FYQsF3p/fuBrxXMv2t4PoVHt4l4K5WFin7P\nY2ZLgA5gC0kg+kmqHwCGz+NYCBSe7/BeOiYyqZQdHjObCfwX8Dfu/iFJKovOWmSs1LwiE1ZZ4TGz\nKSTB+Q93fyod7jeztnT6fOBgOt4HXFiw+CJgf3XaFWkc5W55/g143d3/uWBsI3Bbev824KmC8T8H\nMLMrgSPDu3cik0k5h6pXAP8H7OTjD1HfBl4EniDZyrwL3OjuR9Jlvg90kRyqvt3dtxWpq105mRBK\nHaoeNzy1ovDIRFEqPDqrWiRI4REJUnhEghQekSCFRyRI4REJUnhEgur2ex6RiU5bHpEghUckqC7h\nMbMuM9ttZm+Y2dpgjUVm9qyZvW5mO83sr9PxVjPrMbM9Zva0mc0K1G4ys21mtjF9vMTMtqQ1H0vP\nMq+05iwze9LMdpnZr83siqy9mtk3zew1M9thZo+YWUukVzN70Mz6zWxHwVjJ3szsXjPrNbPtZtZR\nQc1/SH/+7Wb2EzM7v2Da+rTmLjNbU27Ngml/a2Z5M5tdSZ+ZVHoladYbSWBzJN99MBXYDlwaqFPR\n5eEV1v4m8J/AxvTx4yQnvgL8K3BnoOa/k5wkC8kXr8zK0iuwAHgLaCno8dZIr8AfkFzkuKNgLOtl\n9sVqrgaa0vv3AHen9z8PvJI+L0vS14eVUzMdXwR0A28DsyvpM9NruZZBKfGkXgn8ouDxOmBtFeqW\nujx8d4V1FgG/BFYWhOdQwX/6lUB3hTXPA94sMh7uNQ3PO0Br+qLbCPwRyXVVFfdK8ma2Y4zeKrrM\nvljNUdO+SnJ92CdeA8AvgCvKrQk8CfzuqPCU3Wf0Vo/dttGXafeR8TLtcS4Pn1thue8Bf0d69auZ\nzQE+cPfhL/fqI3nhVuJi4Ddm9lC6O/iAmc3I0qu77we+S3I5yHvAUWAbyfVTWXodNs9re5n9HcDP\ns9Y0sz8B9rn7zlGTav51APUIT1Uv067g8vByav0x0O/JV2UN92l8sudK1zEFWA78i7svJ7nOaV3G\nXi8Arid5J14AnEuyqzJatX8Xkfn/z8y+Awy4+2NZaprZdOA7wF3FJkdqVqIe4ekDLip4HL5Mu8LL\nw8uxAviKmb0FPAasAv4JmGVmw89VpN8+knfHl9PHPyEJU5ZeVwNvufthdx8Cfgp8EbggY6/DanKZ\nvZndClwH3FIwHK15CclnpFfN7O10uW1mNi9rn+WoR3heAtrNbLGZtQA3k+yvR4x3efitfHx5+Ljc\n/dvufpG7X5z29ay7/xnwHHBjpGZatx/YZ2afTYe+DPw6S68ku2tXmtk5ZmYFNaO9jt7CVuMy+xE1\nzawL+BbwFXcv/ONMG4Gb06OFS4F2kiuVx6zp7q+5+3x3v9jdl5IE5gvufrDCPmOq+QGqgg+7XSRH\nx3qBdcEaK0i+wXQ7yZGabWnd2cCmtP4vgQuC9b/ExwcMlgJbgTdIjmZNDdT7fZI3ju3Af5McbcvU\nK8nuyi5gB/AwydHLinsFHiV5Vz5FEsrbSQ5EFO0N+D7JEbFXgeUV1OwlOcixLb3dVzD/+rTmLmBN\nuTVHTX+L9IBBuX1muen0HJEgnWEgEqTwiAQpPCJBCo9IkMIjEqTwiAQpPCJBCo9I0P8DEdhXRvCY\nGIIAAAAASUVORK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x1092c2470>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Action space size: {}\".format(env.action_space.n))\n",
"print(env.get_action_meanings()) # env.unwrapped.get_action_meanings() for gym 0.8.0 or later\n",
"\n",
"observation = env.reset()\n",
"print(\"Observation space shape: {}\".format(observation.shape))\n",
"\n",
"plt.figure()\n",
"plt.imshow(env.render(mode='rgb_array'))\n",
"\n",
"[env.step(2) for x in range(1)]\n",
"plt.figure()\n",
"plt.imshow(env.render(mode='rgb_array'))\n",
"\n",
"env.render(close=True)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.image.AxesImage at 0x108de7748>"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAQQAAAD/CAYAAAAXKqhkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAENZJREFUeJzt3WusZWV9x/Hvb4ApCHgAlSE4IpdBxaYtJalQrdEq0gEj\n6Ataao1cjDFRq5GoXHxhfFMvifGS1lpSpGgV5VLLJEUcCdb0kvHSYbgOlykUZqAcRAYMKjDM/Pti\nL57Znp7DDGevffaZ8ftJTlj7Weus53kO5/z2us3+p6qQJIAlkx6ApMXDQJDUGAiSGgNBUmMgSGoM\nBEnN2AIhycoktye5M8l54+pHUn8yjucQkiwB7gTeCDwA/Bg4o6pu770zSb0Z1xHCq4C7qureqtoC\nfBM4bUx9SerJuALhxcDGodebujZJi9ieY9pvZmn7tXOTJD4zLU1IVc32Nzq2QNgEHDb0ejmDawm/\n5qSTTmLlypXPuqOq4qKLLuKOO+7od4TSbuSYY47h3e9+905te+655865blynDD8GViR5aZKlwBnA\nqjH1JaknYzlCqKqtSd4PrGYQOhdX1fpx9CWpP+M6ZaCqrgVe/mzbrFixYlzdS5qHiT6paCBIi4uP\nLktqDARJjYEgqTEQJDUGgqTGQJDUGAiSGgNBUmMgSGoMBEmNgSCpMRAkNQaCpMZAkNQYCJIaA0FS\nYyBIagwESc28AyHJ8iTXJ7ktyc1JPtC1H5hkdZI7knw3yVR/w5U0TqMcITwNnFtVrwT+EHhfklcA\n5wPXVdXLgeuBC0YfpqSFMO9AqKoHq2pdt/w4sJ5BQZbTgEu7zS4F3jrqICUtjF6uISQ5HDgWWAMs\nq6ppGIQG8KI++pA0fiPXZUiyH3Al8MGqevy51Gy89tpr2/KKFSv8WHZpDDZs2MCGDRt2atuRAiHJ\nngzC4GtVdXXXPJ1kWVVNJzkEeGiu799RXUdJo5v5Zrt69eo5tx31lOErwG1V9YWhtlXAWd3ymcDV\nM79J0uI07yOEJK8B/gK4OckNDMq9Xwh8Grg8yTnAfcDpowwwwHuOPhr222+U3Ui7t0MP7WU38w6E\nqvoPYI85Vp843/3O5pipKQ58+uk+dyntVjZPTdFHNWWfVJTUGAiSGgNBUmMgSGoMBEmNgSCpGfnR\n5QWx79PU009NehTSolX79nNbfpcIhJp6ilr6xKSHIS1atU8/b5ieMkhqDARJjYEgqTEQJDUGgqTG\nQJDU7BK3Hbel2Jptkx6GtGjVkp3+5MJntUsEwq/23sKSPDnpYUiL1i9/a0sv+9klAmFbim179JOA\n0u6odv6zjZ+V1xAkNQaCpGbkQEiyJMnaJKu614cnWdPVdrys+6h2SbuAPo4QPgjcNvT608Bnu9qO\njwLv6qEPSQtgpEBIshw4Bfj7oeY3AFd1y5cCbxulD0kLZ9QjhM8BH2FQk4EkLwA2V9UzDw1sAvr5\nwHhJYzdKoZY3A9NVtS7J659p7r6GzXk/ZKdqOwaeOGArxLoM0lyeqK1z/qUtVG3H1wCnJjkF2AfY\nH/g8MJVkSXeUsBx4YK4d7Gxtxy37F1nqk4rSXLY8tQ1+Pvu6BantWFUXVtVhVXUkcAZwfVW9A/g+\n28u3WdtR2oWM4zmE84Fzk9wJHARcPIY+JI1BL88IVNUPgB90y/cAx/exX0kLyycVJTUGgqTGQJDU\nLPp/Z1DAI7WUbbX3pIciLVpLamkvf8yLPhAAbqwpflb9fACEtDt6YR3AcT3sZ5cIhIGZD0BK6pvX\nECQ1BoKkxkCQ1BgIkhoDQVJjIEhqFv9txwpbb34DWx5fOumRSIvW1v2fhMPn+ECE52DxBwJQjxxK\n/ezASQ9DWrS2bdncSyB4yiCpMRAkNQaCpMZAkNQYCJKaUSs3TSW5Isn6JLcmOT7JgUlWd7Udv5tk\nqq/BShqvUW87fgG4pqpO74q67gtcCFxXVZ9J
ch5wAYNPYp6n4uGH/o2Nm54YcajS7mvJtr2BF468\nn1EqN+0PvLaqzgKoqqeBx5KcBryu2+xS4F8ZKRBg031Xcdcdd4yyC2m3tifHAO8eeT+jnDIcCTyc\n5JKuHPxFSZ4HLKuqaYCqehB40cijlLQgRjll2BM4DnhfVf0kyecYHAnMWctxpp2q7ShpJAtV23ET\nsLGqftK9vopBIEwnWVZV00kOAR6aawc7W9tR0vwtVG3HaWBjkpd1TW8EbgVWAWd1bdZ2lHYho95l\n+ADw9SR7AXcDZwN7AJcnOQe4j+2FXyUtciMFQlXdCPzBLKtOHGW/kibDJxUlNQaCpMZAkNQYCJIa\nA0FSYyBIagwESY2BIKkxECQ1BoKkxkCQ1BgIkhoDQVJjIEhqDARJjYEgqTEQJDUGgqTGQJDUjFrb\n8UNJbklyU5KvJ1ma5PAka7rajpd1Jd4k7QLmHQhJDgX+Ejiuqn6XwQe2/jnwaeCzVfVy4FHgXX0M\nVNL4jXrKsAewb3cUsA/wAPDHDIq2wKC249tG7EPSAhmlUMsDwGcZ1F64H3gMWAs8WlXbus02AYeO\nOkhJC2OU6s8HAKcBL2UQBlcAJ8+y6Zy1Hq3tKI3fQtV2PBG4u6oeAUjybeDVwAFJlnRHCcsZnEbM\nytqO0vgtSG1HBqcKJyTZO0nYXtvx+2wv32ZtR2kXMso1hB8BVwI3ADcCAS5iUAH63CR3AgcBF/cw\nTkkLYNTajp8APjGj+R7g+FH2K2kyfFJRUmMgSGoMBEmNgSCpMRAkNQaCpMZAkNQYCJIaA0FSYyBI\nagwESY2BIKkxECQ1BoKkxkCQ1BgIkhoDQVJjIEhqDARJzQ4DIcnFSaaT3DTUdmCS1V39xu8mmRpa\n98UkdyVZl+TYcQ1cUv925gjhEuBPZrSdD1zX1W+8HrgAIMnJwFFVdTTwHuDLPY5V0pjtMBCq6t+B\nzTOaT2NQt5Huv6cNtX+1+74fAlNJlvUzVEnjNt9rCAdX1TRAVT0IHNy1vxjYOLTd/V2bpF3ASHUZ\nZpFZ2qztKE3QQtR2nE6yrKqmkxwCPNS1bwJeMrSdtR2lCRtHbcfw6+/+q4CzuuWz2F6/cRXwToAk\nJzAoDT+9k31ImrAdHiEk+QbweuAFSe4DPg58CrgiyTkMir6eDlBV1yQ5JckG4BfA2eMauKT+7TAQ\nqurtc6w6cY7t3z/SiCRNjE8qSmoMBEmNgSCpMRAkNQaCpMZAkNQYCJIaA0FSYyBIagwESY2BIKkx\nECQ1BoKkxkCQ1BgIkhoDQVJjIEhqDARJjYEgqZlvbcfPJFnf1W+8Ksnzh9Zd0NV2XJ/kpHENXFL/\n5lvbcTXw21V1LHAX22s7vhL4U+AY4GTgS0lmK94iaRGaV23HqrquqrZ1L9cwKMgCcCrwzap6uqr+\nh0FYvKq/4Uoapz6uIZwDXNMtW9tR2oWNVNsxyceALVV12TNNs2xmbUdpghaitiNJzgROAd4w1Gxt\nR2mRGXttxyQrgY8Cp1bVk0PbrQLOSLI0yRHACuBHOz90SZM039qOFwJLge91NxHWVNV7q+q2JJcD\ntwFbgPdW1ZynDJIWl/nWdrzkWbb/JPDJUQYlaTJ8UlFSYyBIagwESY2BIKkxECQ1BoKkxkCQ1BgI\nkhoDQVJjIEhqDARJjYEgqTEQJDUGgqTGQJDUGAiSGgNBUmMgSGrmVcptaN2Hk2xLctBQ2xe7Um7r\nkhzb94Aljc98S7mRZDlwInDvUNvJwFFVdTTwHuDLPY1T0gKYVym3zueAj8xoOw34avd9PwSmkiwb\ndZCSFsa8riEkeQuwsapunrHKUm7SLuw5V25Ksg/wMeBNs62epc26DNIuYj6l3I4CDgdu7Eq9LwfW\nJnkVz7GUm7UdpfEbR23HVsqtqm4BDmkrknuA46pqc5JVwPuAbyU5AXi0qqbn2qm1HaXx67W2Y1fK\n7T+BlyW5
L8nZMzYptofFNcA9STYAfwe89zmPXtLEzLeU2/D6I2e8fv+og5I0GT6pKKkxECQ1BoKk\nxkCQ1BgIkhoDQVJjIEhqDARJjYEgqTEQJDUGgqTGQJDUGAiSGgNBUmMgSGrm8xFqvfnFkm073Kaq\n2DrbJzXqN8J+e+7JAUuX9r7fp7dt46dPPsnW2j0+8nPJ1q3s/ctfjryfiQbCj/b91Q63KeCxPXYc\nHNo9vfbggzn7qKN63+///upX/NUttzD9xBO973sS9nvsMY5Zu3bk/Uz2CGGPHafz4Ahh90hxPXf7\n77UXh+27b+/7DbDXkt3njHnJ1q3s08MRwu7zE5E0sokGwubb75lk99qF/NfPfjbpIfxGMBC0S1j7\nyCOTHsJvhIleQ5B25K6f/5zL772XWx97jMvvvXfH37CTHn3qKR7fsqW3/e0uDAQtajds3swNmwe1\nhtc8/PCER7P7S03oPmzirQNpUqpq1qd7JhYIkhYfbztKagwESY2BIKmZSCAkWZnk9iR3JjlvDPtf\nnuT6JLcluTnJB7r2A5OsTnJHku8mmeq53yVJ1iZZ1b0+PMmarr/LkvR2VyfJVJIrkqxPcmuS48c5\nvyQfSnJLkpuSfD3J0j7nl+TiJNNJbhpqm3M+Sb6Y5K4k65Ic21N/n+l+nuuSXJXk+UPrLuj6W5/k\npD76G1r34STbkhzU1/zmraoW9ItBCG0AXgrsBawDXtFzH4cAx3bL+wF3AK8APg18tGs/D/hUz/1+\nCPhHYFX3+lvA6d3y3wLv6bGvfwDO7pb3BKbGNT/gUOBuYOnQvM7sc37AHwHHAjcNtc06H+Bk4F+6\n5eOBNT31dyKwpFv+FPDJbvmVwA3dz/nw7vc3o/bXtS8HrgXuAQ7qa37z/v+wUB0N/QBOAL4z9Pp8\n4Lwx9/nP3f/s24FlXdshwO099rEc+B7w+qFA+OnQL9gJwLU99bU/8N+ztI9lfl0g3Asc2P1RrALe\nBDzU5/wYvEkM/4HOnM/6bvnLwJ8Nbbf+me1G6W/GurcCX5vtdxT4DnB8H/0BVwC/MyMQepnffL4m\nccrwYmDj0OtNXdtYJDmcQTKvYfBDnQaoqgeBF/XY1eeAjzD4F9skeQGwuaqe+bfbmxj8YfXhSODh\nJJd0pygXJXkeY5pfVT0AfBa4D7gfeAxYCzw6pvk94+AZ8zm4a5/5O3Q//f8OnQNcM87+krwF2FhV\nN89YtRDzm9UkAmG2ByLG8jBEkv2AK4EPVtXjY+znzcB0Va1j+/zC/59rX/3vCRwH/E1VHQf8gsG7\n2LjmdwBwGoN3uEOBfRkc1s60UA+1jPV3KMnHgC1Vddm4+kuyD/Ax4OOzre67v501iUDYBBw29Ho5\n8EDfnXQXuK5kcNh3ddc8nWRZt/4QBoe8fXgNcGqSu4HLgDcAnwemkjzzM+5znpsYvLP8pHt9FYOA\nGNf8TgTurqpHqmor8G3g1cABY5rfM+aazybgJUPb9dZ3kjOBU4C3DzWPo7+jGFyPuDHJPd0+1yY5\neEz97ZRJBMKPgRVJXppkKXAGg3PSvn0FuK2qvjDUtgo4q1s+E7h65jfNR1VdWFWHVdWRDOZzfVW9\nA/g+cPoY+psGNiZ5Wdf0RuBWxjQ/BqcKJyTZO0mG+ut7fjOPqobnc9bQ/lcB7wRIcgKDU5fpUftL\nshL4KHBqVT05YxxndHdWjgBWAD8apb+quqWqDqmqI6vqCAYh8PtV9RD9ze+5W4gLFbNcXFnJ4Mr/\nXcD5Y9j/a4CtDO5g3MDgfHclcBBwXdf394ADxtD369h+UfEI4IfAnQyuyO/VYz+/xyBc1wH/xOAu\nw9jmx+DQdj1wE3ApgztEvc0P+AaDd8EnGQTQ2QwuYs46H+CvGVztvxE4rqf+7mJw8XRt9/Wloe0v\n6PpbD5zUR38z1t9Nd1Gxj/nN98t/yyCp8UlFSY2BIKkxECQ1BoKkxkCQ1B
gIkhoDQVLzf3eqTb2L\nl4WOAAAAAElFTkSuQmCC\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x10c069f28>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Check out what a cropped image looks like\n",
"plt.imshow(observation[34:-16,:,:])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
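
Note on the cropped view above: the last playground cell displays `observation[34:-16,:,:]`, the 160x160 playing field. The following is a minimal NumPy-only sketch of how such a frame could be reduced to an 84x84 grayscale image. It is an illustration under stated assumptions, not the repository's implementation: the function name `preprocess_frame`, the BT.601 luma weights, and the index-based nearest-neighbour downsampling are all choices made for this sketch.

```python
import numpy as np


def preprocess_frame(frame):
    """Crop, grayscale and downsample one 210x160x3 uint8 Atari frame to 84x84.

    Hypothetical helper for illustration only.
    """
    # Keep rows 34..193, the same region as observation[34:-16, :, :].
    cropped = frame[34:194, :, :]
    # Luma-weighted grayscale (ITU-R BT.601 weights, an assumption of this sketch).
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = cropped.astype(np.float32) @ weights
    # Nearest-neighbour downsampling: pick 84 evenly spaced rows and columns.
    idx = (np.arange(84) * 160) // 84
    small = gray[idx][:, idx]
    return small.astype(np.uint8)


def stack_frames(frame, n=4):
    """Repeat one processed frame n times along the last axis (initial state)."""
    return np.stack([frame] * n, axis=2)
```

For example, `stack_frames(preprocess_frame(observation))` would give an array of shape `(84, 84, 4)`, assuming `observation` is a raw `(210, 160, 3)` frame.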
================================================
FILE: DQN/Deep Q Learning Solution.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import gym\n",
"from gym.wrappers import Monitor\n",
"import itertools\n",
"import numpy as np\n",
"import os\n",
"import random\n",
"import sys\n",
"import psutil\n",
"import tensorflow as tf\n",
"\n",
"if \"../\" not in sys.path:\n",
" sys.path.append(\"../\")\n",
"\n",
"from lib import plotting\n",
"from collections import deque, namedtuple"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"env = gym.envs.make(\"Breakout-v0\")"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Atari Actions: 0 (noop), 1 (fire), 2 (right) and 3 (left) are valid actions\n",
"VALID_ACTIONS = [0, 1, 2, 3]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class StateProcessor():\n",
" \"\"\"\n",
" Processes a raw Atari images. Resizes it and converts it to grayscale.\n",
" \"\"\"\n",
" def __init__(self):\n",
" # Build the Tensorflow graph\n",
" with tf.variable_scope(\"state_processor\"):\n",
" self.input_state = tf.placeholder(shape=[210, 160, 3], dtype=tf.uint8)\n",
" self.output = tf.image.rgb_to_grayscale(self.input_state)\n",
" self.output = tf.image.crop_to_bounding_box(self.output, 34, 0, 160, 160)\n",
" self.output = tf.image.resize_images(\n",
" self.output, [84, 84], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)\n",
" self.output = tf.squeeze(self.output)\n",
"\n",
" def process(self, sess, state):\n",
" \"\"\"\n",
" Args:\n",
" sess: A Tensorflow session object\n",
" state: A [210, 160, 3] Atari RGB State\n",
"\n",
" Returns:\n",
" A processed [84, 84] state representing grayscale values.\n",
" \"\"\"\n",
" return sess.run(self.output, { self.input_state: state })"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class Estimator():\n",
    "    \"\"\"Q-Value Estimator neural network.\n",
    "\n",
    "    This network is used for both the Q-Network and the Target Network.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, scope=\"estimator\", summaries_dir=None):\n",
    "        self.scope = scope\n",
    "        # Writes Tensorboard summaries to disk\n",
    "        self.summary_writer = None\n",
    "        with tf.variable_scope(scope):\n",
    "            # Build the graph\n",
    "            self._build_model()\n",
    "            if summaries_dir:\n",
    "                summary_dir = os.path.join(summaries_dir, \"summaries_{}\".format(scope))\n",
    "                if not os.path.exists(summary_dir):\n",
    "                    os.makedirs(summary_dir)\n",
    "                self.summary_writer = tf.summary.FileWriter(summary_dir)\n",
    "\n",
    "    def _build_model(self):\n",
    "        \"\"\"\n",
    "        Builds the Tensorflow graph.\n",
    "        \"\"\"\n",
    "\n",
    "        # Placeholders for our input\n",
    "        # Our input is a stack of 4 grayscale frames of shape 84x84 each\n",
    "        self.X_pl = tf.placeholder(shape=[None, 84, 84, 4], dtype=tf.uint8, name=\"X\")\n",
    "        # The TD target value\n",
    "        self.y_pl = tf.placeholder(shape=[None], dtype=tf.float32, name=\"y\")\n",
    "        # Integer id of which action was selected\n",
    "        self.actions_pl = tf.placeholder(shape=[None], dtype=tf.int32, name=\"actions\")\n",
    "\n",
    "        X = tf.to_float(self.X_pl) / 255.0\n",
    "        batch_size = tf.shape(self.X_pl)[0]\n",
    "\n",
    "        # Three convolutional layers\n",
    "        conv1 = tf.contrib.layers.conv2d(\n",
    "            X, 32, 8, 4, activation_fn=tf.nn.relu)\n",
    "        conv2 = tf.contrib.layers.conv2d(\n",
    "            conv1, 64, 4, 2, activation_fn=tf.nn.relu)\n",
    "        conv3 = tf.contrib.layers.conv2d(\n",
    "            conv2, 64, 3, 1, activation_fn=tf.nn.relu)\n",
    "\n",
    "        # Fully connected layers\n",
    "        flattened = tf.contrib.layers.flatten(conv3)\n",
    "        fc1 = tf.contrib.layers.fully_connected(flattened, 512)\n",
    "        self.predictions = tf.contrib.layers.fully_connected(fc1, len(VALID_ACTIONS))\n",
    "\n",
    "        # Get the predictions for the chosen actions only\n",
    "        gather_indices = tf.range(batch_size) * tf.shape(self.predictions)[1] + self.actions_pl\n",
    "        self.action_predictions = tf.gather(tf.reshape(self.predictions, [-1]), gather_indices)\n",
    "\n",
    "        # Calculate the loss\n",
    "        self.losses = tf.squared_difference(self.y_pl, self.action_predictions)\n",
    "        self.loss = tf.reduce_mean(self.losses)\n",
    "\n",
    "        # Optimizer Parameters from original paper\n",
    "        self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6)\n",
    "        self.train_op = self.optimizer.minimize(self.loss, global_step=tf.contrib.framework.get_global_step())\n",
    "\n",
    "        # Summaries for Tensorboard\n",
    "        self.summaries = tf.summary.merge([\n",
    "            tf.summary.scalar(\"loss\", self.loss),\n",
    "            tf.summary.histogram(\"loss_hist\", self.losses),\n",
    "            tf.summary.histogram(\"q_values_hist\", self.predictions),\n",
    "            tf.summary.scalar(\"max_q_value\", tf.reduce_max(self.predictions))\n",
    "        ])\n",
    "\n",
    "    def predict(self, sess, s):\n",
    "        \"\"\"\n",
    "        Predicts action values.\n",
    "\n",
    "        Args:\n",
    "            sess: Tensorflow session\n",
    "            s: State input of shape [batch_size, 84, 84, 4]\n",
    "\n",
    "        Returns:\n",
    "            Tensor of shape [batch_size, NUM_VALID_ACTIONS] containing the estimated\n",
    "            action values.\n",
    "        \"\"\"\n",
    "        return sess.run(self.predictions, { self.X_pl: s })\n",
    "\n",
    "    def update(self, sess, s, a, y):\n",
    "        \"\"\"\n",
    "        Updates the estimator towards the given targets.\n",
    "\n",
    "        Args:\n",
    "            sess: Tensorflow session object\n",
    "            s: State input of shape [batch_size, 84, 84, 4]\n",
    "            a: Chosen actions of shape [batch_size]\n",
    "            y: Targets of shape [batch_size]\n",
    "\n",
    "        Returns:\n",
    "            The calculated loss on the batch.\n",
    "        \"\"\"\n",
    "        feed_dict = { self.X_pl: s, self.y_pl: y, self.actions_pl: a }\n",
    "        summaries, global_step, _, loss = sess.run(\n",
    "            [self.summaries, tf.contrib.framework.get_global_step(), self.train_op, self.loss],\n",
    "            feed_dict)\n",
    "        if self.summary_writer:\n",
    "            self.summary_writer.add_summary(summaries, global_step)\n",
    "        return loss"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# For Testing....\n",
    "\n",
    "tf.reset_default_graph()\n",
    "global_step = tf.Variable(0, name=\"global_step\", trainable=False)\n",
    "\n",
    "e = Estimator(scope=\"test\")\n",
    "sp = StateProcessor()\n",
    "\n",
    "with tf.Session() as sess:\n",
    "    sess.run(tf.global_variables_initializer())\n",
    "\n",
    "    # Example observation batch\n",
    "    observation = env.reset()\n",
    "\n",
    "    observation_p = sp.process(sess, observation)\n",
    "    observation = np.stack([observation_p] * 4, axis=2)\n",
    "    observations = np.array([observation] * 2)\n",
    "\n",
    "    # Test Prediction\n",
    "    print(e.predict(sess, observations))\n",
    "\n",
    "    # Test training step\n",
    "    y = np.array([10.0, 10.0])\n",
    "    a = np.array([1, 3])\n",
    "    print(e.update(sess, observations, a, y))"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class ModelParametersCopier():\n",
    "    \"\"\"\n",
    "    Copies the model parameters of one estimator to another.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, estimator1, estimator2):\n",
    "        \"\"\"\n",
    "        Defines the graph operations that perform the copy.\n",
    "        Args:\n",
    "            estimator1: Estimator to copy the parameters from\n",
    "            estimator2: Estimator to copy the parameters to\n",
    "        \"\"\"\n",
    "        e1_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator1.scope)]\n",
    "        e1_params = sorted(e1_params, key=lambda v: v.name)\n",
    "        e2_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator2.scope)]\n",
    "        e2_params = sorted(e2_params, key=lambda v: v.name)\n",
    "\n",
    "        self.update_ops = []\n",
    "        for e1_v, e2_v in zip(e1_params, e2_params):\n",
    "            op = e2_v.assign(e1_v)\n",
    "            self.update_ops.append(op)\n",
    "\n",
    "    def make(self, sess):\n",
    "        \"\"\"\n",
    "        Runs the copy operations.\n",
    "        Args:\n",
    "            sess: Tensorflow session instance\n",
    "        \"\"\"\n",
    "        sess.run(self.update_ops)"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def make_epsilon_greedy_policy(estimator, nA):\n",
    "    \"\"\"\n",
    "    Creates an epsilon-greedy policy based on a given Q-function approximator and epsilon.\n",
    "\n",
    "    Args:\n",
    "        estimator: An estimator that returns q values for a given state\n",
    "        nA: Number of actions in the environment.\n",
    "\n",
    "    Returns:\n",
    "        A function that takes the (sess, observation, epsilon) as an argument and returns\n",
    "        the probabilities for each action in the form of a numpy array of length nA.\n",
    "\n",
    "    \"\"\"\n",
    "    def policy_fn(sess, observation, epsilon):\n",
    "        A = np.ones(nA, dtype=float) * epsilon / nA\n",
    "        q_values = estimator.predict(sess, np.expand_dims(observation, 0))[0]\n",
    "        best_action = np.argmax(q_values)\n",
    "        A[best_action] += (1.0 - epsilon)\n",
    "        return A\n",
    "    return policy_fn"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def deep_q_learning(sess,\n",
    "                    env,\n",
    "                    q_estimator,\n",
    "                    target_estimator,\n",
    "                    state_processor,\n",
    "                    num_episodes,\n",
    "                    experiment_dir,\n",
    "                    replay_memory_size=500000,\n",
    "                    replay_memory_init_size=50000,\n",
    "                    update_target_estimator_every=10000,\n",
    "                    discount_factor=0.99,\n",
    "                    epsilon_start=1.0,\n",
    "                    epsilon_end=0.1,\n",
    "                    epsilon_decay_steps=500000,\n",
    "                    batch_size=32,\n",
    "                    record_video_every=50):\n",
    "    \"\"\"\n",
    "    Q-Learning algorithm for off-policy TD control using Function Approximation.\n",
    "    Finds the optimal greedy policy while following an epsilon-greedy policy.\n",
    "\n",
    "    Args:\n",
    "        sess: Tensorflow Session object\n",
    "        env: OpenAI environment\n",
    "        q_estimator: Estimator object used for the q values\n",
    "        target_estimator: Estimator object used for the targets\n",
    "        state_processor: A StateProcessor object\n",
    "        num_episodes: Number of episodes to run for\n",
    "        experiment_dir: Directory to save Tensorflow summaries in\n",
    "        replay_memory_size: Size of the replay memory\n",
    "        replay_memory_init_size: Number of random experiences to sample when initializing\n",
    "          the replay memory.\n",
    "        update_target_estimator_every: Copy parameters from the Q estimator to the\n",
    "          target estimator every N steps\n",
    "        discount_factor: Gamma discount factor\n",
    "        epsilon_start: Chance to sample a random action when taking an action.\n",
    "          Epsilon is decayed over time and this is the start value\n",
    "        epsilon_end: The final minimum value of epsilon after decaying is done\n",
    "        epsilon_decay_steps: Number of steps to decay epsilon over\n",
    "        batch_size: Size of batches to sample from the replay memory\n",
    "        record_video_every: Record a video every N episodes\n",
    "\n",
    "    Returns:\n",
    "        An EpisodeStats object with two numpy arrays for episode_lengths and episode_rewards.\n",
    "    \"\"\"\n",
    "\n",
    "    Transition = namedtuple(\"Transition\", [\"state\", \"action\", \"reward\", \"next_state\", \"done\"])\n",
    "\n",
    "    # The replay memory\n",
    "    replay_memory = []\n",
    "\n",
    "    # Make model copier object\n",
    "    estimator_copy = ModelParametersCopier(q_estimator, target_estimator)\n",
    "\n",
    "    # Keeps track of useful statistics\n",
    "    stats = plotting.EpisodeStats(\n",
    "        episode_lengths=np.zeros(num_episodes),\n",
    "        episode_rewards=np.zeros(num_episodes))\n",
    "\n",
    "    # For 'system/' summaries, useful to check if the current process looks healthy\n",
    "    current_process = psutil.Process()\n",
    "\n",
    "    # Create directories for checkpoints and summaries\n",
    "    checkpoint_dir = os.path.join(experiment_dir, \"checkpoints\")\n",
    "    checkpoint_path = os.path.join(checkpoint_dir, \"model\")\n",
    "    monitor_path = os.path.join(experiment_dir, \"monitor\")\n",
    "\n",
    "    if not os.path.exists(checkpoint_dir):\n",
    "        os.makedirs(checkpoint_dir)\n",
    "    if not os.path.exists(monitor_path):\n",
    "        os.makedirs(monitor_path)\n",
    "\n",
    "    saver = tf.train.Saver()\n",
    "    # Load a previous checkpoint if we find one\n",
    "    latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)\n",
    "    if latest_checkpoint:\n",
    "        print(\"Loading model checkpoint {}...\\n\".format(latest_checkpoint))\n",
    "        saver.restore(sess, latest_checkpoint)\n",
    "\n",
    "    # Get the current time step\n",
    "    total_t = sess.run(tf.contrib.framework.get_global_step())\n",
    "\n",
    "    # The epsilon decay schedule\n",
    "    epsilons = np.linspace(epsilon_start, epsilon_end, epsilon_decay_steps)\n",
    "\n",
    "    # The policy we're following\n",
    "    policy = make_epsilon_greedy_policy(\n",
    "        q_estimator,\n",
    "        len(VALID_ACTIONS))\n",
    "\n",
    "    # Populate the replay memory with initial experience\n",
    "    print(\"Populating replay memory...\")\n",
    "    state = env.reset()\n",
    "    state = state_processor.process(sess, state)\n",
    "    state = np.stack([state] * 4, axis=2)\n",
    "    for i in range(replay_memory_init_size):\n",
    "        action_probs = policy(sess, state, epsilons[min(total_t, epsilon_decay_steps-1)])\n",
    "        action = np.random.choice(np.arange(len(action_probs)), p=action_probs)\n",
    "        next_state, reward, done, _ = env.step(VALID_ACTIONS[action])\n",
    "        next_state = state_processor.process(sess, next_state)\n",
    "        next_state = np.append(state[:,:,1:], np.expand_dims(next_state, 2), axis=2)\n",
    "        replay_memory.append(Transition(state, action, reward, next_state, done))\n",
    "        if done:\n",
    "            state = env.reset()\n",
    "            state = state_processor.process(sess, state)\n",
    "            state = np.stack([state] * 4, axis=2)\n",
    "        else:\n",
    "            state = next_state\n",
    "\n",
    "\n",
    "    # Record videos\n",
    "    # Add env Monitor wrapper\n",
    "    env = Monitor(env, directory=monitor_path, video_callable=lambda count: count % record_video_every == 0, resume=True)\n",
    "\n",
    "    for i_episode in range(num_episodes):\n",
    "\n",
    "        # Save the current checkpoint\n",
    "        saver.save(tf.get_default_session(), checkpoint_path)\n",
    "\n",
    "        # Reset the environment\n",
    "        state = env.reset()\n",
    "        state = state_processor.process(sess, state)\n",
    "        state = np.stack([state] * 4, axis=2)\n",
    "        loss = None\n",
    "\n",
    "        # One step in the environment\n",
    "        for t in itertools.count():\n",
    "\n",
    "            # Epsilon for this time step\n",
    "            epsilon = epsilons[min(total_t, epsilon_decay_steps-1)]\n",
    "\n",
    "            # Maybe update the target estimator\n",
    "            if total_t % update_target_estimator_every == 0:\n",
    "                estimator_copy.make(sess)\n",
    "                print(\"\\nCopied model parameters to target network.\")\n",
    "\n",
    "            # Print out which step we're on, useful for debugging.\n",
    "            print(\"\\rStep {} ({}) @ Episode {}/{}, loss: {}\".format(\n",
    "                    t, total_t, i_episode + 1, num_episodes, loss), end=\"\")\n",
    "            sys.stdout.flush()\n",
    "\n",
    "            # Take a step\n",
    "            action_probs = policy(sess, state, epsilon)\n",
    "            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)\n",
    "            next_state, reward, done, _ = env.step(VALID_ACTIONS[action])\n",
    "            next_state = state_processor.process(sess, next_state)\n",
    "            next_state = np.append(state[:,:,1:], np.expand_dims(next_state, 2), axis=2)\n",
    "\n",
    "            # If our replay memory is full, pop the first element\n",
    "            if len(replay_memory) == replay_memory_size:\n",
    "                replay_memory.pop(0)\n",
    "\n",
    "            # Save transition to replay memory\n",
    "            replay_memory.append(Transition(state, action, reward, next_state, done))\n",
    "\n",
    "            # Update statistics\n",
    "            stats.episode_rewards[i_episode] += reward\n",
    "            stats.episode_lengths[i_episode] = t\n",
    "\n",
    "            # Sample a minibatch from the replay memory\n",
    "            samples = random.sample(replay_memory, batch_size)\n",
    "            states_batch, action_batch, reward_batch, next_states_batch, done_batch = map(np.array, zip(*samples))\n",
    "\n",
    "            # Calculate q values and targets\n",
    "            q_values_next = target_estimator.predict(sess, next_states_batch)\n",
    "            targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * discount_factor * np.amax(q_values_next, axis=1)\n",
    "\n",
    "            # Perform gradient descent update\n",
    "            states_batch = np.array(states_batch)\n",
    "            loss = q_estimator.update(sess, states_batch, action_batch, targets_batch)\n",
    "\n",
    "            if done:\n",
    "                break\n",
    "\n",
    "            state = next_state\n",
    "            total_t += 1\n",
    "\n",
    "        # Add summaries to tensorboard\n",
    "        episode_summary = tf.Summary()\n",
    "        episode_summary.value.add(simple_value=epsilon, tag=\"episode/epsilon\")\n",
    "        episode_summary.value.add(simple_value=stats.episode_rewards[i_episode], tag=\"episode/reward\")\n",
    "        episode_summary.value.add(simple_value=stats.episode_lengths[i_episode], tag=\"episode/length\")\n",
    "        episode_summary.value.add(simple_value=current_process.cpu_percent(), tag=\"system/cpu_usage_percent\")\n",
    "        episode_summary.value.add(simple_value=current_process.memory_percent(memtype=\"vms\"), tag=\"system/v_memory_usage_percent\")\n",
    "        q_estimator.summary_writer.add_summary(episode_summary, i_episode)\n",
    "        q_estimator.summary_writer.flush()\n",
    "\n",
    "        yield total_t, plotting.EpisodeStats(\n",
    "            episode_lengths=stats.episode_lengths[:i_episode+1],\n",
    "            episode_rewards=stats.episode_rewards[:i_episode+1])\n",
    "\n",
    "    return stats"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tf.reset_default_graph()\n",
    "\n",
    "# Where we save our checkpoints and graphs\n",
    "experiment_dir = os.path.abspath(\"./experiments/{}\".format(env.spec.id))\n",
    "\n",
    "# Create a global step variable\n",
    "global_step = tf.Variable(0, name='global_step', trainable=False)\n",
    "\n",
    "# Create estimators\n",
    "q_estimator = Estimator(scope=\"q_estimator\", summaries_dir=experiment_dir)\n",
    "target_estimator = Estimator(scope=\"target_q\")\n",
    "\n",
    "# State processor\n",
    "state_processor = StateProcessor()\n",
    "\n",
    "# Run it!\n",
    "with tf.Session() as sess:\n",
    "    sess.run(tf.global_variables_initializer())\n",
    "    for t, stats in deep_q_learning(sess,\n",
    "                                    env,\n",
    "                                    q_estimator=q_estimator,\n",
    "                                    target_estimator=target_estimator,\n",
    "                                    state_processor=state_processor,\n",
    "                                    experiment_dir=experiment_dir,\n",
    "                                    num_episodes=10000,\n",
    "                                    replay_memory_size=500000,\n",
    "                                    replay_memory_init_size=50000,\n",
    "                                    update_target_estimator_every=10000,\n",
    "                                    epsilon_start=1.0,\n",
    "                                    epsilon_end=0.1,\n",
    "                                    epsilon_decay_steps=500000,\n",
    "                                    discount_factor=0.99,\n",
    "                                    batch_size=32):\n",
    "\n",
    "        print(\"\\nEpisode Reward: {}\".format(stats.episode_rewards[-1]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
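
Note on the update rule used in the notebook above: the training loop builds epsilon-greedy action probabilities and one-step TD targets of the form `reward + (1 - done) * discount_factor * max_a Q_target(next_state, a)`. The following is a minimal NumPy-only sketch of those two pieces, isolated from TensorFlow so they can be checked by hand. The function names `epsilon_greedy_probs` and `td_targets` are hypothetical and exist only for this illustration.

```python
import numpy as np


def epsilon_greedy_probs(q_values, epsilon):
    """Return action probabilities: epsilon spread uniformly, the rest on the greedy action."""
    n_actions = len(q_values)
    probs = np.ones(n_actions, dtype=float) * epsilon / n_actions
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs


def td_targets(rewards, dones, q_values_next, discount_factor=0.99):
    """One-step Q-learning targets; terminal transitions keep only the reward."""
    rewards = np.asarray(rewards, dtype=np.float32)
    # 1.0 for non-terminal transitions, 0.0 for terminal ones.
    not_done = 1.0 - np.asarray(dones, dtype=np.float32)
    return rewards + not_done * discount_factor * np.max(q_values_next, axis=1)


if __name__ == "__main__":
    print(epsilon_greedy_probs(np.array([0.1, 0.5, 0.2, 0.0]), epsilon=0.1))
    print(td_targets([1.0, 0.0], [False, True], np.array([[1.0, 2.0], [3.0, 4.0]]), 0.5))
```

With `epsilon=0.1` and four actions, each non-greedy action receives probability 0.025 and the greedy one 0.925. In the second call the terminal transition contributes only its reward, so its bootstrap term is dropped.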
================================================
FILE: DQN/Deep Q Learning.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import gym\n",
"from gym.wrappers import Monitor\n",
"import itertools\n",
"import numpy as np\n",
"import os\n",
"import random\n",
"import sys\n",
"import tensorflow as tf\n",
"\n",
"if \"../\" not in sys.path:\n",
" sys.path.append(\"../\")\n",
"\n",
"from lib import plotting\n",
"from collections import deque, namedtuple"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env = gym.envs.make(\"Breakout-v0\")"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Atari Actions: 0 (noop), 1 (fire), 2 (right) and 3 (left) are valid actions\n",
"VALID_ACTIONS = [0, 1, 2, 3]"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class StateProcessor():\n",
    "    \"\"\"\n",
    "    Processes a raw Atari image: crops it, resizes it and converts it to grayscale.\n",
    "    \"\"\"\n",
    "    def __init__(self):\n",
    "        # Build the Tensorflow graph\n",
    "        with tf.variable_scope(\"state_processor\"):\n",
    "            self.input_state = tf.placeholder(shape=[210, 160, 3], dtype=tf.uint8)\n",
    "            self.output = tf.image.rgb_to_grayscale(self.input_state)\n",
    "            self.output = tf.image.crop_to_bounding_box(self.output, 34, 0, 160, 160)\n",
    "            self.output = tf.image.resize_images(\n",
    "                self.output, [84, 84], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)\n",
    "            self.output = tf.squeeze(self.output)\n",
    "\n",
    "    def process(self, sess, state):\n",
    "        \"\"\"\n",
    "        Args:\n",
    "            sess: A Tensorflow session object\n",
    "            state: A [210, 160, 3] Atari RGB State\n",
    "\n",
    "        Returns:\n",
    "            A processed [84, 84] state representing grayscale values.\n",
    "        \"\"\"\n",
    "        return sess.run(self.output, { self.input_state: state })"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Estimator():\n",
    "    \"\"\"Q-Value Estimator neural network.\n",
    "\n",
    "    This network is used for both the Q-Network and the Target Network.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, scope=\"estimator\", summaries_dir=None):\n",
    "        self.scope = scope\n",
    "        # Writes Tensorboard summaries to disk\n",
    "        self.summary_writer = None\n",
    "        with tf.variable_scope(scope):\n",
    "            # Build the graph\n",
    "            self._build_model()\n",
    "            if summaries_dir:\n",
    "                summary_dir = os.path.join(summaries_dir, \"summaries_{}\".format(scope))\n",
    "                if not os.path.exists(summary_dir):\n",
    "                    os.makedirs(summary_dir)\n",
    "                self.summary_writer = tf.summary.FileWriter(summary_dir)\n",
    "\n",
    "    def _build_model(self):\n",
    "        \"\"\"\n",
    "        Builds the Tensorflow graph.\n",
    "        \"\"\"\n",
    "\n",
    "        # Placeholders for our input\n",
    "        # Our input is a stack of 4 grayscale frames of shape 84x84 each\n",
    "        self.X_pl = tf.placeholder(shape=[None, 84, 84, 4], dtype=tf.uint8, name=\"X\")\n",
    "        # The TD target value\n",
    "        self.y_pl = tf.placeholder(shape=[None], dtype=tf.float32, name=\"y\")\n",
    "        # Integer id of which action was selected\n",
    "        self.actions_pl = tf.placeholder(shape=[None], dtype=tf.int32, name=\"actions\")\n",
    "\n",
    "        X = tf.to_float(self.X_pl) / 255.0\n",
    "        batch_size = tf.shape(self.X_pl)[0]\n",
    "\n",
    "        # Three convolutional layers\n",
    "        conv1 = tf.contrib.layers.conv2d(\n",
    "            X, 32, 8, 4, activation_fn=tf.nn.relu)\n",
    "        conv2 = tf.contrib.layers.conv2d(\n",
    "            conv1, 64, 4, 2, activation_fn=tf.nn.relu)\n",
    "        conv3 = tf.contrib.layers.conv2d(\n",
    "            conv2, 64, 3, 1, activation_fn=tf.nn.relu)\n",
    "\n",
    "        # Fully connected layers\n",
    "        flattened = tf.contrib.layers.flatten(conv3)\n",
    "        fc1 = tf.contrib.layers.fully_connected(flattened, 512)\n",
    "        self.predictions = tf.contrib.layers.fully_connected(fc1, len(VALID_ACTIONS))\n",
    "\n",
    "        # Get the predictions for the chosen actions only\n",
    "        gather_indices = tf.range(batch_size) * tf.shape(self.predictions)[1] + self.actions_pl\n",
    "        self.action_predictions = tf.gather(tf.reshape(self.predictions, [-1]), gather_indices)\n",
    "\n",
    "        # Calculate the loss\n",
    "        self.losses = tf.squared_difference(self.y_pl, self.action_predictions)\n",
    "        self.loss = tf.reduce_mean(self.losses)\n",
    "\n",
    "        # Optimizer Parameters from original paper\n",
    "        self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6)\n",
    "        self.train_op = self.optimizer.minimize(self.loss, global_step=tf.contrib.framework.get_global_step())\n",
    "\n",
    "        # Summaries for Tensorboard\n",
    "        self.summaries = tf.summary.merge([\n",
    "            tf.summary.scalar(\"loss\", self.loss),\n",
    "            tf.summary.histogram(\"loss_hist\", self.losses),\n",
    "            tf.summary.histogram(\"q_values_hist\", self.predictions),\n",
    "            tf.summary.scalar(\"max_q_value\", tf.reduce_max(self.predictions))\n",
    "        ])\n",
    "\n",
    "\n",
    "    def predict(self, sess, s):\n",
    "        \"\"\"\n",
    "        Predicts action values.\n",
    "\n",
    "        Args:\n",
    "            sess: Tensorflow session\n",
    "            s: State input of shape [batch_size, 84, 84, 4]\n",
    "\n",
    "        Returns:\n",
    "            Tensor of shape [batch_size, NUM_VALID_ACTIONS] containing the estimated\n",
    "            action values.\n",
    "        \"\"\"\n",
    "        return sess.run(self.predictions, { self.X_pl: s })\n",
    "\n",
    "    def update(self, sess, s, a, y):\n",
    "        \"\"\"\n",
    "        Updates the estimator towards the given targets.\n",
    "\n",
    "        Args:\n",
    "            sess: Tensorflow session object\n",
    "            s: State input of shape [batch_size, 84, 84, 4]\n",
    "            a: Chosen actions of shape [batch_size]\n",
    "            y: Targets of shape [batch_size]\n",
    "\n",
    "        Returns:\n",
    "            The calculated loss on the batch.\n",
    "        \"\"\"\n",
    "        feed_dict = { self.X_pl: s, self.y_pl: y, self.actions_pl: a }\n",
" summaries, global_step, _, loss = sess.run(\n",
" [self.summaries, tf.contrib.framework.get_global_step(), self.train_op, self.loss],\n",
" feed_dict)\n",
" if self.summary_writer:\n",
" self.summary_writer.add_summary(summaries, global_step)\n",
" return loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For Testing....\n",
"\n",
"tf.reset_default_graph()\n",
"global_step = tf.Variable(0, name=\"global_step\", trainable=False)\n",
"\n",
"e = Estimator(scope=\"test\")\n",
"sp = StateProcessor()\n",
"\n",
"with tf.Session() as sess:\n",
" sess.run(tf.global_variables_initializer())\n",
" \n",
" # Example observation batch\n",
" observation = env.reset()\n",
" \n",
" observation_p = sp.process(sess, observation)\n",
" observation = np.stack([observation_p] * 4, axis=2)\n",
" observations = np.array([observation] * 2)\n",
" \n",
" # Test Prediction\n",
" print(e.predict(sess, observations))\n",
"\n",
" # Test training step\n",
" y = np.array([10.0, 10.0])\n",
" a = np.array([1, 3])\n",
" print(e.update(sess, observations, a, y))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def copy_model_parameters(sess, estimator1, estimator2):\n",
" \"\"\"\n",
" Copies the model parameters of one estimator to another.\n",
"\n",
" Args:\n",
" sess: Tensorflow session instance\n",
" estimator1: Estimator to copy the paramters from\n",
" estimator2: Estimator to copy the parameters to\n",
" \"\"\"\n",
" e1_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator1.scope)]\n",
" e1_params = sorted(e1_params, key=lambda v: v.name)\n",
" e2_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator2.scope)]\n",
" e2_params = sorted(e2_params, key=lambda v: v.name)\n",
"\n",
" update_ops = []\n",
" for e1_v, e2_v in zip(e1_params, e2_params):\n",
" op = e2_v.assign(e1_v)\n",
" update_ops.append(op)\n",
"\n",
" sess.run(update_ops)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def make_epsilon_greedy_policy(estimator, nA):\n",
" \"\"\"\n",
" Creates an epsilon-greedy policy based on a given Q-function approximator and epsilon.\n",
"\n",
" Args:\n",
" estimator: An estimator that returns q values for a given state\n",
" nA: Number of actions in the environment.\n",
"\n",
" Returns:\n",
" A function that takes the (sess, observation, epsilon) as an argument and returns\n",
" the probabilities for each action in the form of a numpy array of length nA.\n",
"\n",
" \"\"\"\n",
" def policy_fn(sess, observation, epsilon):\n",
" A = np.ones(nA, dtype=float) * epsilon / nA\n",
" q_values = estimator.predict(sess, np.expand_dims(observation, 0))[0]\n",
" best_action = np.argmax(q_values)\n",
" A[best_action] += (1.0 - epsilon)\n",
" return A\n",
" return policy_fn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def deep_q_learning(sess,\n",
" env,\n",
" q_estimator,\n",
" target_estimator,\n",
" state_processor,\n",
" num_episodes,\n",
" experiment_dir,\n",
" replay_memory_size=500000,\n",
" replay_memory_init_size=50000,\n",
" update_target_estimator_every=10000,\n",
" discount_factor=0.99,\n",
" epsilon_start=1.0,\n",
" epsilon_end=0.1,\n",
" epsilon_decay_steps=500000,\n",
" batch_size=32,\n",
" record_video_every=50):\n",
" \"\"\"\n",
" Q-Learning algorithm for off-policy TD control using Function Approximation.\n",
" Finds the optimal greedy policy while following an epsilon-greedy policy.\n",
"\n",
" Args:\n",
" sess: Tensorflow Session object\n",
" env: OpenAI environment\n",
" q_estimator: Estimator object used for the q values\n",
" target_estimator: Estimator object used for the targets\n",
" state_processor: A StateProcessor object\n",
" num_episodes: Number of episodes to run for\n",
" experiment_dir: Directory to save Tensorflow summaries in\n",
" replay_memory_size: Size of the replay memory\n",
" replay_memory_init_size: Number of random experiences to sampel when initializing \n",
" the reply memory.\n",
" update_target_estimator_every: Copy parameters from the Q estimator to the \n",
" target estimator every N steps\n",
" discount_factor: Gamma discount factor\n",
" epsilon_start: Chance to sample a random action when taking an action.\n",
" Epsilon is decayed over time and this is the start value\n",
" epsilon_end: The final minimum value of epsilon after decaying is done\n",
" epsilon_decay_steps: Number of steps to decay epsilon over\n",
" batch_size: Size of batches to sample from the replay memory\n",
" record_video_every: Record a video every N episodes\n",
"\n",
" Returns:\n",
" An EpisodeStats object with two numpy arrays for episode_lengths and episode_rewards.\n",
" \"\"\"\n",
"\n",
" Transition = namedtuple(\"Transition\", [\"state\", \"action\", \"reward\", \"next_state\", \"done\"])\n",
"\n",
" # The replay memory\n",
" replay_memory = []\n",
"\n",
" # Keeps track of useful statistics\n",
" stats = plotting.EpisodeStats(\n",
" episode_lengths=np.zeros(num_episodes),\n",
" episode_rewards=np.zeros(num_episodes))\n",
"\n",
" # Create directories for checkpoints and summaries\n",
" checkpoint_dir = os.path.join(experiment_dir, \"checkpoints\")\n",
" checkpoint_path = os.path.join(checkpoint_dir, \"model\")\n",
" monitor_path = os.path.join(experiment_dir, \"monitor\")\n",
"\n",
" if not os.path.exists(checkpoint_dir):\n",
" os.makedirs(checkpoint_dir)\n",
" if not os.path.exists(monitor_path):\n",
" os.makedirs(monitor_path)\n",
"\n",
" saver = tf.train.Saver()\n",
" # Load a previous checkpoint if we find one\n",
" latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)\n",
" if latest_checkpoint:\n",
" print(\"Loading model checkpoint {}...\\n\".format(latest_checkpoint))\n",
" saver.restore(sess, latest_checkpoint)\n",
" \n",
" # Get the current time step\n",
" total_t = sess.run(tf.contrib.framework.get_global_step())\n",
"\n",
" # The epsilon decay schedule\n",
" epsilons = np.linspace(epsilon_start, epsilon_end, epsilon_decay_steps)\n",
"\n",
" # The policy we're following\n",
" policy = make_epsilon_greedy_policy(\n",
" q_estimator,\n",
" len(VALID_ACTIONS))\n",
"\n",
" # Populate the replay memory with initial experience\n",
" print(\"Populating replay memory...\")\n",
" state = env.reset()\n",
" state = state_processor.process(sess, state)\n",
" state = np.stack([state] * 4, axis=2)\n",
" for i in range(replay_memory_init_size):\n",
" # TODO: Populate replay memory!\n",
" pass\n",
"\n",
" # Record videos\n",
" env= Monitor(env,\n",
" directory=monitor_path,\n",
" resume=True,\n",
" video_callable=lambda count: count % record_video_every == 0)\n",
"\n",
" for i_episode in range(num_episodes):\n",
"\n",
" # Save the current checkpoint\n",
" saver.save(tf.get_default_session(), checkpoint_path)\n",
"\n",
" # Reset the environment\n",
" state = env.reset()\n",
" state = state_processor.process(sess, state)\n",
" state = np.stack([state] * 4, axis=2)\n",
" loss = None\n",
"\n",
" # One step in the environment\n",
" for t in itertools.count():\n",
"\n",
" # Epsilon for this time step\n",
" epsilon = epsilons[min(total_t, epsilon_decay_steps-1)]\n",
"\n",
" # Add epsilon to Tensorboard\n",
" episode_summary = tf.Summary()\n",
" episode_summary.value.add(simple_value=epsilon, tag=\"epsilon\")\n",
" q_estimator.summary_writer.add_summary(episode_summary, total_t)\n",
"\n",
" # TODO: Maybe update the target estimator\n",
" if total_t % update_target_estimator_every == 0:\n",
" pass\n",
"\n",
" # Print out which step we're on, useful for debugging.\n",
" print(\"\\rStep {} ({}) @ Episode {}/{}, loss: {}\".format(\n",
" t, total_t, i_episode + 1, num_episodes, loss), end=\"\")\n",
" sys.stdout.flush()\n",
"\n",
" # Take a step in the environment\n",
" # TODO: Implement!\n",
"\n",
" # If our replay memory is full, pop the first element\n",
" if len(replay_memory) == replay_memory_size:\n",
" replay_memory.pop(0)\n",
"\n",
" # TODO: Save transition to replay memory\n",
"\n",
" # Update statistics\n",
" stats.episode_rewards[i_episode] += reward\n",
" stats.episode_lengths[i_episode] = t\n",
"\n",
" # TODO: Sample a minibatch from the replay memory\n",
" # TODO: Calculate q values and targets\n",
" # TODO Perform gradient descent update\n",
"\n",
" if done:\n",
" break\n",
"\n",
" state = next_state\n",
" total_t += 1\n",
"\n",
" # Add summaries to tensorboard\n",
" episode_summary = tf.Summary()\n",
" episode_summary.value.add(simple_value=stats.episode_rewards[i_episode], node_name=\"episode_reward\", tag=\"episode_reward\")\n",
" episode_summary.value.add(simple_value=stats.episode_lengths[i_episode], node_name=\"episode_length\", tag=\"episode_length\")\n",
" q_estimator.summary_writer.add_summary(episode_summary, total_t)\n",
" q_estimator.summary_writer.flush()\n",
"\n",
" yield total_t, plotting.EpisodeStats(\n",
" episode_lengths=stats.episode_lengths[:i_episode+1],\n",
" episode_rewards=stats.episode_rewards[:i_episode+1])\n",
"\n",
" env.monitor.close()\n",
" return stats"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tf.reset_default_graph()\n",
"\n",
"# Where we save our checkpoints and graphs\n",
"experiment_dir = os.path.abspath(\"./experiments/{}\".format(env.spec.id))\n",
"\n",
"# Create a glboal step variable\n",
"global_step = tf.Variable(0, name='global_step', trainable=False)\n",
" \n",
"# Create estimators\n",
"q_estimator = Estimator(scope=\"q\", summaries_dir=experiment_dir)\n",
"target_estimator = Estimator(scope=\"target_q\")\n",
"\n",
"# State processor\n",
"state_processor = StateProcessor()\n",
"\n",
"# Run it!\n",
"with tf.Session() as sess:\n",
" sess.run(tf.initialize_all_variables())\n",
" for t, stats in deep_q_learning(sess,\n",
" env,\n",
" q_estimator=q_estimator,\n",
" target_estimator=target_estimator,\n",
" state_processor=state_processor,\n",
" experiment_dir=experiment_dir,\n",
" num_episodes=10000,\n",
" replay_memory_size=500000,\n",
" replay_memory_init_size=50000,\n",
" update_target_estimator_every=10000,\n",
" epsilon_start=1.0,\n",
" epsilon_end=0.1,\n",
" epsilon_decay_steps=500000,\n",
" discount_factor=0.99,\n",
" batch_size=32):\n",
"\n",
" print(\"\\nEpisode Reward: {}\".format(stats.episode_rewards[-1]))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: DQN/Double DQN Solution.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import gym\n",
"import itertools\n",
"import numpy as np\n",
"import os\n",
"import random\n",
"import sys\n",
"import tensorflow as tf\n",
"\n",
"if \"../\" not in sys.path:\n",
" sys.path.append(\"../\")\n",
"\n",
"from lib import plotting\n",
"from collections import deque, namedtuple"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env = gym.envs.make(\"Breakout-v0\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Atari Actions: 0 (noop), 1 (fire), 2 (left) and 3 (right) are valid actions\n",
"VALID_ACTIONS = [0, 1, 2, 3]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class StateProcessor():\n",
" \"\"\"\n",
" Processes a raw Atari images. Resizes it and converts it to grayscale.\n",
" \"\"\"\n",
" def __init__(self):\n",
" # Build the Tensorflow graph\n",
" with tf.variable_scope(\"state_processor\"):\n",
" self.input_state = tf.placeholder(shape=[210, 160, 3], dtype=tf.uint8)\n",
" self.output = tf.image.rgb_to_grayscale(self.input_state)\n",
" self.output = tf.image.crop_to_bounding_box(self.output, 34, 0, 160, 160)\n",
" self.output = tf.image.resize_images(\n",
" self.output, 84, 84, method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)\n",
" self.output = tf.squeeze(self.output)\n",
"\n",
" def process(self, sess, state):\n",
" \"\"\"\n",
" Args:\n",
" sess: A Tensorflow session object\n",
" state: A [210, 160, 3] Atari RGB State\n",
"\n",
" Returns:\n",
" A processed [84, 84] state representing grayscale values.\n",
" \"\"\"\n",
" return sess.run(self.output, { self.input_state: state })"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class Estimator():\n",
" \"\"\"Q-Value Estimator neural network.\n",
"\n",
" This network is used for both the Q-Network and the Target Network.\n",
" \"\"\"\n",
"\n",
" def __init__(self, scope=\"estimator\", summaries_dir=None):\n",
" self.scope = scope\n",
" # Writes Tensorboard summaries to disk\n",
" self.summary_writer = None\n",
" with tf.variable_scope(scope):\n",
" # Build the graph\n",
" self._build_model()\n",
" if summaries_dir:\n",
" summary_dir = os.path.join(summaries_dir, \"summaries_{}\".format(scope))\n",
" if not os.path.exists(summary_dir):\n",
" os.makedirs(summary_dir)\n",
" self.summary_writer = tf.train.SummaryWriter(summary_dir)\n",
"\n",
" def _build_model(self):\n",
" \"\"\"\n",
" Builds the Tensorflow graph.\n",
" \"\"\"\n",
"\n",
" # Placeholders for our input\n",
" # Our input are 4 grayscale frames of shape 84, 84 each\n",
" self.X_pl = tf.placeholder(shape=[None, 84, 84, 4], dtype=tf.uint8, name=\"X\")\n",
" # The TD target value\n",
" self.y_pl = tf.placeholder(shape=[None], dtype=tf.float32, name=\"y\")\n",
" # Integer id of which action was selected\n",
" self.actions_pl = tf.placeholder(shape=[None], dtype=tf.int32, name=\"actions\")\n",
"\n",
" X = tf.to_float(self.X_pl) / 255.0\n",
" \n",
" # TODO: Implement the Tensorflow graph!\n",
" batch_size = tf.shape(self.X_pl)[0]\n",
" self.predictions = tf.zeros(shape=[batch_size, len(VALID_ACTIONS)])\n",
" self.loss = tf.constant(0.0)\n",
" self.train_op = tf.no_op(\"train_pp\")\n",
" \n",
" # Summaries for Tensorboard\n",
" self.summaries = tf.merge_summary([\n",
" tf.scalar_summary(\"loss\", self.loss)\n",
" ])\n",
"\n",
"\n",
" def predict(self, sess, s):\n",
" \"\"\"\n",
" Predicts action values.\n",
"\n",
" Args:\n",
" sess: Tensorflow session\n",
" s: State input of shape [batch_size, 4, 84, 84, 1]\n",
"\n",
" Returns:\n",
" Tensor of shape [batch_size, NUM_VALID_ACTIONS] containing the estimated \n",
" action values.\n",
" \"\"\"\n",
" return sess.run(self.predictions, { self.X_pl: s })\n",
"\n",
" def update(self, sess, s, a, y):\n",
" \"\"\"\n",
" Updates the estimator towards the given targets.\n",
"\n",
" Args:\n",
" sess: Tensorflow session object\n",
" s: State input of shape [batch_size, 4, 84, 84, 1]\n",
" a: Chosen actions of shape [batch_size]\n",
" y: Targets of shape [batch_size]\n",
"\n",
" Returns:\n",
" The calculated loss on the batch.\n",
" \"\"\"\n",
" feed_dict = { self.X_pl: s, self.y_pl: y, self.actions_pl: a }\n",
" summaries, global_step, _, loss = sess.run(\n",
" [self.summaries, tf.contrib.framework.get_global_step(), self.train_op, self.loss],\n",
" feed_dict)\n",
" if self.summary_writer:\n",
" self.summary_writer.add_summary(summaries, global_step)\n",
" return loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For Testing....\n",
"\n",
"tf.reset_default_graph()\n",
"global_step = tf.Variable(0, name=\"global_step\", trainable=False)\n",
"\n",
"e = Estimator(scope=\"test\")\n",
"sp = StateProcessor()\n",
"\n",
"with tf.Session() as sess:\n",
" sess.run(tf.initialize_all_variables())\n",
" \n",
" # Example observation batch\n",
" observation = env.reset()\n",
" \n",
" observation_p = sp.process(sess, observation)\n",
" observation = np.stack([observation_p] * 4, axis=2)\n",
" observations = np.array([observation] * 2)\n",
" \n",
" # Test Prediction\n",
" print(e.predict(sess, observations))\n",
"\n",
" # Test training step\n",
" y = np.array([10.0, 10.0])\n",
" a = np.array([1, 3])\n",
" print(e.update(sess, observations, a, y))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def copy_model_parameters(sess, estimator1, estimator2):\n",
" \"\"\"\n",
" Copies the model parameters of one estimator to another.\n",
"\n",
" Args:\n",
" sess: Tensorflow session instance\n",
" estimator1: Estimator to copy the paramters from\n",
" estimator2: Estimator to copy the parameters to\n",
" \"\"\"\n",
" e1_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator1.scope)]\n",
" e1_params = sorted(e1_params, key=lambda v: v.name)\n",
" e2_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator2.scope)]\n",
" e2_params = sorted(e2_params, key=lambda v: v.name)\n",
"\n",
" update_ops = []\n",
" for e1_v, e2_v in zip(e1_params, e2_params):\n",
" op = e2_v.assign(e1_v)\n",
" update_ops.append(op)\n",
"\n",
" sess.run(update_ops)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def make_epsilon_greedy_policy(estimator, nA):\n",
" \"\"\"\n",
" Creates an epsilon-greedy policy based on a given Q-function approximator and epsilon.\n",
"\n",
" Args:\n",
" estimator: An estimator that returns q values for a given state\n",
" nA: Number of actions in the environment.\n",
"\n",
" Returns:\n",
" A function that takes the (sess, observation, epsilon) as an argument and returns\n",
" the probabilities for each action in the form of a numpy array of length nA.\n",
"\n",
" \"\"\"\n",
" def policy_fn(sess, observation, epsilon):\n",
" A = np.ones(nA, dtype=float) * epsilon / nA\n",
" q_values = estimator.predict(sess, np.expand_dims(observation, 0))[0]\n",
" best_action = np.argmax(q_values)\n",
" A[best_action] += (1.0 - epsilon)\n",
" return A\n",
" return policy_fn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def deep_q_learning(sess,\n",
" env,\n",
" q_estimator,\n",
" target_estimator,\n",
" state_processor,\n",
" num_episodes,\n",
" experiment_dir,\n",
" replay_memory_size=500000,\n",
" replay_memory_init_size=50000,\n",
" update_target_estimator_every=10000,\n",
" discount_factor=0.99,\n",
" epsilon_start=1.0,\n",
" epsilon_end=0.1,\n",
" epsilon_decay_steps=500000,\n",
" batch_size=32,\n",
" record_video_every=50):\n",
" \"\"\"\n",
" Q-Learning algorithm for off-policy TD control using Function Approximation.\n",
" Finds the optimal greedy policy while following an epsilon-greedy policy.\n",
"\n",
" Args:\n",
" sess: Tensorflow Session object\n",
" env: OpenAI environment\n",
" q_estimator: Estimator object used for the q values\n",
" target_estimator: Estimator object used for the targets\n",
" state_processor: A StateProcessor object\n",
" num_episodes: Number of episodes to run for\n",
" experiment_dir: Directory to save Tensorflow summaries in\n",
" replay_memory_size: Size of the replay memory\n",
" replay_memory_init_size: Number of random experiences to sampel when initializing \n",
" the reply memory.\n",
" update_target_estimator_every: Copy parameters from the Q estimator to the \n",
" target estimator every N steps\n",
" discount_factor: Gamma discount factor\n",
" epsilon_start: Chance to sample a random action when taking an action.\n",
" Epsilon is decayed over time and this is the start value\n",
" epsilon_end: The final minimum value of epsilon after decaying is done\n",
" epsilon_decay_steps: Number of steps to decay epsilon over\n",
" batch_size: Size of batches to sample from the replay memory\n",
" record_video_every: Record a video every N episodes\n",
"\n",
" Returns:\n",
" An EpisodeStats object with two numpy arrays for episode_lengths and episode_rewards.\n",
" \"\"\"\n",
"\n",
" Transition = namedtuple(\"Transition\", [\"state\", \"action\", \"reward\", \"next_state\", \"done\"])\n",
"\n",
" # The replay memory\n",
" replay_memory = []\n",
"\n",
" # Keeps track of useful statistics\n",
" stats = plotting.EpisodeStats(\n",
" episode_lengths=np.zeros(num_episodes),\n",
" episode_rewards=np.zeros(num_episodes))\n",
"\n",
" # Create directories for checkpoints and summaries\n",
" checkpoint_dir = os.path.join(experiment_dir, \"checkpoints\")\n",
" checkpoint_path = os.path.join(checkpoint_dir, \"model\")\n",
" monitor_path = os.path.join(experiment_dir, \"monitor\")\n",
"\n",
" if not os.path.exists(checkpoint_dir):\n",
" os.makedirs(checkpoint_dir)\n",
" if not os.path.exists(monitor_path):\n",
" os.makedirs(monitor_path)\n",
"\n",
" saver = tf.train.Saver()\n",
" # Load a previous checkpoint if we find one\n",
" latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)\n",
" if latest_checkpoint:\n",
" print(\"Loading model checkpoint {}...\\n\".format(latest_checkpoint))\n",
" saver.restore(sess, latest_checkpoint)\n",
" \n",
" # Get the current time step\n",
" total_t = sess.run(tf.contrib.framework.get_global_step())\n",
"\n",
" # The epsilon decay schedule\n",
" epsilons = np.linspace(epsilon_start, epsilon_end, epsilon_decay_steps)\n",
"\n",
" # The policy we're following\n",
" policy = make_epsilon_greedy_policy(\n",
" q_estimator,\n",
" len(VALID_ACTIONS))\n",
"\n",
" # Populate the replay memory with initial experience\n",
" print(\"Populating replay memory...\")\n",
" state = env.reset()\n",
" state = state_processor.process(sess, state)\n",
" state = np.stack([state] * 4, axis=2)\n",
" for i in range(replay_memory_init_size):\n",
" action_probs = policy(sess, state, epsilons[total_t])\n",
" action = np.random.choice(np.arange(len(action_probs)), p=action_probs)\n",
" next_state, reward, done, _ = env.step(VALID_ACTIONS[action])\n",
" next_state = state_processor.process(sess, next_state)\n",
" next_state = np.append(state[:,:,1:], np.expand_dims(next_state, 2), axis=2)\n",
" replay_memory.append(Transition(state, action, reward, next_state, done))\n",
" if done:\n",
" state = env.reset()\n",
" state = state_processor.process(sess, state)\n",
" state = np.stack([state] * 4, axis=2)\n",
" else:\n",
" state = next_state\n",
"\n",
" # Record videos\n",
" env.monitor.start(monitor_path,\n",
" resume=True,\n",
" video_callable=lambda count: count % record_video_every == 0)\n",
"\n",
" for i_episode in range(num_episodes):\n",
"\n",
" # Save the current checkpoint\n",
" saver.save(tf.get_default_session(), checkpoint_path)\n",
"\n",
" # Reset the environment\n",
" state = env.reset()\n",
" state = state_processor.process(sess, state)\n",
" state = np.stack([state] * 4, axis=2)\n",
" loss = None\n",
"\n",
" # One step in the environment\n",
" for t in itertools.count():\n",
"\n",
" # Epsilon for this time step\n",
" epsilon = epsilons[min(total_t, epsilon_decay_steps-1)]\n",
"\n",
" # Add epsilon to Tensorboard\n",
" episode_summary = tf.Summary()\n",
" episode_summary.value.add(simple_value=epsilon, tag=\"epsilon\")\n",
" q_estimator.summary_writer.add_summary(episode_summary, total_t)\n",
"\n",
" # Maybe update the target estimator\n",
" if total_t % update_target_estimator_every == 0:\n",
" copy_model_parameters(sess, q_estimator, target_estimator)\n",
" print(\"\\nCopied model parameters to target network.\")\n",
"\n",
" # Print out which step we're on, useful for debugging.\n",
" print(\"\\rStep {} ({}) @ Episode {}/{}, loss: {}\".format(\n",
" t, total_t, i_episode + 1, num_episodes, loss), end=\"\")\n",
" sys.stdout.flush()\n",
"\n",
" # Take a step\n",
" action_probs = policy(sess, state, epsilon)\n",
" action = np.random.choice(np.arange(len(action_probs)), p=action_probs)\n",
" next_state, reward, done, _ = env.step(VALID_ACTIONS[action])\n",
" next_state = state_processor.process(sess, next_state)\n",
" next_state = np.append(state[:,:,1:], np.expand_dims(next_state, 2), axis=2)\n",
"\n",
" # If our replay memory is full, pop the first element\n",
" if len(replay_memory) == replay_memory_size:\n",
" replay_memory.pop(0)\n",
"\n",
" # Save transition to replay memory\n",
" replay_memory.append(Transition(state, action, reward, next_state, done)) \n",
"\n",
" # Update statistics\n",
" stats.episode_rewards[i_episode] += reward\n",
" stats.episode_lengths[i_episode] = t\n",
"\n",
" # Sample a minibatch from the replay memory\n",
" samples = random.sample(replay_memory, batch_size)\n",
" states_batch, action_batch, reward_batch, next_states_batch, done_batch = map(np.array, zip(*samples))\n",
"\n",
" # Calculate q values and targets\n",
" # This is where Double Q-Learning comes in!\n",
" q_values_next = q_estimator.predict(sess, next_states_batch)\n",
" best_actions = np.argmax(q_values_next, axis=1)\n",
" q_values_next_target = target_estimator.predict(sess, next_states_batch)\n",
" targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * \\\n",
" discount_factor * q_values_next_target[np.arange(batch_size), best_actions]\n",
"\n",
" # Perform gradient descent update\n",
" states_batch = np.array(states_batch)\n",
" loss = q_estimator.update(sess, states_batch, action_batch, targets_batch)\n",
"\n",
" if done:\n",
" break\n",
"\n",
" state = next_state\n",
" total_t += 1\n",
"\n",
" # Add summaries to tensorboard\n",
" episode_summary = tf.Summary()\n",
" episode_summary.value.add(simple_value=stats.episode_rewards[i_episode], node_name=\"episode_reward\", tag=\"episode_reward\")\n",
" episode_summary.value.add(simple_value=stats.episode_lengths[i_episode], node_name=\"episode_length\", tag=\"episode_length\")\n",
" q_estimator.summary_writer.add_summary(episode_summary, total_t)\n",
" q_estimator.summary_writer.flush()\n",
"\n",
" yield total_t, plotting.EpisodeStats(\n",
" episode_lengths=stats.episode_lengths[:i_episode+1],\n",
" episode_rewards=stats.episode_rewards[:i_episode+1])\n",
"\n",
" env.monitor.close()\n",
" return stats"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tf.reset_default_graph()\n",
"\n",
"# Where we save our checkpoints and graphs\n",
"experiment_dir = os.path.abspath(\"./experiments/{}\".format(env.spec.id))\n",
"\n",
"# Create a glboal step variable\n",
"global_step = tf.Variable(0, name='global_step', trainable=False)\n",
" \n",
"# Create estimators\n",
"q_estimator = Estimator(scope=\"q\", summaries_dir=experiment_dir)\n",
"target_estimator = Estimator(scope=\"target_q\")\n",
"\n",
"# State processor\n",
"state_processor = StateProcessor()\n",
"\n",
"# Run it!\n",
"with tf.Session() as sess:\n",
" sess.run(tf.initialize_all_variables())\n",
" for t, stats in deep_q_learning(sess,\n",
" env,\n",
" q_estimator=q_estimator,\n",
" target_estimator=target_estimator,\n",
" state_processor=state_processor,\n",
" experiment_dir=experiment_dir,\n",
" num_episodes=10000,\n",
" replay_memory_size=500000,\n",
" replay_memory_init_size=50000,\n",
" update_target_estimator_every=10000,\n",
" epsilon_start=1.0,\n",
" epsilon_end=0.1,\n",
" epsilon_decay_steps=500000,\n",
" discount_factor=0.99,\n",
" batch_size=32):\n",
"\n",
" print(\"\\nEpisode Reward: {}\".format(stats.episode_rewards[-1]))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: DQN/README.md
================================================
## Deep Q-Learning
### Learning Goals
- Understand the Deep Q-Learning (DQN) algorithm
- Understand why Experience Replay and a Target Network are necessary to make Deep Q-Learning work in practice
- (Optional) Understand Double Deep Q-Learning
- (Optional) Understand Prioritized Experience Replay
### Summary
- DQN: Q-Learning but with a Deep Neural Network as a function approximator.
- Using a non-linear Deep Neural Network is powerful, but training is unstable if we apply it naively.
- Trick 1 - Experience Replay: Store experience `(S, A, R, S_next)` in a replay buffer and sample minibatches from it to train the network. This decorrelates the data and leads to better data efficiency. In the beginning, the replay buffer is filled with random experience.
- Trick 2 - Target Network: Use a separate network to estimate the TD target. This target network has the same architecture as the function approximator but with frozen parameters. Every T steps (a hyperparameter) the parameters from the Q network are copied to the target network. This leads to more stable training because it keeps the target function fixed (for a while).
- Using a Convolutional Neural Network as the function approximator on the raw pixels of Atari games, with the game score as the reward, we can learn to play many of those games at human-level performance.
- Double DQN: Just like regular Q-Learning, DQN tends to overestimate values because the same max operation is used both to select and to evaluate actions. We get around this by using the Q network for selection and the target network for evaluation when making updates.
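
To make the difference between the DQN and Double DQN targets concrete, here is a minimal NumPy sketch. The function names `dqn_targets` and `double_dqn_targets` and the toy numbers are illustrative assumptions, not part of the exercise code. `q_next_online` and `q_next_target` stand for the Q network's and the target network's predictions for the next states.

```python
import numpy as np


def dqn_targets(rewards, dones, q_next_target, discount_factor=0.99):
    """Standard DQN: the target network both selects and evaluates the next action."""
    max_next = q_next_target.max(axis=1)
    return rewards + (1.0 - dones) * discount_factor * max_next


def double_dqn_targets(rewards, dones, q_next_online, q_next_target, discount_factor=0.99):
    """Double DQN: the Q network selects the action, the target network evaluates it."""
    best_actions = q_next_online.argmax(axis=1)
    evaluated = q_next_target[np.arange(len(best_actions)), best_actions]
    return rewards + (1.0 - dones) * discount_factor * evaluated


# Toy batch of two transitions with three actions (made-up numbers).
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])  # 1.0 marks a terminal transition
q_next_online = np.array([[1.0, 5.0, 2.0], [0.5, 0.2, 0.1]])
q_next_target = np.array([[4.0, 3.0, 6.0], [1.0, 2.0, 3.0]])

print(dqn_targets(rewards, dones, q_next_target))
print(double_dqn_targets(rewards, dones, q_next_online, q_next_target))
```

In this toy batch the first transition gets a smaller target under Double DQN, because the action preferred by the Q network is not the one the target network rates highest. The terminal transition gets a target equal to its reward in both cases.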
### Lectures & Readings
**Required:**
- [Human-Level Control through Deep Reinforcement Learning](http://www.readcube.com/articles/10.1038/nature14236)
- [Demystifying Deep Reinforcement Learning](https://ai.intel.com/demystifying-deep-reinforcement-learning/)
- David Silver's RL Course Lecture 6 - Value Function Approximation ([video](https://www.youtube.com/watch?v=UoPei5o4fps), [slides](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf))
**Optional:**
- [Using Keras and Deep Q-Network to Play FlappyBird](https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html)
- [Deep Reinforcement Learning with Double Q-learning](http://arxiv.org/abs/1509.06461)
- [Prioritized Experience Replay](http://arxiv.org/abs/1511.05952)
**Deep Learning:**
- [Tensorflow](http://www.tensorflow.org)
- [Deep Learning Book](http://www.deeplearningbook.org/)
### Exercises
- Get familiar with the [OpenAI Gym Atari Environment Playground](Breakout%20Playground.ipynb)
- Deep-Q Learning for Atari Games
  - [Exercise](Deep%20Q%20Learning.ipynb)
  - [Solution](Deep%20Q%20Learning%20Solution.ipynb)
- Double-Q Learning
  - This is a minimal change to Q-Learning so use the same exercise as above
  - [Solution](Double%20DQN%20Solution.ipynb)
- Prioritized Experience Replay (WIP)
================================================
FILE: DQN/dqn.py
================================================
import gym
from gym.wrappers import Monitor
import itertools
import numpy as np
import os
import random
import sys
import tensorflow as tf
if "../" not in sys.path:
    sys.path.append("../")
from lib import plotting
from collections import deque, namedtuple
env = gym.envs.make("Breakout-v0")
# Atari Actions: 0 (noop), 1 (fire), 2 (left) and 3 (right) are valid actions
VALID_ACTIONS = [0, 1, 2, 3]
class StateProcessor():
    """
    Processes a raw Atari image. Crops it, resizes it, and converts it to grayscale.
    """
    def __init__(self):
        # Build the Tensorflow graph
        with tf.variable_scope("state_processor"):
            self.input_state = tf.placeholder(shape=[210, 160, 3], dtype=tf.uint8)
            self.output = tf.image.rgb_to_grayscale(self.input_state)
            self.output = tf.image.crop_to_bounding_box(self.output, 34, 0, 160, 160)
            self.output = tf.image.resize_images(
                self.output, [84, 84], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
            self.output = tf.squeeze(self.output)

    def process(self, sess, state):
        """
        Args:
            sess: A Tensorflow session object
            state: A [210, 160, 3] Atari RGB State

        Returns:
            A processed [84, 84] state representing grayscale values.
        """
        return sess.run(self.output, { self.input_state: state })

class Estimator():
    """Q-Value Estimator neural network.

    This network is used for both the Q-Network and the Target Network.
    """

    def __init__(self, scope="estimator", summaries_dir=None):
        self.scope = scope
        # Writes Tensorboard summaries to disk
        self.summary_writer = None
        with tf.variable_scope(scope):
            # Build the graph
            self._build_model()
            if summaries_dir:
                summary_dir = os.path.join(summaries_dir, "summaries_{}".format(scope))
                if not os.path.exists(summary_dir):
                    os.makedirs(summary_dir)
                self.summary_writer = tf.summary.FileWriter(summary_dir)

    def _build_model(self):
        """
        Builds the Tensorflow graph.
        """

        # Placeholders for our input
        # Our input is a stack of 4 grayscale frames of shape 84 x 84 each
        self.X_pl = tf.placeholder(shape=[None, 84, 84, 4], dtype=tf.uint8, name="X")
        # The TD target value
        self.y_pl = tf.placeholder(shape=[None], dtype=tf.float32, name="y")
        # Integer id of which action was selected
        self.actions_pl = tf.placeholder(shape=[None], dtype=tf.int32, name="actions")

        # Scale pixel values from [0, 255] to [0, 1]
        X = tf.to_float(self.X_pl) / 255.0
        batch_size = tf.shape(self.X_pl)[0]

        # Three convolutional layers
        conv1 = tf.contrib.layers.conv2d(
            X, 32, 8, 4, activation_fn=tf.nn.relu)
        conv2 = tf.contrib.layers.conv2d(
            conv1, 64, 4, 2, activation_fn=tf.nn.relu)
        conv3 = tf.contrib.layers.conv2d(
            conv2, 64, 3, 1, activation_fn=tf.nn.relu)

        # Fully connected layers
        flattened = tf.contrib.layers.flatten(conv3)
        fc1 = tf.contrib.layers.fully_connected(flattened, 512)
        # Linear output layer: Q-values are unbounded, so disable the
        # default ReLU activation of fully_connected.
        self.predictions = tf.contrib.layers.fully_connected(
            fc1, len(VALID_ACTIONS), activation_fn=None)

        # Get the predictions for the chosen actions only.
        # Flatten predictions to 1-D and index row i at column actions[i].
        gather_indices = tf.range(batch_size) * tf.shape(self.predictions)[1] + self.actions_pl
        self.action_predictions = tf.gather(tf.reshape(self.predictions, [-1]), gather_indices)

        # Calculate the loss
        self.losses = tf.squared_difference(self.y_pl, self.action_predictions)
        self.loss = tf.reduce_mean(self.losses)

        # Optimizer Parameters from original paper
        self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6)
        self.train_op = self.optimizer.minimize(self.loss, global_step=tf.contrib.framework.get_global_step())

        # Summaries for Tensorboard
        self.summaries = tf.summary.merge([
            tf.summary.scalar("loss", self.loss),
            tf.summary.histogram("loss_hist", self.losses),
            tf.summary.histogram("q_values_hist", self.predictions),
            tf.summary.scalar("max_q_value", tf.reduce_max(self.predictions))
        ])

    def predict(self, sess, s):
        """
        Predicts action values.

        Args:
            sess: Tensorflow session
            s: State input of shape [batch_size, 84, 84, 4]

        Returns:
            Tensor of shape [batch_size, NUM_VALID_ACTIONS] containing the estimated
            action values.
        """
        return sess.run(self.predictions, { self.X_pl: s })

    def update(self, sess, s, a, y):
        """
        Updates the estimator towards the given targets.

        Args:
            sess: Tensorflow session object
            s: State input of shape [batch_size, 84, 84, 4]
            a: Chosen actions of shape [batch_size]
            y: Targets of shape [batch_size]

        Returns:
            The calculated loss on the batch.
        """
        feed_dict = { self.X_pl: s, self.y_pl: y, self.actions_pl: a }
        summaries, global_step, _, loss = sess.run(
            [self.summaries, tf.contrib.framework.get_global_step(), self.train_op, self.loss],
            feed_dict)
        if self.summary_writer:
            self.summary_writer.add_summary(summaries, global_step)
        return loss

def copy_model_parameters(sess, estimator1, estimator2):
    """
    Copies the model parameters of one estimator to another.

    Args:
        sess: Tensorflow session instance
        estimator1: Estimator to copy the parameters from
        estimator2: Estimator to copy the parameters to
    """
    # Sort both variable lists by name so that corresponding variables line up
    e1_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator1.scope)]
    e1_params = sorted(e1_params, key=lambda v: v.name)
    e2_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator2.scope)]
    e2_params = sorted(e2_params, key=lambda v: v.name)

    update_ops = []
    for e1_v, e2_v in zip(e1_params, e2_params):
        op = e2_v.assign(e1_v)
        update_ops.append(op)

    sess.run(update_ops)

def make_epsilon_greedy_policy(estimator, nA):
    """
    Creates an epsilon-greedy policy based on a given Q-function approximator and epsilon.

    Args:
        estimator: An estimator that returns q values for a given state
        nA: Number of actions in the environment.

    Returns:
        A function that takes the (sess, observation, epsilon) as an argument and returns
        the probabilities for each action in the form of a numpy array of length nA.
    """
    def policy_fn(sess, observation, epsilon):
        A = np.ones(nA, dtype=float) * epsilon / nA
        q_values = estimator.predict(sess, np.expand_dims(observation, 0))[0]
        best_action = np.argmax(q_values)
        A[best_action] += (1.0 - epsilon)
        return A
    return policy_fn

def deep_q_learning(sess,
                    env,
                    q_estimator,
                    target_estimator,
                    state_processor,
                    num_episodes,
                    experiment_dir,
                    replay_memory_size=500000,
                    replay_memory_init_size=50000,
                    update_target_estimator_every=10000,
                    discount_factor=0.99,
                    epsilon_start=1.0,
                    epsilon_end=0.1,
                    epsilon_decay_steps=500000,
                    batch_size=32,
                    record_video_every=50):
    """
    Q-Learning algorithm for off-policy TD control using Function Approximation.
    Finds the optimal greedy policy while following an epsilon-greedy policy.

    Args:
        sess: Tensorflow Session object
        env: OpenAI environment
        q_estimator: Estimator object used for the q values
        target_estimator: Estimator object used for the targets
        state_processor: A StateProcessor object
        num_episodes: Number of episodes to run for
        experiment_dir: Directory to save Tensorflow summaries in
        replay_memory_size: Size of the replay memory
        replay_memory_init_size: Number of random experiences to sample when initializing
          the replay memory.
        update_target_estimator_every: Copy parameters from the Q estimator to the
          target estimator every N steps
        discount_factor: Gamma discount factor
        epsilon_start: Chance to sample a random action when taking an action.
          Epsilon is decayed over time and this is the start value
        epsilon_end: The final minimum value of epsilon after decaying is done
        epsilon_decay_steps: Number of steps to decay epsilon over
        batch_size: Size of batches to sample from the replay memory
        record_video_every: Record a video every N episodes

    Yields:
        After each episode, a tuple (total_t, stats) where stats is an EpisodeStats
        object with two numpy arrays for episode_lengths and episode_rewards.
    """

    Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

    # The replay memory
    replay_memory = []

    # Keeps track of useful statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    # Create directories for checkpoints and summaries
    checkpoint_dir = os.path.join(experiment_dir, "checkpoints")
    checkpoint_path = os.path.join(checkpoint_dir, "model")
    monitor_path = os.path.join(experiment_dir, "monitor")

    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)
    if not os.path.exists(monitor_path):
        os.makedirs(monitor_path)

    saver = tf.train.Saver()
    # Load a previous checkpoint if we find one
    latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
    if latest_checkpoint:
        print("Loading model checkpoint {}...\n".format(latest_checkpoint))
        saver.restore(sess, latest_checkpoint)

    total_t = sess.run(tf.contrib.framework.get_global_step())

    # The epsilon decay schedule
    epsilons = np.linspace(epsilon_start, epsilon_end, epsilon_decay_steps)

    # The policy we're following
    policy = make_epsilon_greedy_policy(
        q_estimator,
        len(VALID_ACTIONS))

    # Populate the replay memory with initial experience
    print("Populating replay memory...")
    state = env.reset()
    state = state_processor.process(sess, state)
    # The initial state repeats the first frame 4 times
    state = np.stack([state] * 4, axis=2)
    for i in range(replay_memory_init_size):
        action_probs = policy(sess, state, epsilons[min(total_t, epsilon_decay_steps-1)])
        action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
        next_state, reward, done, _ = env.step(VALID_ACTIONS[action])
        next_state = state_processor.process(sess, next_state)
        # Drop the oldest frame and append the newest one
        next_state = np.append(state[:,:,1:], np.expand_dims(next_state, 2), axis=2)
        replay_memory.append(Transition(state, action, reward, next_state, done))
        if done:
            state = env.reset()
            state = state_processor.process(sess, state)
            state = np.stack([state] * 4, axis=2)
        else:
            state = next_state

    # Record videos
    # Use the gym env Monitor wrapper
    env = Monitor(env,
                  directory=monitor_path,
                  resume=True,
                  video_callable=lambda count: count % record_video_every == 0)

    for i_episode in range(num_episodes):

        # Save the current checkpoint
        saver.save(sess, checkpoint_path)

        # Reset the environment
        state = env.reset()
        state = state_processor.process(sess, state)
        state = np.stack([state] * 4, axis=2)
        loss = None

        # One step in the environment
        for t in itertools.count():

            # Epsilon for this time step
            epsilon = epsilons[min(total_t, epsilon_decay_steps-1)]

            # Add epsilon to Tensorboard
            episode_summary = tf.Summary()
            episode_summary.value.add(simple_value=epsilon, tag="epsilon")
            q_estimator.summary_writer.add_summary(episode_summary, total_t)

            # Maybe update the target estimator
            if total_t % update_target_estimator_every == 0:
                copy_model_parameters(sess, q_estimator, target_estimator)
                print("\nCopied model parameters to target network.")

            # Print out which step we're on, useful for debugging.
            print("\rStep {} ({}) @ Episode {}/{}, loss: {}".format(
                t, total_t, i_episode + 1, num_episodes, loss), end="")
            sys.stdout.flush()

            # Take a step
            action_probs = policy(sess, state, epsilon)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(VALID_ACTIONS[action])
            next_state = state_processor.process(sess, next_state)
            next_state = np.append(state[:,:,1:], np.expand_dims(next_state, 2), axis=2)

            # If our replay memory is full, pop the first element
            if len(replay_memory) == replay_memory_size:
                replay_memory.pop(0)

            # Save transition to replay memory
            replay_memory.append(Transition(state, action, reward, next_state, done))

            # Update statistics
            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t

            # Sample a minibatch from the replay memory
            samples = random.sample(replay_memory, batch_size)
            states_batch, action_batch, reward_batch, next_states_batch, done_batch = map(np.array, zip(*samples))

            # Calculate q values and targets (Double DQN):
            # the Q network selects the best next action, the target network evaluates it
            q_values_next = q_estimator.predict(sess, next_states_batch)
            best_actions = np.argmax(q_values_next, axis=1)
            q_values_next_target = target_estimator.predict(sess, next_states_batch)
            targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * \
                discount_factor * q_values_next_target[np.arange(batch_size), best_actions]

            # Perform gradient descent update
            states_batch = np.array(states_batch)
            loss = q_estimator.update(sess, states_batch, action_batch, targets_batch)

            if done:
                break

            state = next_state
            total_t += 1

        # Add summaries to tensorboard
        episode_summary = tf.Summary()
        episode_summary.value.add(simple_value=stats.episode_rewards[i_episode], node_name="episode_reward", tag="episode_reward")
        episode_summary.value.add(simple_value=stats.episode_lengths[i_episode], node_name="episode_length", tag="episode_length")
        q_estimator.summary_writer.add_summary(episode_summary, total_t)
        q_estimator.summary_writer.flush()

        yield total_t, plotting.EpisodeStats(
            episode_lengths=stats.episode_lengths[:i_episode+1],
            episode_rewards=stats.episode_rewards[:i_episode+1])

    # env is the Monitor wrapper at this point, so close it directly
    env.close()
    return stats

tf.reset_default_graph()
# Where we save our checkpoints and graphs
experiment_dir = os.path.abspath("./experiments/{}".format(env.spec.id))
# Create a global step variable
global_step = tf.Variable(0, name='global_step', trainable=False)
# Create estimators
q_estimator = Estimator(scope="q", summaries_dir=experiment_dir)
target_estimator = Estimator(scope="target_q")
# State processor
state_processor
gitextract_nce2oqyi/
├── .gitignore
├── DP/
│ ├── Gamblers Problem Solution.ipynb
│ ├── Gamblers Problem.ipynb
│ ├── Policy Evaluation Solution.ipynb
│ ├── Policy Evaluation.ipynb
│ ├── Policy Iteration Solution.ipynb
│ ├── Policy Iteration.ipynb
│ ├── README.md
│ ├── Value Iteration Solution.ipynb
│ └── Value Iteration.ipynb
├── DQN/
│ ├── .gitignore
│ ├── Breakout Playground.ipynb
│ ├── Deep Q Learning Solution.ipynb
│ ├── Deep Q Learning.ipynb
│ ├── Double DQN Solution.ipynb
│ ├── README.md
│ └── dqn.py
├── FA/
│ ├── MountainCar Playground.ipynb
│ ├── Q-Learning with Value Function Approximation Solution.ipynb
│ ├── Q-Learning with Value Function Approximation.ipynb
│ └── README.md
├── Introduction/
│ └── README.md
├── LICENSE
├── MC/
│ ├── Blackjack Playground.ipynb
│ ├── MC Control with Epsilon-Greedy Policies Solution.ipynb
│ ├── MC Control with Epsilon-Greedy Policies.ipynb
│ ├── MC Prediction Solution.ipynb
│ ├── MC Prediction.ipynb
│ ├── Off-Policy MC Control with Weighted Importance Sampling Solution.ipynb
│ ├── Off-Policy MC Control with Weighted Importance Sampling.ipynb
│ └── README.md
├── MDP/
│ └── README.md
├── PolicyGradient/
│ ├── CliffWalk Actor Critic Solution.ipynb
│ ├── CliffWalk REINFORCE with Baseline Solution.ipynb
│ ├── Continuous MountainCar Actor Critic Solution.ipynb
│ ├── README.md
│ └── a3c/
│ ├── README.md
│ ├── estimator_test.py
│ ├── estimators.py
│ ├── policy_monitor.py
│ ├── policy_monitor_test.py
│ ├── train.py
│ ├── worker.py
│ └── worker_test.py
├── README.md
├── TD/
│ ├── Cliff Environment Playground.ipynb
│ ├── Q-Learning Solution.ipynb
│ ├── Q-Learning.ipynb
│ ├── README.md
│ ├── SARSA Solution.ipynb
│ ├── SARSA.ipynb
│ └── Windy Gridworld Playground.ipynb
├── __init__.py
└── lib/
├── __init__.py
├── atari/
│ ├── __init__.py
│ ├── helpers.py
│ └── state_processor.py
├── envs/
│ ├── __init__.py
│ ├── blackjack.py
│ ├── cliff_walking.py
│ ├── discrete.py
│ ├── gridworld.py
│ └── windy_gridworld.py
└── plotting.py
SYMBOL INDEX (96 symbols across 16 files)
FILE: DQN/dqn.py
class StateProcessor (line 21) | class StateProcessor():
method __init__ (line 25) | def __init__(self):
method process (line 35) | def process(self, sess, state):
class Estimator (line 46) | class Estimator():
method __init__ (line 52) | def __init__(self, scope="estimator", summaries_dir=None):
method _build_model (line 65) | def _build_model(self):
method predict (line 115) | def predict(self, sess, s):
method update (line 129) | def update(self, sess, s, a, y):
function copy_model_parameters (line 150) | def copy_model_parameters(sess, estimator1, estimator2):
function make_epsilon_greedy_policy (line 172) | def make_epsilon_greedy_policy(estimator, nA):
function deep_q_learning (line 194) | def deep_q_learning(sess,
FILE: PolicyGradient/a3c/estimator_test.py
function make_env (line 21) | def make_env():
class PolicyEstimatorTest (line 26) | class PolicyEstimatorTest(tf.test.TestCase):
method testPredict (line 27) | def testPredict(self):
method testGradient (line 54) | def testGradient(self):
class ValueEstimatorTest (line 81) | class ValueEstimatorTest(tf.test.TestCase):
method testPredict (line 82) | def testPredict(self):
method testGradient (line 107) | def testGradient(self):
FILE: PolicyGradient/a3c/estimators.py
function build_shared_network (line 4) | def build_shared_network(X, add_summaries=False):
class PolicyEstimator (line 36) | class PolicyEstimator():
method __init__ (line 49) | def __init__(self, num_outputs, reuse=False, trainable=True):
class ValueEstimator (line 109) | class ValueEstimator():
method __init__ (line 120) | def __init__(self, reuse=False, trainable=True):
FILE: PolicyGradient/a3c/policy_monitor.py
class PolicyMonitor (line 25) | class PolicyMonitor(object):
method __init__ (line 35) | def __init__(self, env, policy_net, summary_writer, saver=None):
method _policy_net_predict (line 62) | def _policy_net_predict(self, state, sess):
method eval_once (line 67) | def eval_once(self, sess):
method continuous_eval (line 100) | def continuous_eval(self, eval_every, sess, coord):
FILE: PolicyGradient/a3c/policy_monitor_test.py
function make_env (line 24) | def make_env():
class PolicyMonitorTest (line 29) | class PolicyMonitorTest(tf.test.TestCase):
method setUp (line 30) | def setUp(self):
method testEvalOnce (line 41) | def testEvalOnce(self):
FILE: PolicyGradient/a3c/train.py
function make_env (line 37) | def make_env(wrap=True):
FILE: PolicyGradient/a3c/worker.py
function make_copy_params_op (line 24) | def make_copy_params_op(v1_list, v2_list):
function make_train_op (line 39) | def make_train_op(local_estimator, global_estimator):
class Worker (line 53) | class Worker(object):
method __init__ (line 67) | def __init__(self, name, env, policy_net, value_net, global_counter, d...
method run (line 95) | def run(self, sess, coord, t_max):
method _policy_net_predict (line 118) | def _policy_net_predict(self, state, sess):
method _value_net_predict (line 123) | def _value_net_predict(self, state, sess):
method run_n_steps (line 128) | def run_n_steps(self, n, sess):
method update (line 155) | def update(self, transitions, sess):
FILE: PolicyGradient/a3c/worker_test.py
function make_env (line 23) | def make_env():
class WorkerTest (line 28) | class WorkerTest(tf.test.TestCase):
method setUp (line 29) | def setUp(self):
method testPolicyNetPredict (line 42) | def testPolicyNetPredict(self):
method testValueNetPredict (line 59) | def testValueNetPredict(self):
method testRunNStepsAndUpdate (line 75) | def testRunNStepsAndUpdate(self):
FILE: lib/atari/helpers.py
class AtariEnvWrapper (line 3) | class AtariEnvWrapper(object):
method __init__ (line 7) | def __init__(self, env):
method __getattr__ (line 10) | def __getattr__(self, name):
method step (line 13) | def step(self, *args, **kwargs):
function atari_make_initial_state (line 27) | def atari_make_initial_state(state):
function atari_make_next_state (line 30) | def atari_make_next_state(state, next_state):
FILE: lib/atari/state_processor.py
class StateProcessor (line 4) | class StateProcessor():
method __init__ (line 8) | def __init__(self):
method process (line 18) | def process(self, state, sess=None):
FILE: lib/envs/blackjack.py
function cmp (line 5) | def cmp(a, b):
function draw_card (line 12) | def draw_card(np_random):
function draw_hand (line 16) | def draw_hand(np_random):
function usable_ace (line 20) | def usable_ace(hand): # Does this hand have a usable ace?
function sum_hand (line 24) | def sum_hand(hand): # Return current hand total
function is_bust (line 30) | def is_bust(hand): # Is this hand a bust?
function score (line 34) | def score(hand): # What is the score of this hand (0 if bust)
function is_natural (line 38) | def is_natural(hand): # Is this hand a natural blackjack?
class BlackjackEnv (line 42) | class BlackjackEnv(gym.Env):
method __init__ (line 67) | def __init__(self, natural=False):
method reset (line 82) | def reset(self):
method step (line 85) | def step(self, action):
method _seed (line 88) | def _seed(self, seed=None):
method _step (line 92) | def _step(self, action):
method _get_obs (line 111) | def _get_obs(self):
method _reset (line 114) | def _reset(self):
FILE: lib/envs/cliff_walking.py
class CliffWalkingEnv (line 12) | class CliffWalkingEnv(discrete.DiscreteEnv):
method _limit_coordinates (line 16) | def _limit_coordinates(self, coord):
method _calculate_transition_prob (line 23) | def _calculate_transition_prob(self, current, delta):
method __init__ (line 31) | def __init__(self):
method render (line 57) | def render(self, mode='human', close=False):
method _render (line 60) | def _render(self, mode='human', close=False):
FILE: lib/envs/discrete.py
class DiscreteEnv (line 7) | class DiscreteEnv(Env):
method __init__ (line 23) | def __init__(self, nS, nA, P, isd):
method seed (line 36) | def seed(self, seed=None):
method reset (line 40) | def reset(self):
method step (line 45) | def step(self, a):
FILE: lib/envs/gridworld.py
class GridworldEnv (line 12) | class GridworldEnv(discrete.DiscreteEnv):
method __init__ (line 34) | def __init__(self, shape=[4,4]):
method _render (line 88) | def _render(self, mode='human', close=False):
FILE: lib/envs/windy_gridworld.py
class WindyGridworldEnv (line 13) | class WindyGridworldEnv(discrete.DiscreteEnv):
method _limit_coordinates (line 17) | def _limit_coordinates(self, coord):
method _calculate_transition_prob (line 24) | def _calculate_transition_prob(self, current, delta, winds):
method __init__ (line 31) | def __init__(self):
method render (line 58) | def render(self, mode='human', close=False):
method _render (line 61) | def _render(self, mode='human', close=False):
FILE: lib/plotting.py
function plot_cost_to_go_mountain_car (line 10) | def plot_cost_to_go_mountain_car(env, estimator, num_tiles=20):
function plot_value_function (line 28) | def plot_value_function(V, title="Value Function"):
function plot_episode_stats (line 63) | def plot_episode_stats(stats, smoothing_window=10, noshow=False):
Condensed preview — 64 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (2,372K chars).
[
{
"path": ".gitignore",
"chars": 1130,
"preview": "### Python ###\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distrib"
},
{
"path": "DP/Gamblers Problem Solution.ipynb",
"chars": 37086,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {\n \"collapsed\": true\n },\n \"source\": [\n \"### This "
},
{
"path": "DP/Gamblers Problem.ipynb",
"chars": 4388,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {\n \"collapsed\": true\n },\n \"source\": [\n \"### This "
},
{
"path": "DP/Policy Evaluation Solution.ipynb",
"chars": 4830,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n "
},
{
"path": "DP/Policy Evaluation.ipynb",
"chars": 7611,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 23,\n \"metadata\": {\n \"collapsed\": true\n },\n \"out"
},
{
"path": "DP/Policy Iteration Solution.ipynb",
"chars": 8217,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n "
},
{
"path": "DP/Policy Iteration.ipynb",
"chars": 11307,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 5,\n \"metadata\": {\n \"collapsed\": true\n },\n \"outp"
},
{
"path": "DP/README.md",
"chars": 2715,
"preview": "## Model-Based RL: Policy and Value Iteration using Dynamic Programming\n\n### Learning Goals\n\n- Understand the difference"
},
{
"path": "DP/Value Iteration Solution.ipynb",
"chars": 6006,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n "
},
{
"path": "DP/Value Iteration.ipynb",
"chars": 8725,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 3,\n \"metadata\": {\n \"collapsed\": true\n },\n \"outp"
},
{
"path": "DQN/.gitignore",
"chars": 12,
"preview": "experiments/"
},
{
"path": "DQN/Breakout Playground.ipynb",
"chars": 20779,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 33,\n \"metadata\": {\n \"collapsed\": true\n },\n \"out"
},
{
"path": "DQN/Deep Q Learning Solution.ipynb",
"chars": 23897,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {\n \"collapsed\": true\n },\n \"o"
},
{
"path": "DQN/Deep Q Learning.ipynb",
"chars": 20968,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {\n \"collapsed\": true\n },\n \"o"
},
{
"path": "DQN/Double DQN Solution.ipynb",
"chars": 21609,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {\n \"collapsed\": true\n },\n \"o"
},
{
"path": "DQN/README.md",
"chars": 2904,
"preview": "## Deep Q-Learning\n\n### Learning Goals\n\n- Understand the Deep Q-Learning (DQN) algorithm\n- Understand why Experience Rep"
},
{
"path": "DQN/dqn.py",
"chars": 16648,
"preview": "import gym\nfrom gym.wrappers import Monitor\nimport itertools\nimport numpy as np\nimport os\nimport random\nimport sys\nimpor"
},
{
"path": "FA/MountainCar Playground.ipynb",
"chars": 30004,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 7,\n \"metadata\": {\n \"collapsed\": true\n },\n \"outp"
},
{
"path": "FA/Q-Learning with Value Function Approximation Solution.ipynb",
"chars": 191929,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n "
},
{
"path": "FA/Q-Learning with Value Function Approximation.ipynb",
"chars": 131946,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {\n \"collapsed\": true\n },\n \"outp"
},
{
"path": "FA/README.md",
"chars": 2687,
"preview": "## Function Approximation\n\n### Learning Goals\n\n- Understand the motivation for Function Approximation over Table Lookup\n"
},
{
"path": "Introduction/README.md",
"chars": 1205,
"preview": "## Introduction\n\n### Learning Goals\n\n- Understand the Reinforcement Learning problem and how it differs from Supervised "
},
{
"path": "LICENSE",
"chars": 1067,
"preview": "MIT License\n\nCopyright (c) 2016 Denny Britz\n\nPermission is hereby granted, free of charge, to any person obtaining a cop"
},
{
"path": "MC/Blackjack Playground.ipynb",
"chars": 7679,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n "
},
{
"path": "MC/MC Control with Epsilon-Greedy Policies Solution.ipynb",
"chars": 258184,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {\n \"collapsed\": true\n },\n \"outp"
},
{
"path": "MC/MC Control with Epsilon-Greedy Policies.ipynb",
"chars": 4727,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {\n \"collapsed\": true\n },\n \"o"
},
{
"path": "MC/MC Prediction Solution.ipynb",
"chars": 520699,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {\n \"collapsed\": true\n },\n \"outp"
},
{
"path": "MC/MC Prediction.ipynb",
"chars": 3485,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {\n \"collapsed\": true\n },\n \"o"
},
{
"path": "MC/Off-Policy MC Control with Weighted Importance Sampling Solution.ipynb",
"chars": 290864,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 2,\n \"metadata\": {\n \"collapsed\": true\n },\n \"outp"
},
{
"path": "MC/Off-Policy MC Control with Weighted Importance Sampling.ipynb",
"chars": 4989,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {\n \"collapsed\": true\n },\n \"o"
},
{
"path": "MC/README.md",
"chars": 3472,
"preview": "## Model-Free Prediction & Control with Monte Carlo (MC)\n\n\n### Learning Goals\n\n- Understand the difference between Predi"
},
{
"path": "MDP/README.md",
"chars": 3005,
"preview": "## MDPs and Bellman Equations\n\n### Learning Goals\n\n- Understand the Agent-Environment interface\n- Understand what MDPs ("
},
{
"path": "PolicyGradient/CliffWalk Actor Critic Solution.ipynb",
"chars": 100689,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 16,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n"
},
{
"path": "PolicyGradient/CliffWalk REINFORCE with Baseline Solution.ipynb",
"chars": 106599,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 7,\n \"metadata\": {\n \"collapsed\": false\n },\n \"out"
},
{
"path": "PolicyGradient/Continuous MountainCar Actor Critic Solution.ipynb",
"chars": 14097,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 2,\n \"metadata\": {\n \"collapsed\": true\n },\n \"outp"
},
{
"path": "PolicyGradient/README.md",
"chars": 4676,
"preview": "## Policy Gradient Methods\n\n\n### Learning Goals\n\n- Understand the difference between value-based and policy-based Reinfo"
},
{
"path": "PolicyGradient/a3c/README.md",
"chars": 741,
"preview": "## Implementation of A3C (Asynchronous Advantage Actor-Critic)\n\n#### Running\n\n```\n./train.py --model_dir /tmp/a3c --env "
},
{
"path": "PolicyGradient/a3c/estimator_test.py",
"chars": 3873,
"preview": "import unittest\nimport gym\nimport sys\nimport os\nimport numpy as np\nimport tensorflow as tf\n\nfrom inspect import getsourc"
},
{
"path": "PolicyGradient/a3c/estimators.py",
"chars": 6930,
"preview": "import numpy as np\nimport tensorflow as tf\n\ndef build_shared_network(X, add_summaries=False):\n \"\"\"\n Builds a 3-layer n"
},
{
"path": "PolicyGradient/a3c/policy_monitor.py",
"chars": 3906,
"preview": "import sys\nimport os\nimport itertools\nimport collections\nimport numpy as np\nimport tensorflow as tf\nimport time\n\nfrom in"
},
{
"path": "PolicyGradient/a3c/policy_monitor_test.py",
"chars": 1520,
"preview": "import gym\nimport sys\nimport os\nimport itertools\nimport collections\nimport unittest\nimport numpy as np\nimport tensorflow"
},
{
"path": "PolicyGradient/a3c/train.py",
"chars": 4436,
"preview": "#! /usr/bin/env python\n\nimport unittest\nimport gym\nimport sys\nimport os\nimport numpy as np\nimport tensorflow as tf\nimpor"
},
{
"path": "PolicyGradient/a3c/worker.py",
"chars": 7546,
"preview": "import gym\nimport sys\nimport os\nimport itertools\nimport collections\nimport numpy as np\nimport tensorflow as tf\n\nfrom ins"
},
{
"path": "PolicyGradient/a3c/worker_test.py",
"chars": 3214,
"preview": "import gym\nimport sys\nimport os\nimport itertools\nimport collections\nimport unittest\nimport numpy as np\nimport tensorflow"
},
{
"path": "README.md",
"chars": 5963,
"preview": "### Overview\n\nThis repository provides code, exercises and solutions for popular Reinforcement Learning algorithms. Thes"
},
{
"path": "TD/Cliff Environment Playground.ipynb",
"chars": 2595,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n "
},
{
"path": "TD/Q-Learning Solution.ipynb",
"chars": 137512,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": "
},
{
"path": "TD/Q-Learning.ipynb",
"chars": 67444,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 3,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n "
},
{
"path": "TD/README.md",
"chars": 2770,
"preview": "## Model-Free Prediction & Control with Temporal Difference (TD) and Q-Learning\n\n\n### Learning Goals\n\n- Understand TD(0)"
},
{
"path": "TD/SARSA Solution.ipynb",
"chars": 92754,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 19,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n"
},
{
"path": "TD/SARSA.ipynb",
"chars": 65342,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 11,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n"
},
{
"path": "TD/Windy Gridworld Playground.ipynb",
"chars": 3872,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n "
},
{
"path": "__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "lib/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "lib/atari/__init__.py",
"chars": 1,
"preview": "\n"
},
{
"path": "lib/atari/helpers.py",
"chars": 829,
"preview": "import numpy as np\n\nclass AtariEnvWrapper(object):\n \"\"\"\n Wraps an Atari environment to end an episode when a life is l"
},
{
"path": "lib/atari/state_processor.py",
"chars": 1077,
"preview": "import numpy as np\nimport tensorflow as tf\n\nclass StateProcessor():\n \"\"\"\n Processes a raw Atari iamges. Resizes it"
},
{
"path": "lib/envs/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "lib/envs/blackjack.py",
"chars": 4251,
"preview": "import gym\nfrom gym import spaces\nfrom gym.utils import seeding\n\ndef cmp(a, b):\n return int((a > b)) - int((a < b))\n\n"
},
{
"path": "lib/envs/cliff_walking.py",
"chars": 2682,
"preview": "import io\nimport numpy as np\nimport sys\n\nfrom . import discrete\n\nUP = 0\nRIGHT = 1\nDOWN = 2\nLEFT = 3\n\nclass CliffWalkingE"
},
{
"path": "lib/envs/discrete.py",
"chars": 1350,
"preview": "import numpy as np\n\nfrom gym import Env, spaces\nfrom gym.utils import seeding\nfrom gym.envs.toy_text.utils import catego"
},
{
"path": "lib/envs/gridworld.py",
"chars": 3835,
"preview": "import io\nimport numpy as np\nimport sys\n\nfrom . import discrete\n\nUP = 0\nRIGHT = 1\nDOWN = 2\nLEFT = 3\n\nclass GridworldEnv("
},
{
"path": "lib/envs/windy_gridworld.py",
"chars": 2592,
"preview": "import io\nimport gym\nimport numpy as np\nimport sys\n\nfrom . import discrete\n\nUP = 0\nRIGHT = 1\nDOWN = 2\nLEFT = 3\n\nclass Wi"
},
{
"path": "lib/plotting.py",
"chars": 3457,
"preview": "import matplotlib\nimport numpy as np\nimport pandas as pd\nfrom collections import namedtuple\nfrom matplotlib import pyplo"
}
]
About this extraction
This page contains the full source code of the dennybritz/reinforcement-learning GitHub repository, extracted and formatted as plain text. The extraction includes 64 files (2.2 MB), approximately 580.8k tokens, and a symbol index with 96 extracted functions, classes, methods, constants, and types.