Repository: PacktPublishing/Deep-Reinforcement-Learning-with-Python
Branch: master
Commit: 9bfa97e1b1f7
Files: 129
Total size: 1.1 MB

Directory structure:
gitextract_lj237dqn/

├── 01. Fundamentals of Reinforcement Learning/
│   ├── .ipynb_checkpoints/
│   │   ├── 1.01. Basic Idea of Reinforcement Learning -checkpoint.ipynb
│   │   ├── 1.02. Key Elements of Reinforcement Learning -checkpoint.ipynb
│   │   ├── 1.03. Reinforcement Learning Algorithm-checkpoint.ipynb
│   │   ├── 1.04. RL agent in the Grid World -checkpoint.ipynb
│   │   ├── 1.05. How RL differs from other ML paradigms?-checkpoint.ipynb
│   │   ├── 1.06. Markov Decision Processes-checkpoint.ipynb
│   │   └── 1.07. Action space, Policy, Episode and Horizon-checkpoint.ipynb
│   └── 1.01. Key Elements of Reinforcement Learning .ipynb
├── 02. A Guide to the Gym Toolkit/
│   ├── 2.02.  Creating our First Gym Environment.ipynb
│   ├── 2.04. Classic Control Environments.ipynb
│   ├── 2.05. Cart Pole Balancing with Random Policy.ipynb
│   └── README.md
├── 03. Bellman Equation and Dynamic Programming/
│   ├── .ipynb_checkpoints/
│   │   ├── 3.06. Solving the Frozen Lake Problem with Value Iteration-checkpoint.ipynb
│   │   └── 3.08. Solving the Frozen Lake Problem with Policy Iteration-checkpoint.ipynb
│   ├── 3.06. Solving the Frozen Lake Problem with Value Iteration.ipynb
│   ├── 3.08. Solving the Frozen Lake Problem with Policy Iteration.ipynb
│   └── README.md
├── 04. Monte Carlo Methods/
│   ├── .ipynb_checkpoints/
│   │   ├── 4.01. Understanding the Monte Carlo Method-checkpoint.ipynb
│   │   ├── 4.02.  Prediction and control tasks-checkpoint.ipynb
│   │   ├── 4.05. Every-visit MC Prediction with Blackjack Game-checkpoint.ipynb
│   │   ├── 4.06. First-visit MC Prediction with Blackjack Game-checkpoint.ipynb
│   │   └── 4.13. Implementing On-Policy MC Control-checkpoint.ipynb
│   ├── 4.05. Every-visit MC Prediction with Blackjack Game.ipynb
│   ├── 4.06. First-visit MC Prediction with Blackjack Game.ipynb
│   ├── 4.13. Implementing On-Policy MC Control.ipynb
│   └── README.md
├── 05. Understanding Temporal Difference Learning/
│   ├── .ipynb_checkpoints/
│   │   ├── 5.03. Predicting the Value of States in a Frozen Lake Environment-checkpoint.ipynb
│   │   ├── 5.06. Computing Optimal Policy using SARSA-checkpoint.ipynb
│   │   └── 5.08. Computing the Optimal Policy using Q Learning-checkpoint.ipynb
│   ├── 5.06. Computing Optimal Policy using SARSA.ipynb
│   ├── 5.08. Computing the Optimal Policy using Q Learning.ipynb
│   └── README.md
├── 06. Case Study: The MAB Problem/
│   ├── .ipynb_checkpoints/
│   │   ├── 6.04. Implementing epsilon-greedy -checkpoint.ipynb
│   │   ├── 6.06. Implementing Softmax Exploration-checkpoint.ipynb
│   │   ├── 6.08. Implementing UCB-checkpoint.ipynb
│   │   ├── 6.1-checkpoint.ipynb
│   │   ├── 6.10. Implementing Thompson Sampling-checkpoint.ipynb
│   │   └── 6.12. Finding the Best Advertisement Banner using Bandits-checkpoint.ipynb
│   ├── 6.03. Epsilon-Greedy.ipynb
│   ├── 6.04. Implementing epsilon-greedy .ipynb
│   ├── 6.06. Implementing Softmax Exploration.ipynb
│   ├── 6.08. Implementing UCB.ipynb
│   ├── 6.10. Implementing Thompson Sampling.ipynb
│   ├── 6.12. Finding the Best Advertisement Banner using Bandits.ipynb
│   └── README.md
├── 07. Deep learning foundations/
│   ├── .ipynb_checkpoints/
│   │   └── 7.05 Building Neural Network from scratch-checkpoint.ipynb
│   ├── 7.05 Building Neural Network from scratch.ipynb
│   └── README.md
├── 08. A primer on TensorFlow/
│   ├── .ipynb_checkpoints/
│   │   ├── 8.05 Handwritten digits classification using TensorFlow-checkpoint.ipynb
│   │   └── 8.10 MNIST digits classification in TensorFlow 2.0-checkpoint.ipynb
│   ├── 8.05 Handwritten digits classification using TensorFlow.ipynb
│   ├── 8.08 Math operations in TensorFlow.ipynb
│   ├── 8.10 MNIST digits classification in TensorFlow 2.0.ipynb
│   ├── README.md
│   └── graphs/
│       └── events.out.tfevents.1559122983.ml-dev
├── 09.  Deep Q Network and its Variants/
│   ├── .ipynb_checkpoints/
│   │   ├── 7.03. Playing Atari Games using DQN-Copy1-checkpoint.ipynb
│   │   ├── 7.03. Playing Atari Games using DQN-checkpoint.ipynb
│   │   └── 9.03. Playing Atari Games using DQN-checkpoint.ipynb
│   ├── 9.03. Playing Atari Games using DQN.ipynb
│   └── READEME.md
├── 10. Policy Gradient Method/
│   ├── .ipynb_checkpoints/
│   │   ├── 10.01. Why Policy based Methods-checkpoint.ipynb
│   │   ├── 10.02. Policy Gradient Intuition-checkpoint.ipynb
│   │   ├── 10.07. Cart Pole Balancing with Policy Gradient-checkpoint.ipynb
│   │   └── 8.07. Cart Pole Balancing with Policy Gradient-checkpoint.ipynb
│   ├── 10.07. Cart Pole Balancing with Policy Gradient.ipynb
│   └── README.md
├── 11. Actor Critic Methods - A2C and A3C/
│   ├── .ipynb_checkpoints/
│   │   ├── 11.01. Overview of actor critic method-checkpoint.ipynb
│   │   ├── 11.05. Mountain Car Climbing using A3C-checkpoint.ipynb
│   │   └── 9.05. Mountain Car Climbing using A3C-checkpoint.ipynb
│   ├── 11.05. Mountain Car Climbing using A3C.ipynb
│   ├── README.md
│   └── logs/
│       └── events.out.tfevents.1596718791.Sudharsan
├── 12. Learning DDPG, TD3 and SAC/
│   ├── .ipynb_checkpoints/
│   │   ├── 10.02. Swinging Up the Pendulum using DDPG -checkpoint.ipynb
│   │   ├── 12.01. DDPG-checkpoint.ipynb
│   │   ├── 12.02. Swinging Up the Pendulum using DDPG -checkpoint.ipynb
│   │   ├── 12.03. Twin delayed DDPG-checkpoint.ipynb
│   │   └── Swinging up the pendulum using DDPG -checkpoint.ipynb
│   ├── 12.05. Swinging Up the Pendulum using DDPG .ipynb
│   └── README.md
├── 13. TRPO, PPO and ACKTR Methods/
│   ├── .ipynb_checkpoints/
│   │   ├──  Implementing PPO-clipped method-checkpoint.ipynb
│   │   ├── 11.09. Implementing PPO-Clipped Method-checkpoint.ipynb
│   │   ├── 13.01. Trust Region Policy Optimization-checkpoint.ipynb
│   │   └── 13.09. Implementing PPO-Clipped Method-checkpoint.ipynb
│   ├── 13.09. Implementing PPO-Clipped Method.ipynb
│   └── README.md
├── 14. Distributional Reinforcement Learning/
│   ├── .ipynb_checkpoints/
│   │   ├── 12.03. Playing Atari games using Categorical DQN-checkpoint.ipynb
│   │   ├── 14.03. Playing Atari games using Categorical DQN-checkpoint.ipynb
│   │   ├── Playing Atari games using Categorical DQN-checkpoint.ipynb
│   │   └── c51 done-Copy1-checkpoint.ipynb
│   ├── 14.03. Playing Atari games using Categorical DQN.ipynb
│   └── README.md
├── 15. Imitation Learning and Inverse RL/
│   ├── .ipynb_checkpoints/
│   │   ├── 13.01. Supervised Imitation Learning -checkpoint.ipynb
│   │   └── 13.02. DAgger-checkpoint.ipynb
│   └── README.md
├── 16. Deep Reinforcement Learning with Stable Baselines/
│   ├── .ipynb_checkpoints/
│   │   ├── 14.01. Creating our First Agent with Baseline-checkpoint.ipynb
│   │   ├── 14.04. Playing Atari games with DQN and its variants-checkpoint.ipynb
│   │   ├── 14.05. Implementing DQN variants-checkpoint.ipynb
│   │   ├── 14.06. Lunar Lander using A2C-checkpoint.ipynb
│   │   ├── 14.07. Creating a custom network-checkpoint.ipynb
│   │   ├── 14.08. Swinging up a pendulum using DDPG-checkpoint.ipynb
│   │   ├── 16.01. Creating our First Agent with Stable Baseline-checkpoint.ipynb
│   │   ├── 16.04. Playing Atari games with DQN and its variants-checkpoint.ipynb
│   │   ├── 16.05. Implementing DQN variants-checkpoint.ipynb
│   │   ├── 16.06. Lunar Lander using A2C-checkpoint.ipynb
│   │   ├── 16.07. Creating a custom network-checkpoint.ipynb
│   │   ├── 16.08. Swinging up a pendulum using DDPG-checkpoint.ipynb
│   │   ├── 16.09. Training an agent to walk using TRPO-checkpoint.ipynb
│   │   ├── 16.10. Training cheetah bot to run using PPO-checkpoint.ipynb
│   │   ├── Creating a custom network-checkpoint.ipynb
│   │   ├── Implementing DQN variants-checkpoint.ipynb
│   │   ├── Lunar Lander using A2C-checkpoint.ipynb
│   │   ├── Playing Atari games with DQN and its variants-checkpoint.ipynb
│   │   ├── Swinging up a pendulum using DDPG-checkpoint.ipynb
│   │   ├── Training an agent to walk using TRPO-checkpoint.ipynb
│   │   ├── Training cheetah bot to run using PPO-checkpoint.ipynb
│   │   └── Untitled-checkpoint.ipynb
│   ├── 16.01. Creating our First Agent with Stable Baseline.ipynb
│   ├── 16.04. Playing Atari games with DQN and its variants.ipynb
│   ├── 16.05. Implementing DQN variants.ipynb
│   ├── 16.06. Lunar Lander using A2C.ipynb
│   ├── 16.07. Creating a custom network.ipynb
│   ├── 16.08. Swinging up a pendulum using DDPG.ipynb
│   ├── 16.09. Training an agent to walk using TRPO.ipynb
│   ├── 16.10. Training cheetah bot to run using PPO.ipynb
│   ├── README.md
│   └── logs/
│       └── DDPG_1/
│           └── events.out.tfevents.1582974711.Sudharsan
├── 17. Reinforcement Learning Frontiers/
│   ├── .ipynb_checkpoints/
│   │   └── 15.01. Meta Reinforcement Learning-checkpoint.ipynb
│   └── README.md
└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.01. Basic Idea of Reinforcement Learning -checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "\n",
    "Reinforcement Learning (RL) is one of the areas of Machine Learning (ML). Unlike\n",
    "other ML paradigms, such as supervised and unsupervised learning, RL works in a\n",
    "trial and error fashion by interacting with its environment.\n",
    "\n",
    "RL is one of the most active areas of research in artificial intelligence, and it is\n",
    "believed that RL will take us a step closer towards achieving artificial general\n",
    "intelligence. RL has evolved rapidly in the past few years with a wide variety of\n",
    "applications ranging from building a recommendation system to self-driving cars.\n",
    "The major reason for this evolution is the advent of deep reinforcement learning,\n",
    "which is a combination of deep learning and RL. With the emergence of new RL\n",
    "algorithms and libraries, RL is clearly one of the most promising areas of ML.\n",
    "\n",
    "In this chapter, we will build a strong foundation in RL by exploring several\n",
    "important and fundamental concepts involved in RL. In this chapter, we will learn about the following topics:\n",
    "\n",
    "* Key elements of RL\n",
    "* The basic idea of RL\n",
    "* The RL algorithm\n",
    "* How RL differs from other ML paradigms\n",
    "* The Markov Decision Processes\n",
    "* Fundamental concepts of RL\n",
    "* Applications of RL\n",
    "* RL glossary\n",
    "\n",
    "We will begin the chapter by understanding Key elements of RL. This will help us understand the\n",
    "basic idea of RL."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Key Elements of Reinforcement Learning \n",
    "\n",
    "Let's begin by understanding some key elements of RL.\n",
    "\n",
    "## Agent \n",
    "\n",
    "An agent is a software program that learns to make intelligent decisions. We can\n",
    "say that an agent is a learner in the RL setting. For instance, a chess player can be\n",
    "considered an agent since the player learns to make the best moves (decisions) to win\n",
    "the game. Similarly, Mario in a Super Mario Bros video game can be considered an\n",
    "agent since Mario explores the game and learns to make the best moves in the game.\n",
    "\n",
    "\n",
    "## Environment \n",
    "The environment is the world of the agent. The agent stays within the environment.\n",
    "For instance, coming back to our chess game, a chessboard is called the environment\n",
    "since the chess player (agent) learns to play the game of chess within the chessboard\n",
    "(environment). Similarly, in Super Mario Bros, the world of Mario is called the\n",
    "environment.\n",
    "\n",
    "## State and action\n",
    "A state is a position or a moment in the environment that the agent can be in. We\n",
    "learned that the agent stays within the environment, and there can be many positions\n",
    "in the environment that the agent can stay in, and those positions are called states.\n",
    "For instance, in our chess game example, each position on the chessboard is called\n",
    "the state. The state is usually denoted by s.\n",
    "\n",
    "The agent interacts with the environment and moves from one state to another\n",
    "by performing an action. In the chess game environment, the action is the move\n",
    "performed by the player (agent). The action is usually denoted by a.\n",
    "\n",
    "\n",
    "## Reward\n",
    "\n",
    "We learned that the agent interacts with an environment by performing an action\n",
    "and moves from one state to another. Based on the action, the agent receives a\n",
    "reward. A reward is nothing but a numerical value, say, +1 for a good action and -1\n",
    "for a bad action. How do we decide if an action is good or bad?\n",
    "In our chess game example, if the agent makes a move in which it takes one of the\n",
    "opponent's chess pieces, then it is considered a good action and the agent receives\n",
    "a positive reward. Similarly, if the agent makes a move that leads to the opponent\n",
    "taking the agent's chess piece, then it is considered a bad action and the agent\n",
    "receives a negative reward. The reward is denoted by r.\n",
    "\n",
    "\n",
    "In the next section, let us explore basic idea of reinforcement learning. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.02. Key Elements of Reinforcement Learning -checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic Idea of Reinforcement Learning \n",
    "\n",
    "\n",
    "\n",
    "Let's begin with an analogy. Let's suppose we are teaching a dog (agent) to catch a\n",
    "ball. Instead of teaching the dog explicitly to catch a ball, we just throw a ball and\n",
    "every time the dog catches the ball, we give the dog a cookie (reward). If the dog\n",
    "fails to catch the ball, then we do not give it a cookie. So, the dog will figure out\n",
    "what action caused it to receive a cookie and repeat that action. Thus, the dog will\n",
    "understand that catching the ball caused it to receive a cookie and will attempt to\n",
    "repeat catching the ball. Thus, in this way, the dog will learn to catch a ball while\n",
    "aiming to maximize the cookies it can receive.\n",
    "\n",
    "Similarly, in an RL setting, we will not teach the agent what to do or how to do it;\n",
    "instead, we will give a reward to the agent for every action it does. We will give\n",
    "a positive reward to the agent when it performs a good action and we will give a\n",
    "negative reward to the agent when it performs a bad action. The agent begins by\n",
    "performing a random action and if the action is good, we then give the agent a\n",
    "positive reward so that the agent understands it has performed a good action and it\n",
    "will repeat that action. If the action performed by the agent is bad, then we will give\n",
    "the agent a negative reward so that the agent will understand it has performed a bad\n",
    "action and it will not repeat that action.\n",
    "\n",
    "Thus, RL can be viewed as a trial and error learning process where the agent tries out\n",
    "different actions and learns the good action, which gives a positive reward.\n",
    "\n",
    "In the dog analogy, the dog represents the agent, and giving a cookie to the dog\n",
    "upon it catching the ball is a positive reward and not giving a cookie is a negative\n",
    "reward. So, the dog (agent) explores different actions, which are catching the ball\n",
    "and not catching the ball, and understands that catching the ball is a good action as it\n",
    "brings the dog a positive reward (getting a cookie).\n",
    "\n",
    "\n",
    "Let's further explore the idea of RL with one more simple example. Let's suppose we\n",
    "want to teach a robot (agent) to walk without hitting a mountain, as the following figure shows: \n",
    "\n",
    "![title](Images/1.png)\n",
    "\n",
    "We will not teach the robot explicitly to not go in the direction of the mountain.\n",
    "Instead, if the robot hits the mountain and gets stuck, we give the robot a negative\n",
    "reward, say -1. So, the robot will understand that hitting the mountain is the wrong\n",
    "action, and it will not repeat that action:\n",
    "\n",
    "\n",
    "![title](Images/2.png)\n",
    "\n",
    "Similarly, when the robot walks in the right direction without hitting the mountain,\n",
    "we give the robot a positive reward, say +1. So, the robot will understand that not\n",
    "hitting the mountain is a good action, and it will repeat that action:\n",
    "\n",
    "![title](Images/3.png)\n",
    "\n",
    "Thus, in the RL setting, the agent explores different actions and learns the best action\n",
    "based on the reward it gets.\n",
    "Now that we have a basic idea of how RL works, in the upcoming sections, we will\n",
    "go into more detail and also learn the important concepts involved in RL."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.03. Reinforcement Learning Algorithm-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reinforcement Learning algorithm\n",
    "\n",
    "The steps involved in a typical RL algorithm are as follows:\n",
    "\n",
    "1. First, the agent interacts with the environment by performing an action.\n",
    "2. By performing an action, the agent moves from one state to another.\n",
    "3. Then the agent will receive a reward based on the action it performed.\n",
    "4. Based on the reward, the agent will understand whether the action is good or bad.\n",
    "5. If the action was good, that is, if the agent received a positive reward, then the agent will prefer performing that action, else the agent will try performing other actions in search of a positive reward.\n",
    "\n",
    "RL is basically a trial and error learning process. Now, let's revisit our chess game\n",
    "example. The agent (software program) is the chess player. So, the agent interacts\n",
    "with the environment (chessboard) by performing an action (moves). If the agent\n",
    "gets a positive reward for an action, then it will prefer performing that action; else it\n",
    "will find a different action that gives a positive reward.\n",
    "\n",
    "Ultimately, the goal of the agent is to maximize the reward it gets. If the agent\n",
    "receives a good reward, then it means it has performed a good action. If the agent\n",
    "performs a good action, then it implies that it can win the game. Thus, the agent\n",
    "learns to win the game by maximizing the reward."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.04. RL agent in the Grid World -checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RL agent in the Grid World \n",
    "\n",
    "Let's strengthen our understanding of RL by looking at another simple example.\n",
    "Consider the following grid world environment:\n",
    "\n",
    "![title](Images/4.png)\n",
    "\n",
    "The positions A to I in the environment are called the states of the environment.\n",
    "The goal of the agent is to reach state I by starting from state A without visiting\n",
    "the shaded states (B, C, G, and H). Thus, in order to achieve the goal, whenever\n",
    "our agent visits a shaded state, we will give a negative reward (say -1) and when it\n",
    "visits an unshaded state, we will give a positive reward (say +1). The actions in the\n",
    "environment are moving up, down, right and left. The agent can perform any of these\n",
    "four actions to reach state I from state A.\n",
    "\n",
    "The first time the agent interacts with the environment (the first iteration), the agent\n",
    "is unlikely to perform the correct action in each state, and thus it receives a negative\n",
    "reward. That is, in the first iteration, the agent performs a random action in each\n",
    "state, and this may lead the agent to receive a negative reward. But over a series of\n",
    "iterations, the agent learns to perform the correct action in each state through the\n",
    "reward it obtains, helping it achieve the goal. Let us explore this in detail.\n",
    "\n",
    "## Iteration 1:\n",
    "\n",
    "As we learned, in the first iteration, the agent performs a random action in each state.\n",
    "For instance, look at the following figure. In the first iteration, the agent moves right\n",
    "from state A and reaches the new state B. But since B is the shaded state, the agent\n",
    "will receive a negative reward and so the agent will understand that moving right is\n",
    "not a good action in state A. When it visits state A next time, it will try out a different\n",
    "action instead of moving right:\n",
    "\n",
    "![title](Images/5.PNG)\n",
    "\n",
    "As the avove figure shows, from state B, the agent moves down and reaches the new state\n",
    "E. Since E is an unshaded state, the agent will receive a positive reward, so the agent\n",
    "will understand that moving down from state B is a good action.\n",
    "\n",
    "From state E, the agent moves right and reaches state F. Since F is an unshaded state,\n",
    "the agent receives a positive reward, and it will understand that moving right from\n",
    "state E is a good action. From state F, the agent moves down and reaches the goal\n",
    "state I and receives a positive reward, so the agent will understand that moving\n",
    "down from state F is a good action.\n",
    "\n",
    "\n",
    "## Iteration 2:\n",
    "\n",
    "In the second iteration, from state A, instead of moving right, the agent tries out a\n",
    "different action as the agent learned in the previous iteration that moving right is not\n",
    "a good action in state A.\n",
    "\n",
    "Thus, as the following figure shows, in this iteration the agent moves down from state A and\n",
    "reaches state D. Since D is an unshaded state, the agent receives a positive reward\n",
    "and now the agent will understand that moving down is a good action in state A:\n",
    "\n",
    "\n",
    "![title](Images/6.PNG)\n",
    "\n",
    "As shown in the preceding figure, from state D, the agent moves down and reaches\n",
    "state G. But since G is a shaded state, the agent will receive a negative reward and\n",
    "so the agent will understand that moving down is not a good action in state D, and\n",
    "when it visits state D next time, it will try out a different action instead of moving\n",
    "down.\n",
    "\n",
    "From G, the agent moves right and reaches state H. Since H is a shaded state, it will\n",
    "receive a negative reward and understand that moving right is not a good action in\n",
    "state G.\n",
    "\n",
    "From H it moves right and reaches the goal state I and receives a positive reward, so\n",
    "the agent will understand that moving right from state H is a good action.\n",
    "\n",
    "\n",
    "## Iteration 3:\n",
    "\n",
    "In the third iteration, the agent moves down from state A since, in the second\n",
    "iteration, our agent learned that moving down is a good action in state A. So, the\n",
    "agent moves down from state A and reaches the next state, D, as the following figure shows:\n",
    "\n",
    "![title](Images/7.PNG)\n",
    "\n",
    "Now, from state D, the agent tries a different action instead of moving down since in\n",
    "the second iteration our agent learned that moving down is not a good action in state\n",
    "D. So, in this iteration, the agent moves right from state D and reaches state E.\n",
    "\n",
    "From state E, the agent moves right as the agent already learned in the first iteration\n",
    "that moving right from state E is a good action and reaches state F.\n",
    "\n",
    "Now, from state F, the agent moves down since the agent learned in the first iteration\n",
    "that moving down is a good action in state F, and reaches the goal state I.\n",
    "\n",
    "The following figure shows the result of the third iteration:\n",
    "![title](Images/7.PNG)\n",
    "\n",
    "As we can see, our agent has successfully learned to reach the goal state I from state\n",
    "A without visiting the shaded states based on the rewards.\n",
    "\n",
    "In this way, the agent will try out different actions in each state and understand\n",
    "whether an action is good or bad based on the reward it obtains. The goal of the\n",
    "agent is to maximize rewards. So, the agent will always try to perform good actions\n",
    "that give a positive reward, and when the agent performs good actions in each state,\n",
    "then it ultimately leads the agent to achieve the goal.\n",
    "\n",
    "Note that these iterations are called episodes in RL terminology. We will learn more\n",
    "about episodes later in the chapter."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.05. How RL differs from other ML paradigms?-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How RL differs from other ML paradigms?\n",
    "\n",
    "We can categorize ML into three types:\n",
    "* Supervised learning\n",
    "* Unsupervised learning\n",
    "* Reinforcement learning\n",
    "\n",
    "In supervised learning, the machine learns from training data. The training data\n",
    "consists of a labeled pair of inputs and outputs. So, we train the model (agent)\n",
    "using the training data in such a way that the model can generalize its learning to\n",
    "new unseen data. It is called supervised learning because the training data acts as a\n",
    "supervisor, since it has a labeled pair of inputs and outputs, and it guides the model\n",
    "in learning the given task.\n",
    "\n",
    "Now, let's understand the difference between supervised and reinforcement learning\n",
    "with an example. Consider the dog analogy we discussed earlier in the chapter. In\n",
    "supervised learning, to teach the dog to catch a ball, we will teach it explicitly by\n",
    "specifying turn left, go right, move forward seven steps, catch the ball, and so on\n",
    "in the form of training data. But in RL, we just throw a ball, and every time the dog\n",
    "catches the ball, we give it a cookie (reward). So, the dog will learn to catch the ball\n",
    "while trying to maximize the cookies (reward) it can get.\n",
    "\n",
    "Let's consider one more example. Say we want to train the model to play chess using\n",
    "supervised learning. In this case, we will have training data that includes all the\n",
    "moves a player can make in each state, along with labels indicating whether it is a\n",
    "good move or not. Then, we train the model to learn from this training data, whereas\n",
    "in the case of RL, our agent will not be given any sort of training data; instead, we\n",
    "just give a reward to the agent for each action it performs. Then, the agent will learn\n",
    "by interacting with the environment and, based on the reward it gets, it will choose\n",
    "its actions.\n",
    "\n",
    "Similar to supervised learning, in unsupervised learning, we train the model (agent)\n",
    "based on the training data. But in the case of unsupervised learning, the training data\n",
    "does not contain any labels; that is, it consists of only inputs and not outputs. The\n",
    "goal of unsupervised learning is to determine hidden patterns in the input. There is\n",
    "a common misconception that RL is a kind of unsupervised learning, but it is not. In\n",
    "unsupervised learning, the model learns the hidden structure, whereas, in RL, the\n",
    "model learns by maximizing the reward.\n",
    "\n",
    "For instance, consider a movie recommendation system. Say we want to recommend\n",
    "a new movie to the user. With unsupervised learning, the model (agent) will find\n",
    "movies similar to the movies the user (or users with a profile similar to the user) has\n",
    "viewed before and recommend new movies to the user.\n",
    "\n",
    "With RL, the agent constantly receives feedback from the user. This feedback\n",
    "represents rewards (a reward could be ratings the user has given for a movie they\n",
    "have watched, time spent watching a movie, time spent watching trailers, and so on).\n",
    "Based on the rewards, an RL agent will understand the movie preference of the user\n",
    "and then suggest new movies accordingly.\n",
    "\n",
    "Since the RL agent is learning with the aid of rewards, it can understand if the user's\n",
    "movie preference changes and suggest new movies according to the user's changed\n",
    "movie preference dynamically.\n",
    "\n",
    "Thus, we can say that in both supervised and unsupervised learning the model\n",
    "(agent) learns based on the given training dataset, whereas in RL the agent learns\n",
    "by directly interacting with the environment. Thus, RL is essentially an interaction\n",
    "between the agent and its environment.\n",
    "\n",
    "Before moving on to the fundamental concepts of RL, we will introduce a popular\n",
    "process to aid decision-making in an RL environment.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.06. Markov Decision Processes-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Markov Decision Processes \n",
    "\n",
    "The Markov Decision Process (MDP) provides a mathematical framework for\n",
    "solving the RL problem. Almost all RL problems can be modeled as an MDP. MDPs\n",
    "are widely used for solving various optimization problems. In this section, we will\n",
    "understand what an MDP is and how it is used in RL.\n",
    "\n",
    "To understand an MDP, first, we need to learn about the Markov property and\n",
    "Markov chain.\n",
    "\n",
    "\n",
    "## Markov Property and Markov Chain \n",
    "\n",
    "The Markov property states that the future depends only on the present and not\n",
    "on the past. The Markov chain, also known as the Markov process, consists of a\n",
    "sequence of states that strictly obey the Markov property; that is, the Markov chain\n",
    "is the probabilistic model that solely depends on the current state to predict the next\n",
    "state and not the previous states, that is, the future is conditionally independent of\n",
    "the past.\n",
    "\n",
    "For example, if we want to predict the weather and we know that the current state is\n",
    "cloudy, we can predict that the next state could be rainy. We concluded that the next\n",
    "state is likely to be rainy only by considering the current state (cloudy) and not the\n",
    "previous states, which might have been sunny, windy, and so on.\n",
    "However, the Markov property does not hold for all processes. For instance,\n",
    "throwing a dice (the next state) has no dependency on the previous number that\n",
    "showed up on the dice (the current state).\n",
    "\n",
    "Moving from one state to another is called a transition, and its probability is called\n",
    "a transition probability. We denote the transition probability by $P(s'|s) $. It indicates\n",
    "the probability of moving from the state $s$ to the next state $s'$.\n",
    "\n",
    "Say we have three states (cloudy, rainy, and windy) in our Markov chain. Then we can represent the\n",
    "probability of transitioning from one state to another using a table called a Markov\n",
    "table, as shown in the following table:\n",
    "\n",
    "![title](Images/8.PNG)\n",
    "\n",
    "From the above table, we can observe that:\n",
    "\n",
    "* From the state cloudy, we transition to the state rainy with 70% probability and to the state windy with 30% probability.\n",
    "\n",
    "* From the state rainy, we transition to the same state rainy with 80% probability and to the state cloudy with 20% probability.\n",
    "\n",
    "* From the state windy, we transition to the state rainy with 100% probability.\n",
    "\n",
    "We can also represent this transition information of the Markov chain in the form of\n",
    "a state diagram, as shown below:\n",
    "\n",
    "\n",
    "![title](Images/9.png)\n",
    "We can also formulate the transition probabilities into a matrix called the transition\n",
    "matrix, as shown below:\n",
    "\n",
    "![title](Images/10.PNG)\n",
    "\n",
    "Thus, to conclude, we can say that the Markov chain or Markov process consists of a\n",
    "set of states along with their transition probabilities.\n",
    "\n",
    "## Markov Reward Process\n",
    "\n",
    "The Markov Reward Process (MRP) is an extension of the Markov chain with the\n",
    "reward function. That is, we learned that the Markov chain consists of states and a\n",
    "transition probability. The MRP consists of states, a transition probability, and also a\n",
    "reward function.\n",
    "\n",
    "A reward function tells us the reward we obtain in each state. For instance, based on\n",
    "our previous weather example, the reward function tells us the reward we obtain\n",
    "in the state cloudy, the reward we obtain in the state windy, and so on. The reward\n",
    "function is usually denoted by $R(s)$.\n",
    "\n",
    "Thus, the MRP consists of states $s$, a transition probability $P(s|s')$\n",
    "function $R(s)$. \n",
    "\n",
    "## Markov Decision Process\n",
    "\n",
    "The Markov Decision Process (MDP) is an extension of the MRP with actions. That\n",
    "is, we learned that the MRP consists of states, a transition probability, and a reward\n",
    "function. The MDP consists of states, a transition probability, a reward function,\n",
    "and also actions. We learned that the Markov property states that the next state is\n",
    "dependent only on the current state and is not based on the previous state. Is the\n",
    "Markov property applicable to the RL setting? Yes! In the RL environment, the agent\n",
    "makes decisions only based on the current state and not based on the past states. So,\n",
    "we can model an RL environment as an MDP.\n",
    "\n",
    "Let's understand this with an example. Given any environment, we can formulate\n",
    "the environment using an MDP. For instance, let's consider the same grid world\n",
    "environment we learned earlier. The following figure shows the grid world environment,\n",
    "and the goal of the agent is to reach state I from state A without visiting the shaded\n",
    "state\n",
    "\n",
    "\n",
    "![title](Images/11.png)\n",
    "\n",
    "An agent makes a decision (action) in the environment only based on the current\n",
    "state the agent is in and not based on the past state. So, we can formulate our\n",
    "environment as an MDP. We learned that the MDP consists of states, actions,\n",
    "transition probabilities, and a reward function. Now, let's learn how this relates to\n",
    "our RL environment:\n",
    "\n",
    "__States__ – A set of states present in the environment. Thus, in the grid world\n",
    "environment, we have states A to I.\n",
    "\n",
    "__Actions__ – A set of actions that our agent can perform in each state. An agent\n",
    "performs an action and moves from one state to another. Thus, in the grid world\n",
    "environment, the set of actions is up, down, left, and right.\n",
    "\n",
    "__Transition probability__ – The transition probability is denoted by $ P(s'|s,a) $. It\n",
    "implies the probability of moving from a state $s$ to the next state $s'$ while performing\n",
    "an action $a$. If you observe, in the MRP, the transition probability is just $ P(s'|s,a) $ that\n",
    "is, the probability of going from state $s$ to state $s'$ and it doesn't include actions. But in MDP we include the actions, thus the transition probability is denoted by $ P(s'|s,a) $. \n",
    "\n",
    "For example, in our grid world environment, say, the transition probability of moving from state A to state B while performing an action right is 100% then it can be expressed as: $P( B |A , \\text{right}) = 1.0 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/12.png)\n",
    "\n",
    "Suppose, our agent is in state C and the transition probability of moving from state C to the state F while performing an action down is 90% then it can be expressed as: $P( F |C , \\text{down}) = 0.9 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/13.png)\n",
    "\n",
    "__Reward function__ -  The reward function is denoted by $R(s,a,s') $. It implies the reward our agent obtains while transitioning from a state $s$ to the state $s'$ while performing an action $a$. \n",
    "\n",
    "Say, the reward we obtain while transitioning from the state A to the state B while performing an action right is -1, then it can be expressed as $R(A, \\text{right}, B) = -1 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/14.png)\n",
    "\n",
    "Suppose, our agent is in state C and say, the reward we obtain while transitioning from the state C to the state F while performing an action down is  +1, then it can be expressed as $R(C, \\text{down}, F) = +1 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/15.png)\n",
    "\n",
    "\n",
    "Thus, an RL environment can be represented as an MDP with states, actions,\n",
    "transition probability, and the reward function. But wait! What is the use of\n",
    "representing the RL environment using the MDP? We can solve the RL problem easily\n",
    "once we model our environment as the MDP. For instance, once we model our grid\n",
    "world environment using the MDP, then we can easily find how to reach the goal\n",
    "state I from state A without visiting the shaded states. We will learn more about this\n",
    "in the upcoming chapters. Next, we will go through more essential concepts of RL.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.07. Action space, Policy, Episode and Horizon-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Action space, Policy, Episode, Horizon\n",
    "\n",
    "In this section, we will learn about the several important fundamental concepts that are involved in reinforcement learning. \n",
    "\n",
    "## Action space\n",
    "Consider the grid world environment shown below:\n",
    "\n",
    "![title](Images/16.png)\n",
    "\n",
    "\n",
    "In the preceding grid world environment, the goal of the agent is to reach state I\n",
    "starting from state A without visiting the shaded states. In each of the states, the\n",
    "agent can perform any of the four actions—up, down, left, and right—to achieve the\n",
    "goal. The set of all possible actions in the environment is called the action space.\n",
    "Thus, for this grid world environment, the action space will be [up, down, left, right].\n",
    "We can categorize action spaces into two types:\n",
    "\n",
    "* Discrete action space \n",
    "* Continuous action space\n",
    "\n",
    "__Discrete action space__ -When our action space consists of actions that are discrete,\n",
    "then it is called a discrete action space. For instance, in the grid world environment,\n",
    "our action space consists of four discrete actions, which are up, down, left, right, and\n",
    "so it is called a discrete action space.\n",
    "\n",
    "__Continuous action space__ - When our action space consists of actions that are\n",
    "continuous, then it is called a continuous action space. For instance, let's suppose\n",
    "we are training an agent to drive a car, then our action space will consist of several\n",
    "actions that have continuous values, such as the speed at which we need to drive the\n",
    "car, the number of degrees we need to rotate the wheel, and so on. In cases where\n",
    "our action space consists of actions that are continuous, it is called a continuous\n",
    "action space.\n",
    "\n",
    "## Policy\n",
    "\n",
    "A policy defines the agent's behavior in an environment. The policy tells the agent\n",
    "what action to perform in each state. For instance, in the grid world environment, we\n",
    "have states A to I and four possible actions. The policy may tell the agent to move\n",
    "down in state A, move right in state D, and so on.\n",
    "\n",
    "To interact with the environment for the first time, we initialize a random policy, that\n",
    "is, the random policy tells the agent to perform a random action in each state. Thus,\n",
    "in an initial iteration, the agent performs a random action in each state and tries to\n",
    "learn whether the action is good or bad based on the reward it obtains. Over a series\n",
    "of iterations, an agent will learn to perform good actions in each state, which gives a\n",
    "positive reward. Thus, we can say that over a series of iterations, the agent will learn\n",
    "a good policy that gives a positive reward.\n",
    "\n",
    "The optimal policy is shown in the following figure. As we can observe, the agent selects the\n",
    "action in each state based on the optimal policy and reaches the terminal state I from\n",
    "the starting state A without visiting the shaded states:\n",
    "\n",
    "![title](Images/17.png)\n",
    "\n",
    "Thus, the optimal policy tells the agent to perform the correct action in each state so\n",
    "that the agent can receive a good reward.\n",
    "\n",
    "A policy can be classified into two:\n",
    "\n",
    "* Deterministic Policy\n",
    "* Stochastic Policy\n",
    "\n",
    "### Deterministic Policy\n",
    "The policy which we just learned above is called deterministic policy. That is, deterministic policy tells the agent to perform a one particular action in a state. Thus, the deterministic policy maps the state to one particular action and is often denoted by $\\mu$. Given a state $s$ at a time $t$, a deterministic policy tells the agent to perform a one particular action $a$. It can be expressed as:\n",
    "\n",
    "$$a_t = \\mu(s_t) $$\n",
    "\n",
    "For instance, consider our grid world example, given a state A, the deterministic policy $\\mu$ tells the agent to perform an action down and it can be expressed as:\n",
    "\n",
    "$$\\mu (A) = \\text{Down} $$\n",
    "\n",
    "Thus, according to the deterministic policy, whenever the agent visits state A, it performs the action down. \n",
    "\n",
    "### Stochastic Policy\n",
    "\n",
    "Unlike deterministic policy, the stochastic policy does not map the state directly to one particular action, instead, it maps the state to a probability distribution over an action space. \n",
    "\n",
    "That is, we learned that given a state, the deterministic policy will tell the agent to perform one particular action in the given state, so, whenever the agent visits the state it always performs the same particular action. But with stochastic policy, given a state, the stochastic policy will return a probability distribution over an action space so instead of performing the same action every time the agent visits the state, the agent performs different actions each time based on a probability distribution returned by the stochastic policy. \n",
    "\n",
    "Let's understand this with an example, we know that our grid world environment's action space consists of 4 actions which are [up, down, left, right]. Given a state A, the stochastic policy returns the probability distribution over the action space as [0.10,0.70,0.10,0.10]. Now, whenever the agent visits the state A, instead of selecting the same particular action every time, the agent selects the action up 10% of the time, action down 70% of the time, action left 10% of time and action right 10% of the time. \n",
    "\n",
    "The difference between the deterministic policy and stochastic policy is shown below, as we can observe the deterministic policy maps the state to one particular action whereas the stochastic policy maps the state to the probability distribution over an action space:\n",
    "\n",
    "![title](Images/18.png)\n",
    "\n",
    "Thus, stochastic policy maps the state to a probability distribution over action space and it is often denoted by $\\pi$.  Say, we have a state $s$ and action $a$ at a time $t$, then we can express the stochastic policy as:\n",
    "\n",
    "\n",
    "$$a_t \\sim \\pi(s_t) $$\n",
    "\n",
    "Or it can also be expressed as $\\pi(a_t |s_t) $. \n",
    "\n",
    "We can categorize the stochastic policy into two:\n",
    "\n",
    "* Categorical policy\n",
    "* Gaussian policy\n",
    "\n",
    "### Categorical policy \n",
    "A stochastic policy is called a categorical policy when the action space is discrete. That is, the stochastic policy uses categorical probability distribution over action space to select actions when the action space is discrete. For instance, in the grid world environment, we have just seen above, we select actions based on categorical probability distribution (discrete distribution) as the action space of the environment is discrete. As shown below, given a state A, we select an action based on the categorical probability distribution over the action space:\n",
    "\n",
    "\n",
    "\n",
    "![title](Images/19.png)\n",
    "### Gaussian policy \n",
    "A stochastic policy is called a gaussian policy when our action space is continuous. That is, the stochastic policy uses Gaussian probability distribution over action space to select actions when the action space is continuous. Let's understand this with a small example. Suppose we training an agent to drive a car and say we have one continuous action in our action space. Let the action be the speed of the car and the value of the speed of the car ranges from 0 to 150 kmph. Then, the stochastic policy uses the Gaussian distribution over the action space to select action as shown below:\n",
    "\n",
    "![title](Images/20.png)\n",
    "\n",
    "\n",
    "We will learn more about the gaussian policy in the upcoming chapters.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Episode \n",
    "\n",
    "The agent interacts with the environment by performing some action starting from the initial state and reach the final state. This agent-environment interaction starting from the initial state until the final state is called an episode. For instance, in the car racing video game, the agent plays the game by starting from the initial state (starting point of the race) and reach the final state (endpoint of the race). This is considered an episode. An episode is also often called trajectory (path taken by the agent) and it is denoted by $\\tau$. \n",
    "\n",
    "An agent can play the game for any number of episodes and each episode is independent of each other. What is the use of playing the game for multiple numbers of episodes? In order to learn the optimal policy, that is, the policy which tells the agent to perform correct action in each state, the agent plays the game for many episodes. \n",
    "\n",
    "For example, let's say we are playing a car racing game, for the first time, we may not win the game and we play the game several times to understand more about the game and discover some good strategies for winning the game. Similarly, in the first episode, the agent may not win the game and it plays the game for several episodes to understand more about the game environment and good strategies to win the game. \n",
    "\n",
    "\n",
    "\n",
    "Say, we begin the game from an initial state at a time step t=0 and reach the final state at a time step T then the episode information consists of the agent environment interaction such as state, action, and reward starting from the initial state till the final state, that is, $(s_0, a_0,r_0,s_1,a_1,r_1,\\dots,s_T) $\n",
    "\n",
    "An episode (or) trajectory is shown below:\n",
    "\n",
    "![title](Images/21.png)\n",
    "\n",
    "\n",
    "Let's strengthen our understanding of the episode and optimal policy with the grid world environment. We learned that in the grid world environment, the goal of our agent is to reach the final state I starting from the initial state A without visiting the shaded states. An agent receives +1 reward when it visits the unshaded states and -1 reward when it visits the shaded states.\n",
    "\n",
    "When we say, generate an episode it means going from initial state to the final state. The agent generates the first episode using a random policy and explores the environment and over several episodes, it will learn the optimal policy. \n",
    "\n",
    "### Episode 1:\n",
    "\n",
    "As shown below, in the first episode, the agent uses random policy and selects random action in each state starting from the initial state until the final state and observe the reward:\n",
    "\n",
    "\n",
    "![title](Images/22.png)\n",
    "\n",
    "\n",
    "### Episode 2:\n",
    "\n",
    "In the second episode, the agent tries a different policy to avoid negative rewards which it had received in the previous episode. For instance, as we can observe in the previous episode, the agent selected an action right in the state A and received a negative reward, so in this episode, instead of selecting action right in the state A, it tries a different action say, down as shown below:\n",
    "\n",
    "\n",
    "![title](Images/23.png)\n",
    "\n",
    "### Episode n:\n",
    "\n",
    "Thus, over a series of the episodes, the agent learns the optimal policy, that is, the policy which takes the agent to the final state I from the state A without visiting the shaded states as shown below:\n",
    "\n",
    "\n",
    "![title](Images/24.png)\n",
    "\n",
    "# Episodic and Continuous tasks \n",
    "A reinforcement learning task can be categorized into two:\n",
    "* Episodic task\n",
    "* Continuous task\n",
    "\n",
    "__Episodic task__ - As the name suggests episodic task is the one that has the terminal state. That is, episodic tasks are basically tasks made up of episodes and thus they have a terminal state. Example: Car racing game. \n",
    "\n",
    "__Continuous task__ - Unlike episodic tasks, continuous tasks do not contain any episodes and so they don't have any terminal state. For example, a personal assistance robot does not have a terminal state. \n",
    "\n",
    "\n",
    "# Horizon\n",
    "Horizon is the time step until which the agent interacts with the environment. We can classify the horizon into two:\n",
    "\n",
    "* Finite horizon\n",
    "* Infinite horizon\n",
    "\n",
    "__Finite horizon__ - If the agent environment interaction stops at a particular time step then it is called finite Horizon. For instance, in the episodic tasks agent interacts with the environment starting from the initial state at time step  t =0 and reach the final state at a time step T.  Since the agent environment interaction stops at the time step T, it is considered a finite horizon. \n",
    "\n",
    "__Infinite horizon__ - If the agent environment interaction never stops then it is called an infinite horizon. For instance, we learned that the continuous task does not have any terminal states, so the agent environment interaction will never stop in the continuous task and so it is considered an infinite horizon. \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.01. Key Elements of Reinforcement Learning .ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "\n",
    "Reinforcement Learning (RL) is one of the areas of Machine Learning (ML). Unlike\n",
    "other ML paradigms, such as supervised and unsupervised learning, RL works in a\n",
    "trial and error fashion by interacting with its environment.\n",
    "\n",
    "RL is one of the most active areas of research in artificial intelligence, and it is\n",
    "believed that RL will take us a step closer towards achieving artificial general\n",
    "intelligence. RL has evolved rapidly in the past few years with a wide variety of\n",
    "applications ranging from building a recommendation system to self-driving cars.\n",
    "The major reason for this evolution is the advent of deep reinforcement learning,\n",
    "which is a combination of deep learning and RL. With the emergence of new RL\n",
    "algorithms and libraries, RL is clearly one of the most promising areas of ML.\n",
    "\n",
    "In this chapter, we will build a strong foundation in RL by exploring several\n",
    "important and fundamental concepts involved in RL. In this chapter, we will learn about the following topics:\n",
    "\n",
    "* Key elements of RL\n",
    "* The basic idea of RL\n",
    "* The RL algorithm\n",
    "* How RL differs from other ML paradigms\n",
    "* The Markov Decision Processes\n",
    "* Fundamental concepts of RL\n",
    "* Applications of RL\n",
    "* RL glossary\n",
    "\n",
    "We will begin the chapter by understanding Key elements of RL. This will help us understand the\n",
    "basic idea of RL."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Key Elements of Reinforcement Learning \n",
    "\n",
    "Let's begin by understanding some key elements of RL.\n",
    "\n",
    "## Agent \n",
    "\n",
    "An agent is a software program that learns to make intelligent decisions. We can\n",
    "say that an agent is a learner in the RL setting. For instance, a chess player can be\n",
    "considered an agent since the player learns to make the best moves (decisions) to win\n",
    "the game. Similarly, Mario in a Super Mario Bros video game can be considered an\n",
    "agent since Mario explores the game and learns to make the best moves in the game.\n",
    "\n",
    "\n",
    "## Environment \n",
    "The environment is the world of the agent. The agent stays within the environment.\n",
    "For instance, coming back to our chess game, a chessboard is called the environment\n",
    "since the chess player (agent) learns to play the game of chess within the chessboard\n",
    "(environment). Similarly, in Super Mario Bros, the world of Mario is called the\n",
    "environment.\n",
    "\n",
    "## State and action\n",
    "A state is a position or a moment in the environment that the agent can be in. We\n",
    "learned that the agent stays within the environment, and there can be many positions\n",
    "in the environment that the agent can stay in, and those positions are called states.\n",
    "For instance, in our chess game example, each position on the chessboard is called\n",
    "the state. The state is usually denoted by s.\n",
    "\n",
    "The agent interacts with the environment and moves from one state to another\n",
    "by performing an action. In the chess game environment, the action is the move\n",
    "performed by the player (agent). The action is usually denoted by a.\n",
    "\n",
    "\n",
    "## Reward\n",
    "\n",
    "We learned that the agent interacts with an environment by performing an action\n",
    "and moves from one state to another. Based on the action, the agent receives a\n",
    "reward. A reward is nothing but a numerical value, say, +1 for a good action and -1\n",
    "for a bad action. How do we decide if an action is good or bad?\n",
    "In our chess game example, if the agent makes a move in which it takes one of the\n",
    "opponent's chess pieces, then it is considered a good action and the agent receives\n",
    "a positive reward. Similarly, if the agent makes a move that leads to the opponent\n",
    "taking the agent's chess piece, then it is considered a bad action and the agent\n",
    "receives a negative reward. The reward is denoted by r.\n",
    "\n",
    "\n",
    "In the next section, let us explore basic idea of reinforcement learning. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 02. A Guide to the Gym Toolkit/2.02.  Creating our First Gym Environment.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creating our first gym environment\n",
    "\n",
    "We learned that the gym provides a variety of environments for training the reinforcement learning agent. To clearly understand how the gym environment is designed, we will start off with the basic gym environment. Going forward, we will understand all other complex gym environments. \n",
    "\n",
    "Let's introduce one of the simplest environments called the frozen lake environment. The frozen lake environment is shown below. As we can observe, in the frozen lake environment, the goal of the agent is to start from the initial state S and reach the goal state G.\n",
    "\n",
    "![title](Images/1.png)\n",
    "\n",
    "In the above environment, the following applies:\n",
    "\n",
    "* S denotes the starting state\n",
    "* F denotes the frozen state\n",
    "* H denotes the hole state\n",
    "* G denotes the goal state\n",
    "\n",
    "So, the agent has to start from the state S and reach the goal state G. But one issue is that if the agent visits the state H, which is just the hole state, then the agent will fall into the hole and die as shown below:\n",
    "\n",
    "\n",
    "![title](Images/2.png)\n",
    "\n",
    "So, we need to make sure that the agent starts from S and reaches G without falling into the hole state H as shown below:\n",
    "\n",
    "\n",
    "![title](Images/3.png)\n",
    "Each grid box in the above environment is called state, thus we have 16 states (S to G) and we have 4 possible actions which are up, down, left and right. We learned that our goal is to reach the state G from S without visiting H. So, we assign reward as 0 to all the states and + 1 for the goal state G. \n",
    "\n",
    "Thus, we learned how the frozen lake environment works. Now, to train our agent in the frozen lake environment, first, we need to create the environment by coding it from scratch in Python. But luckily we don't have to do that! Since the gym provides the various environment, we can directly import the gym toolkit and create a frozen lake environment using the gym.\n",
    "\n",
    "\n",
    "Now, we will learn how to create our frozen lake environment using the gym. Before running any code, make sure that you activated our virtual environment universe. First, let's import the gym library:\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Next, we can create a gym environment using the make function.  The make function requires the environment id as a parameter. In the gym, the id of the frozen lake environment is `FrozenLake-v0`. So, we can create our frozen lake environment as shown below:\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make(\"FrozenLake-v0\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After creating the environment, we can see how our environment looks like using the render function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[41mS\u001b[0mFFF\n",
      "FHFH\n",
      "FFFH\n",
      "HFFG\n"
     ]
    }
   ],
   "source": [
    "env.render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "As we can observe, the frozen lake environment consists of 16 states (S to G) as we learned. The state S is highlighted indicating that it is our current state, that is, agent is in the state S. So whenever we create an environment, an agent will always begin from the initial state, in our case, it is the state S. \n",
    "\n",
    "That's it! Creating the environment using the gym is that simple. In the next section, we will understand more about the gym environment by relating all the concepts we have learned in the previous chapter. \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Exploring the environment\n",
    "\n",
    "In the previous chapter, we learned that the reinforcement learning environment can be modeled as the Markov decision process (MDP) and an MDP consists of the following: \n",
    "\n",
    "* __States__ -  A set of states present in the environment \n",
    "* __Actions__ - A set of actions that the agent can perform in each state. \n",
    "* __Transition probability__ - The transition probability is denoted by $P(s'|s,a) $. It implies the probability of moving from a state $s$ to the state $s'$ while performing an action $a$.\n",
    "* __Reward function__ - Reward function is denoted by $R(s,a,s')$. It implies the reward the agent obtains moving from a state $s$ to the state  $s'$ while performing an action $a$.\n",
    "\n",
    "Let's now understand how to obtain all the above information from the frozen lake environment we just created using the gym.\n",
    "\n",
    "\n",
    "\n",
    "## States\n",
    "A state space consists of all of our states. We can obtain the number of states in our environment by just typing `env.observation_space` as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Discrete(16)\n"
     ]
    }
   ],
   "source": [
    "print(env.observation_space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It implies that we have 16 discrete states in our state space starting from the state S to G. Note that, in the gym, the states will be encoded as a number, so the state S will be encoded as 0, state F will be encoded as 1 and so on as shown below:\n",
    "\n",
    "\n",
    "![title](Images/5.png)"
   ]
  },
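  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can verify this encoding directly: resetting the environment returns the encoded number of the initial state, which is 0 for the state S. A quick optional check using the `env` we created above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#reset returns the encoded state number; the initial state S is encoded as 0\n",
    "print(env.reset())"
   ]
  },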
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Actions\n",
    "\n",
    "We learned that the action space consists of all the possible actions in the environment. We can obtain the action space by `env.action_space` as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Discrete(4)\n"
     ]
    }
   ],
   "source": [
    "print(env.action_space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It implies that we have 4 discrete actions in our action space which are left, down, right, up. Note that, similar to states, actions also will be encoded into numbers as shown below:\n",
    "\n",
    "\n",
    "![title](Images/6.PNG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transition probability and Reward function\n",
    "\n",
    "Now, let's look at how to obtain the transition probability and the reward function. We learned that in the stochastic environment, we cannot say that by performing some action $a$, agent will always reach the next state $s'$ exactly because there will be some randomness associated with the stochastic environment and by performing an action $a$ in the state $s$, agent reaches the next state  with some probability.\n",
    "\n",
    "Let's suppose we are in state 2 (F). Now if we perform action 1 (down) in state 2, we can reach the state 6 as shown below:\n",
    "\n",
    "\n",
    "![title](Images/7.png)\n",
    "\n",
    "Our frozen lake environment is a stochastic environment. When our environment is stochastic we won't always reach the state 6 by performing action 1(down) in state 2, we also reach other states with some probability. So when we perform an action 1 (down) in the state 2, we reach state 1 with probability 0.33333, we reach state 6 with probability 0.33333 and we reach the state 3 with probability 0.33333 as shown below:\n",
    "\n",
    "\n",
    "![title](Images/8.png)\n",
    "\n",
    "\n",
    "As we can notice, in the stochastic environment we reach the next states with some probability. Now, let's learn how to obtain this transition probability using the gym environment.  \n",
    "\n",
    "We can obtain the transition probability and the reward function by just typing `env.P[state][action]` So, in order to obtain the transition probability of moving from the state S to the other states by performing an action right, we can type, `env.P[S][right]`. But we cannot just type state S and action right directly since they are encoded into numbers. We learned that state S is encoded as 0 and the action right is encoded as 2, so, in order to obtain the transition probability of state S by performing an action right, we type `env.P[0][2]` as shown below:\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False)]\n"
     ]
    }
   ],
   "source": [
    "print(env.P[0][2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What does this imply? Our output is in the form of `[(transition probability, next state, reward, Is terminal state?)]` It implies that if we perform an action 2 (right) in state 0 (S) then:\n",
    "\n",
    "* We reach the state 4 (F) with probability 0.33333 and receive 0 reward. \n",
    "* We reach the state 1 (F) with probability 0.33333 and receive 0 reward.\n",
    "* We reach the same state 0 (S) with probability 0.33333 and receive 0 reward.\n",
    "\n",
    "The transition probability is shown below:\n",
    "\n",
    "\n",
    "\n",
    "![title](Images/9.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus, when we type `env.P[state][action]` we get the result in the form of `[(transition probability, next state, reward, Is terminal state?)]`. The last value is the boolean and it implies that whether the next state is a terminal state, since 4, 1 and 0 are not the terminal states it is given as false. \n",
    "\n",
    "The output of `env.P[0][2]` is shown in the below table for more clarity:\n",
    "\n",
    "\n",
    "![title](Images/10.PNG)"
   ]
  },
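  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also inspect all four actions of a state at once. As a small optional sketch (using the same `env` and the action encoding shown earlier), the following loops over the action space and prints the transition tuples of state 0 for each action:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#print the transition tuples of state 0 for every action (0 - left, 1 - down, 2 - right, 3 - up)\n",
    "for action in range(env.action_space.n):\n",
    "    print('action', action, ':', env.P[0][action])"
   ]
  },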
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's understand this with one more example. Let's suppose we are in the state 3 (F) as shown below:\n",
    "\n",
    "\n",
    "![title](Images/11.png)\n",
    "\n",
    "Say, we perform action 1 (down) in the state 3(F). Then the transition probability of the state 3(F) by performing action 1(down) can be obtained as shown below:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0.3333333333333333, 2, 0.0, False), (0.3333333333333333, 7, 0.0, True), (0.3333333333333333, 3, 0.0, False)]\n"
     ]
    }
   ],
   "source": [
    "print(env.P[3][1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we learned, our output is in the form of `[(transition probability, next state, reward, Is terminal state?)]` It implies that if we perform an action 1 (down) in state 3 (F) then:\n",
    "\n",
    "* We reach the state 2 (F) with probability 0.33333 and receive 0 reward. \n",
    "* We reach the state 7 (H) with probability 0.33333 and receive 0 reward.\n",
    "* We reach the same state 3 (F) with probability 0.33333 and receive 0 reward.\n",
    "\n",
    "\n",
    "The transition probability is shown below:\n",
    "\n",
    "\n",
    "\n",
    "![title](Images/12.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The output of `env.P[3][1]` is shown in the below table for more clarity:\n",
    "\n",
    "\n",
    "![title](Images/13.PNG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, in the second row of our output, we have, `(0.33333, 7, 0.0, True)`,and the last value here is marked as True. It implies that state 7 is a terminal state. That is, if we perform action 1(down) in state 3(F) then we reach the state 7(H) with 0.33333 probability and since 7(H) is a hole, the agent dies if it reaches the state 7(H). Thus 7(H) is a terminal state and so it is marked as True. \n",
    "\n",
    "Thus, we learned how to obtain the state space, action space, transition probability and the reward function using the gym environment. In the next section, we will learn how to generate an episode. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 02. A Guide to the Gym Toolkit/2.04. Classic Control Environments.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classic control environments\n",
    "\n",
    "The gym provides environments for several classic control tasks such as cart pole balancing, swinging up the pendulum, mountain car climbing and so on. Let's understand how to create a gym environment for a cart pole balancing task. The cart pole environment is shown below:\n",
    "\n",
    "\n",
    "![title](Images/17.PNG)\n",
    "\n",
    "Cart Pole balancing is one of the classical control problems. As shown in the above figure, the pole is attached to the cart and the goal of our agent is to balance the pole on the cart, that is, the goal of our agent is to keep the pole straight up standing on the cart as shown below:\n",
    "\n",
    "![title](Images/18.PNG)\n",
    "\n",
    "\n",
    "So the agent tries to push the cart left and right to keep the pole standing straight on the cart. Thus our agent performs two actions which are pushing the cart to the left and pushing the cart to the right to keep the pole standing straight on the cart. You can also check this very interesting video https://youtu.be/qMlcsc43-lg which shows how the RL agent balances the pole on the cart by moving the cart left and right. \n",
    "\n",
    "Now, let's learn how to create the cart pole environment using the gym. The environment id of the cart pole environment in the gym is `CartPole-v0` , so we can just use our `make` function to create the cart pole environment as shown below:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "env = gym.make(\"CartPole-v0\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After creating, we can view our environment using the `render` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "env.render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also close rendering the environment using the `close` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## State space\n",
    "\n",
    "Now, let's look at the state space of our cart pole environment. Wait! What are the states here? In the frozen lake environment we had discrete 16 states from (S to G). But how can we describe the states here? Can we describe the state by cart position? Yes! Note that the cart position is a continuous value. So, in this case, our state space will be continuous values, unlike the frozen lake environment where our state space had discrete values (S to G).\n",
    "\n",
    "But with just the cart position alone we cannot describe the state of the environment completely. So we include cart velocity, pole angle and pole velocity at the tip. So we can describe our state space by an array of values as shown below:\n",
    "\n",
    "`array([cart position, cart velocity, pole angle, pole velocity at the tip])`\n",
    "\n",
    "Note that all of these values are continuous, that is:\n",
    "\n",
    "* The value of cart position ranges from -4.8 to 4.8\n",
    "* The value of cart velocity ranges from -Inf to Inf\n",
    "* The value of pole angle ranges from -0.418 radians to 0.418 radians \n",
    "* The value of pole velocity city at the tip ranges from -Inf to Inf\n",
    "\n",
    "Thus, our state space contains an array of continuous values. Let's learn how can we obtain this from the gym. In order to get the state space, we can just type `env.observation_space` as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Box(4,)\n"
     ]
    }
   ],
   "source": [
    "print(env.observation_space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Box implies that our state space consists of continuous values and not discrete values. That is, in the frozen lake environment, we obtained the state space as `Discrete(16)` which implies that we have 16 discrete states (S to G). But now we got our state space as `Box(4,)` which implies that our state space is continuous and consists of an array of 4 values.\n",
    "\n",
    "For example, let's reset our environment and see how our initial state space will look like. We can reset the environment using the `reset` function:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[-0.0468521   0.04980211 -0.0063804   0.04013309]\n"
     ]
    }
   ],
   "source": [
    "print(env.reset())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It implies our initial state space, as we can notice, we have an array of 4 values which denotes the cart position, cart velocity, pole angle and pole velocity at the tip respectively. That is:\n",
    "\n",
    "![title](Images/19.PNG)\n",
    "\n",
    "Okay, how can we obtain the maximum and minimum values of our state space? We can obtain the maximum values of our state space using `env.observation_space.high` and the minimum values of our state space using `env.observation_space.low`\n",
    "\n",
    "For example, let's look at the maximum value of our state space:\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]\n"
     ]
    }
   ],
   "source": [
    "print(env.observation_space.high)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It implies that:\n",
    "\n",
    "1. The maximum value of the cart position is 4.8\n",
    "2. We learned that the maximum value of cart velocity is  +Inf, we know that infinity is not really a number, so it is represented using the largest positive real value 3.4028235e+38.\n",
    "3. The maximum value of the pole angle is 0.418 radians.\n",
    "4. The maximum value of pole velocity at the tip is +Inf, so it is represented using largest positive real value 3.4028235e+38\n",
    "\n",
    "Similarly, we can obtain the minimum value of our state space as:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]\n"
     ]
    }
   ],
   "source": [
    "print(env.observation_space.low)"
   ]
  },
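  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick side check (an addition for clarity, not part of the original walkthrough), the value 3.4028235e+38 shown above is exactly the largest finite value a 32-bit float can represent, which we can confirm with NumPy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "#the largest finite 32-bit float; it stands in for +Inf in the state space bounds\n",
    "print(np.finfo(np.float32).max)"
   ]
  },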
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Action space\n",
    "\n",
    "Now, let's look at the action space. We already learned that in the cart pole environment we perform two actions which are pushing the cart to the left and pushing the cart to the right and thus action space is discrete since we have only two discrete actions.\n",
    "\n",
    "In order to get the action space, we can just type `env.action_space` as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Discrete(2)\n"
     ]
    }
   ],
   "source": [
    "print(env.action_space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe `Discrete(2)` implies that our action space is discrete and we have two actions in our action space. Note that the actions will be encoded into numbers as shown below:\n",
    "\n",
    "\n",
    "![title](Images/20.PNG)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 02. A Guide to the Gym Toolkit/2.05. Cart Pole Balancing with Random Policy.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cart Pole Balancing with Random Policy\n",
    "\n",
    "Let's create an agent with the random policy, that is, we create the agent that selects the random action in the environment and tries to balance the pole. The agent receives +1 reward every time the pole stands straight up on the cart. We will generate over 100 episodes and we will see the return (sum of rewards) obtained over each episode. Let's learn this step by step.\n",
    "\n",
    "First, create our cart pole environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "env = gym.make('CartPole-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Set the number of episodes and number of time steps in the episode:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_episodes = 100\n",
    "num_timesteps = 50"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Episode: 0, Return: 23.0\n",
      "Episode: 10, Return: 12.0\n",
      "Episode: 20, Return: 23.0\n",
      "Episode: 30, Return: 15.0\n",
      "Episode: 40, Return: 19.0\n",
      "Episode: 50, Return: 10.0\n",
      "Episode: 60, Return: 16.0\n",
      "Episode: 70, Return: 10.0\n",
      "Episode: 80, Return: 22.0\n",
      "Episode: 90, Return: 38.0\n"
     ]
    }
   ],
   "source": [
    "#for each episode\n",
    "for i in range(num_episodes):\n",
    "    \n",
    "    #set the Return to 0\n",
    "    Return = 0\n",
    "    #initialize the state by resetting the environment\n",
    "    state = env.reset()\n",
    "    \n",
    "    #for each step in the episode\n",
    "    for t in range(num_timesteps):\n",
    "        #render the environment\n",
    "        env.render()\n",
    "        \n",
    "        #randomly select an action by sampling from the environment\n",
    "        random_action = env.action_space.sample()\n",
    "        \n",
    "        #perform the randomly selected action\n",
    "        next_state, reward, done, info = env.step(random_action)\n",
    "\n",
    "        #update the return\n",
    "        Return = Return + reward\n",
    "\n",
    "        #if the next state is a terminal state then end the episode\n",
    "        if done:\n",
    "            break\n",
    "    #for every 10 episodes, print the return (sum of rewards)\n",
    "    if i%10==0:\n",
    "        print('Episode: {}, Return: {}'.format(i, Return))\n",
    "        "
   ]
  },
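  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we also want a summary statistic over all the episodes, a small variation of the loop above (an optional addition; rendering is skipped to keep it fast) collects the return of every episode and reports the average:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#collect the return of every episode and report the average\n",
    "returns = []\n",
    "for i in range(num_episodes):\n",
    "    Return = 0\n",
    "    state = env.reset()\n",
    "    for t in range(num_timesteps):\n",
    "        next_state, reward, done, info = env.step(env.action_space.sample())\n",
    "        Return = Return + reward\n",
    "        if done:\n",
    "            break\n",
    "    returns.append(Return)\n",
    "\n",
    "print('average return of the random policy:', sum(returns)/len(returns))"
   ]
  },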
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Close the environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "env.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 02. A Guide to the Gym Toolkit/README.md
================================================
# 2. A Guide to the Gym Toolkit
* 2.1. Setting Up our Machine
   * 2.1.1. Installing Anaconda
   * 2.1.2. Installing the Gym Toolkit
   * 2.1.3. Common Error Fixes
* 2.2. Creating our First Gym Environment
   * 2.2.1. Exploring the Environment
   * 2.2.2. States
   * 2.2.3. Actions
   * 2.2.4. Transition Probability and Reward Function
* 2.3. Generating an episode
* 2.4. Classic Control Environments
   * 2.4.1. State Space
   * 2.4.2. Action Space
* 2.5. Cart Pole Balancing with Random Policy
* 2.6. Atari Game Environments
   * 2.6.1. General Environment
   * 2.6.2. Deterministic Environment
* 2.7. Agent Playing the Tennis Game
* 2.8. Recording the Game
* 2.9. Other environments
   * 2.9.1. Box 2D
   * 2.9.2. Mujoco
   * 2.9.3. Robotics
   * 2.9.4. Toy text
   * 2.9.5. Algorithms
* 2.10. Environment Synopsis

================================================
FILE: 03. Bellman Equation and Dynamic Programming/.ipynb_checkpoints/3.06. Solving the Frozen Lake Problem with Value Iteration-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Solving the Frozen Lake Problem with Value Iteration\n",
    "\n",
    "In the previous chapter, we have learned about the frozen lake environment. The frozen\n",
    "lake environment is shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/4.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's recap the frozen lake environment a bit. In the frozen lake environment shown above,\n",
    "the following applies:\n",
    "    \n",
    "* S implies the starting state\n",
    "* F implies the frozen states\n",
    "* H implies the hold states\n",
    "* G implies the goal state\n",
    "\n",
    "We learned that in the frozen lake environment, our goal is to reach the goal state G from\n",
    "the starting state S without visiting the hole states H. That is, while trying to reach the goal\n",
    "state G from the starting state S if the agent visits the hole state H then it will fall into the\n",
    "hole and die as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, the goal of the agent is to reach the state G starting from the state S without visiting the\n",
    "hole states H as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/6.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How can we achieve this goal? That is, how can we reach the state G from S without\n",
    "visiting H? We learned that the optimal policy tells the agent to perform correct action in\n",
    "each state. So, if we find the optimal policy then we can reach the state G from S visiting the state H. Okay, how to find the optimal policy? We can use the value iteration method\n",
    "we just learned to find the optimal policy.\n",
    "\n",
    "\n",
    "Remember that all our states (S to G) will be encoded from 0 to 16 and all the four actions -\n",
    "left, down, up, right will be encoded from 0 to 3 in the gym toolkit.\n",
    "So, in this section, we will learn how to find the optimal policy using the value iteration\n",
    "method so that the agent can reach the state G from S without visiting H."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's create the frozen lake environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('FrozenLake-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the frozen lake environment using the render function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[41mS\u001b[0mFFF\n",
      "FHFH\n",
      "FFFH\n",
      "HFFG\n"
     ]
    }
   ],
   "source": [
    "env.render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can notice, our agent is in the state S and it has to reach the state G without visiting\n",
    "the states H. So, let's learn how to compute the optimal policy using the value iteration\n",
    "method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's learn how to compute the optimal value function and then we will see how to\n",
    "extract the optimal policy from the computed optimal value function. \n",
    "\n",
    "\n",
    "## Computing optimal value function\n",
    "\n",
    "We will define a function called `value_iteration` where we compute the optimal value\n",
    "function iteratively by taking maximum over Q function. For\n",
    "better understanding, let's closely look at the every line of the function and then we look at\n",
    "the complete function at the end which gives us more clarity.\n",
    "\n",
    "\n",
    "\n",
    "Define `value_iteration` function which takes the environment as a parameter: \n",
    "    \n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def value_iteration(env):\n",
    "\n",
    "    #set the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #set the threshold number for checking the convergence of the value function\n",
    "    threshold = 1e-20\n",
    "    \n",
    "    #we also set the discount factor\n",
    "    gamma = 1.0\n",
    "    \n",
    "    #now, we will initialize the value table, with the value of all states to zero\n",
    "    value_table = np.zeros(env.observation_space.n)\n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        \n",
    "        #update the value table, that is, we learned that on every iteration, we use the updated value\n",
    "        #table (state values) from the previous iteration\n",
    "        updated_value_table = np.copy(value_table) \n",
    "             \n",
    "        #now, we compute the value function (state value) by taking the maximum of Q value.\n",
    "        \n",
    "        #thus, for each state, we compute the Q values of all the actions in the state and then\n",
    "        #we update the value of the state as the one which has maximum Q value as shown below:\n",
    "        for s in range(env.observation_space.n):\n",
    "            \n",
    "            Q_values = [sum([prob*(r + gamma * updated_value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                                        \n",
    "            value_table[s] = max(Q_values) \n",
    "                        \n",
    "        #after computing the value table, that is, value of all the states, we check whether the\n",
    "        #difference between value table obtained in the current iteration and previous iteration is\n",
    "        #less than or equal to a threshold value if it is less then we break the loop and return the\n",
    "        #value table as our optimal value function as shown below:\n",
    "    \n",
    "        if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):\n",
    "             break\n",
    "    \n",
    "    return value_table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, that we have computed the optimal value function by taking the maximum over Q\n",
    "values, let's see how to extract the optimal policy from the optimal value function. \n",
    "\n",
    "\n",
    "## Extracting optimal policy from the optimal value function\n",
    "\n",
    "In the previous step, we computed the optimal value function. Now, let see how to extract\n",
    "the optimal policy from the computed optimal value function.\n",
    "\n",
    "\n",
    "First, we define a function called `extract_policy` which takes the `value_table` as a\n",
    "parameter: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_policy(value_table):\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "     \n",
    "    #first, we initialize the policy with zeros, that is, first, we set the actions for all the states to\n",
    "    #be zero\n",
    "    policy = np.zeros(env.observation_space.n) \n",
    "    \n",
    "    #now, we compute the Q function using the optimal value function obtained from the\n",
    "    #previous step. After computing the Q function, we can extract policy by selecting action which has\n",
    "    #maximum Q value. Since we are computing the Q function using the optimal value\n",
    "    #function, the policy extracted from the Q function will be the optimal policy. \n",
    "    \n",
    "    #As shown below, for each state, we compute the Q values for all the actions in the state and\n",
    "    #then we extract policy by selecting the action which has maximum Q value.\n",
    "    \n",
    "    #for each state\n",
    "    for s in range(env.observation_space.n):\n",
    "        \n",
    "        #compute the Q value of all the actions in the state\n",
    "        Q_values = [sum([prob*(r + gamma * value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                \n",
    "        #extract policy by selecting the action which has maximum Q value\n",
    "        policy[s] = np.argmax(np.array(Q_values))        \n",
    "    \n",
    "    return policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "That's it! Now, we will see how to extract the optimal policy in our frozen lake\n",
    "environment. \n",
    "\n",
    "## Putting it all together\n",
    "We learn that in the frozen lake environment our goal is to find the optimal policy which\n",
    "selects the correct action in each state so that we can reach the state G from the state\n",
    "A without visiting the hole states.\n",
    "\n",
    "First, we compute the optimal value function using our `value_iteration` function by\n",
    "passing our frozen lake environment as the parameter: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_value_function = value_iteration(env=env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we extract the optimal policy from the optimal value function using our\n",
    "extract_policy function as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_policy = extract_policy(optimal_value_function)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print the obtained optimal policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(optimal_policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, our optimal policy tells us to\n",
    "perform the correct action in each state. \n",
    "\n",
    "Now, that we have learned what is value iteration and how to perform the value iteration\n",
    "method to compute the optimal policy in our frozen lake environment, in the next section\n",
    "we will learn about another interesting method called the policy iteration. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 03. Bellman Equation and Dynamic Programming/.ipynb_checkpoints/3.08. Solving the Frozen Lake Problem with Policy Iteration-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Solving the Frozen Lake Problem with Policy Iteration\n",
    "\n",
    "We learned that in the frozen lake environment, our goal is to reach the goal state G from\n",
    "the starting state S without visiting the hole states H. Now, let's learn how to compute the optimal policy using the policy iteration method in the frozen lake environment.\n",
    "\n",
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's create the frozen lake environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('FrozenLake-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We learned that in the policy iteration, we compute the value function using the policy\n",
    "iteratively. Once we found the optimal value function then the policy which is used to\n",
    "compute the optimal value function will be the optimal policy.\n",
    "\n",
    "So, first, let's learn how to compute the value function using the policy. \n",
    "\n",
    "\n",
    "## Computing value function using policy\n",
    "\n",
    "This step is exactly the same as how we computed the value function in the value iteration\n",
    "method but with a small difference. Here we compute the value function using the policy\n",
    "but in the value iteration method, we compute the value function by taking the maximum\n",
    "over Q values. Now, let's learn how to define a function that computes the value function\n",
    "using the given policy.\n",
    "\n",
    "\n",
    "Let's define a function called `compute_value_function` which takes the policy as a\n",
    "parameter:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_value_function(policy):\n",
    "    \n",
    "    #now, let's define the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #define the threshold value\n",
    "    threshold = 1e-20\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "    \n",
    "    #now, we will initialize the value table, with the value of all states to zero\n",
    "    value_table = np.zeros(env.observation_space.n)\n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        \n",
    "        #update the value table, that is, we learned that on every iteration, we use the updated value\n",
    "        #table (state values) from the previous iteration\n",
    "        updated_value_table = np.copy(value_table)\n",
    "        \n",
    "        \n",
    "\n",
    "        #thus, for each state, we select the action according to the given policy and then we update the\n",
    "        #value of the state using the selected action as shown below\n",
    "        \n",
    "        #for each state\n",
    "        for s in range(env.observation_space.n):\n",
    "            \n",
    "            #select the action in the state according to the policy\n",
    "            a = policy[s]\n",
    "            \n",
    "            #compute the value of the state using the selected action\n",
    "            value_table[s] = sum([prob * (r + gamma * updated_value_table[s_]) \n",
    "                                        for prob, s_, r, _ in env.P[s][a]])\n",
    "            \n",
    "        #after computing the value table, that is, value of all the states, we check whether the\n",
    "        #difference between value table obtained in the current iteration and previous iteration is\n",
    "        #less than or equal to a threshold value if it is less then we break the loop and return the\n",
    "        #value table as an accurate value function of the given policy\n",
    "\n",
    "        if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):\n",
    "            break\n",
    "            \n",
    "    return value_table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now that we have computed the value function of the policy, let's see how to extract the\n",
    "policy from the value function. \n",
    "\n",
    "## Extracting policy from the value function\n",
    "\n",
    "This step is exactly the same as how we extracted policy from the value function in the\n",
    "value iteration method. Thus, similar to what we learned in the value iteration method, we\n",
    "define a function called `extract_policy` to extract a policy given the value function as\n",
    "shown below:\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_policy(value_table):\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "     \n",
    "    #first, we initialize the policy with zeros, that is, first, we set the actions for all the states to\n",
    "    #be zero\n",
    "    policy = np.zeros(env.observation_space.n) \n",
    "    \n",
    "    #now, we compute the Q function using the optimal value function obtained from the\n",
    "    #previous step. After computing the Q function, we can extract policy by selecting action which has\n",
    "    #maximum Q value. Since we are computing the Q function using the optimal value\n",
    "    #function, the policy extracted from the Q function will be the optimal policy. \n",
    "    \n",
    "    #As shown below, for each state, we compute the Q values for all the actions in the state and\n",
    "    #then we extract policy by selecting the action which has maximum Q value.\n",
    "    \n",
    "    #for each state\n",
    "    for s in range(env.observation_space.n):\n",
    "        \n",
    "        #compute the Q value of all the actions in the state\n",
    "        Q_values = [sum([prob*(r + gamma * value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                \n",
    "        #extract policy by selecting the action which has maximum Q value\n",
    "        policy[s] = np.argmax(np.array(Q_values))        \n",
    "    \n",
    "    return policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Putting it all together\n",
    "\n",
    "First, let's define a function called `policy_iteration` which takes the environment as a\n",
    "parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy_iteration(env):\n",
    "    \n",
    "    #set the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #we learned that in the policy iteration method, we begin by initializing a random policy.\n",
    "    #so, we will initialize the random policy which selects the action 0 in all the states\n",
    "    policy = np.zeros(env.observation_space.n)  \n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        #compute the value function using the policy\n",
    "        value_function = compute_value_function(policy)\n",
    "        \n",
    "        #extract the new policy from the computed value function\n",
    "        new_policy = extract_policy(value_function)\n",
    "           \n",
    "        #if the policy and new_policy are same then break the loop\n",
    "        if (np.all(policy == new_policy)):\n",
    "            break\n",
    "        \n",
    "        #else, update the current policy to new_policy\n",
    "        policy = new_policy\n",
    "        \n",
    "    return policy\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, let's learn how to perform policy iteration and find the optimal policy in the frozen\n",
    "lake environment. \n",
    "\n",
    "So, we just feed the frozen lake environment to our `policy_iteration`\n",
    "function as shown below and get the optimal policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_policy = policy_iteration(env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print the optimal policy: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(optimal_policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, our optimal policy tells us to perform the correct action in each\n",
    "state. Thus, we learned how to perform the policy iteration method to compute the optimal\n",
    "policy. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 03. Bellman Equation and Dynamic Programming/3.06. Solving the Frozen Lake Problem with Value Iteration.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Solving the Frozen Lake Problem with Value Iteration\n",
    "\n",
    "In the previous chapter, we have learned about the frozen lake environment. The frozen\n",
    "lake environment is shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/4.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's recap the frozen lake environment a bit. In the frozen lake environment shown above,\n",
    "the following applies:\n",
    "    \n",
    "* S implies the starting state\n",
    "* F implies the frozen states\n",
    "* H implies the hold states\n",
    "* G implies the goal state\n",
    "\n",
    "We learned that in the frozen lake environment, our goal is to reach the goal state G from\n",
    "the starting state S without visiting the hole states H. That is, while trying to reach the goal\n",
    "state G from the starting state S if the agent visits the hole state H then it will fall into the\n",
    "hole and die as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, the goal of the agent is to reach the state G starting from the state S without visiting the\n",
    "hole states H as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/6.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How can we achieve this goal? That is, how can we reach the state G from S without\n",
    "visiting H? We learned that the optimal policy tells the agent to perform correct action in\n",
    "each state. So, if we find the optimal policy then we can reach the state G from S visiting the state H. Okay, how to find the optimal policy? We can use the value iteration method\n",
    "we just learned to find the optimal policy.\n",
    "\n",
    "\n",
    "Remember that all our states (S to G) will be encoded from 0 to 16 and all the four actions -\n",
    "left, down, up, right will be encoded from 0 to 3 in the gym toolkit.\n",
    "So, in this section, we will learn how to find the optimal policy using the value iteration\n",
    "method so that the agent can reach the state G from S without visiting H."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's create the frozen lake environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('FrozenLake-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the frozen lake environment using the render function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[41mS\u001b[0mFFF\n",
      "FHFH\n",
      "FFFH\n",
      "HFFG\n"
     ]
    }
   ],
   "source": [
    "env.render()"
   ]
  },
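  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before moving on, here is a quick sanity check (a minimal sketch; FrozenLake-v0 exposes its transition model through `env.P`). It confirms the state and action encodings we rely on, and shows what the entries of `env.P` look like, since our value iteration code will use them shortly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#number of states - 16 states encoded from 0 to 15\n",
    "print(env.observation_space.n)\n",
    "\n",
    "#number of actions - 4 actions encoded from 0 to 3\n",
    "print(env.action_space.n)\n",
    "\n",
    "#transitions for action 1 in state 0 - each entry is (prob, next_state, reward, done)\n",
    "print(env.P[0][1])"
   ]
  },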
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can notice, our agent is in the state S and it has to reach the state G without visiting\n",
    "the states H. So, let's learn how to compute the optimal policy using the value iteration\n",
    "method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's learn how to compute the optimal value function and then we will see how to\n",
    "extract the optimal policy from the computed optimal value function. \n",
    "\n",
    "\n",
    "## Computing optimal value function\n",
    "\n",
    "We will define a function called `value_iteration` where we compute the optimal value\n",
    "function iteratively by taking maximum over Q function. For\n",
    "better understanding, let's closely look at the every line of the function and then we look at\n",
    "the complete function at the end which gives us more clarity.\n",
    "\n",
    "\n",
    "\n",
    "Define `value_iteration` function which takes the environment as a parameter: \n",
    "    \n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def value_iteration(env):\n",
    "\n",
    "    #set the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #set the threshold number for checking the convergence of the value function\n",
    "    threshold = 1e-20\n",
    "    \n",
    "    #we also set the discount factor\n",
    "    gamma = 1.0\n",
    "    \n",
    "    #now, we will initialize the value table, with the value of all states to zero\n",
    "    value_table = np.zeros(env.observation_space.n)\n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        \n",
    "        #update the value table, that is, we learned that on every iteration, we use the updated value\n",
    "        #table (state values) from the previous iteration\n",
    "        updated_value_table = np.copy(value_table) \n",
    "             \n",
    "        #now, we compute the value function (state value) by taking the maximum of Q value.\n",
    "        \n",
    "        #thus, for each state, we compute the Q values of all the actions in the state and then\n",
    "        #we update the value of the state as the one which has maximum Q value as shown below:\n",
    "        for s in range(env.observation_space.n):\n",
    "            \n",
    "            Q_values = [sum([prob*(r + gamma * updated_value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                                        \n",
    "            value_table[s] = max(Q_values) \n",
    "                        \n",
    "        #after computing the value table, that is, value of all the states, we check whether the\n",
    "        #difference between value table obtained in the current iteration and previous iteration is\n",
    "        #less than or equal to a threshold value if it is less then we break the loop and return the\n",
    "        #value table as our optimal value function as shown below:\n",
    "    \n",
    "        if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):\n",
    "             break\n",
    "    \n",
    "    return value_table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, that we have computed the optimal value function by taking the maximum over Q\n",
    "values, let's see how to extract the optimal policy from the optimal value function. \n",
    "\n",
    "\n",
    "## Extracting optimal policy from the optimal value function\n",
    "\n",
    "In the previous step, we computed the optimal value function. Now, let see how to extract\n",
    "the optimal policy from the computed optimal value function.\n",
    "\n",
    "\n",
    "First, we define a function called `extract_policy` which takes the `value_table` as a\n",
    "parameter: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_policy(value_table):\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "     \n",
    "    #first, we initialize the policy with zeros, that is, first, we set the actions for all the states to\n",
    "    #be zero\n",
    "    policy = np.zeros(env.observation_space.n) \n",
    "    \n",
    "    #now, we compute the Q function using the optimal value function obtained from the\n",
    "    #previous step. After computing the Q function, we can extract policy by selecting action which has\n",
    "    #maximum Q value. Since we are computing the Q function using the optimal value\n",
    "    #function, the policy extracted from the Q function will be the optimal policy. \n",
    "    \n",
    "    #As shown below, for each state, we compute the Q values for all the actions in the state and\n",
    "    #then we extract policy by selecting the action which has maximum Q value.\n",
    "    \n",
    "    #for each state\n",
    "    for s in range(env.observation_space.n):\n",
    "        \n",
    "        #compute the Q value of all the actions in the state\n",
    "        Q_values = [sum([prob*(r + gamma * value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                \n",
    "        #extract policy by selecting the action which has maximum Q value\n",
    "        policy[s] = np.argmax(np.array(Q_values))        \n",
    "    \n",
    "    return policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "That's it! Now, we will see how to extract the optimal policy in our frozen lake\n",
    "environment. \n",
    "\n",
    "## Putting it all together\n",
    "We learn that in the frozen lake environment our goal is to find the optimal policy which\n",
    "selects the correct action in each state so that we can reach the state G from the state\n",
    "A without visiting the hole states.\n",
    "\n",
    "First, we compute the optimal value function using our `value_iteration` function by\n",
    "passing our frozen lake environment as the parameter: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_value_function = value_iteration(env=env)"
   ]
  },
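  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can take a quick look at the computed optimal value function, that is, the value of each of the 16 states (this cell is just for inspection):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(optimal_value_function)"
   ]
  },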
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we extract the optimal policy from the optimal value function using our\n",
    "extract_policy function as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_policy = extract_policy(optimal_value_function)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print the obtained optimal policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(optimal_policy)"
   ]
  },
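  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The policy is printed as one action number per state. To make it easier to read, here is a minimal sketch that displays the policy as a 4 x 4 grid of action names, assuming gym's FrozenLake action encoding (0 = left, 1 = down, 2 = right, 3 = up):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#map the action numbers to action names and print the policy row by row\n",
    "action_names = ['left', 'down', 'right', 'up']\n",
    "for row in optimal_policy.reshape(4, 4):\n",
    "    print([action_names[int(a)] for a in row])"
   ]
  },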
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, our optimal policy tells us to\n",
    "perform the correct action in each state. \n",
    "\n",
    "Now, that we have learned what is value iteration and how to perform the value iteration\n",
    "method to compute the optimal policy in our frozen lake environment, in the next section\n",
    "we will learn about another interesting method called the policy iteration. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 03. Bellman Equation and Dynamic Programming/3.08. Solving the Frozen Lake Problem with Policy Iteration.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Solving the Frozen Lake Problem with Policy Iteration\n",
    "\n",
    "We learned that in the frozen lake environment, our goal is to reach the goal state G from\n",
    "the starting state S without visiting the hole states H. Now, let's learn how to compute the optimal policy using the policy iteration method in the frozen lake environment.\n",
    "\n",
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's create the frozen lake environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('FrozenLake-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We learned that in the policy iteration, we compute the value function using the policy\n",
    "iteratively. Once we found the optimal value function then the policy which is used to\n",
    "compute the optimal value function will be the optimal policy.\n",
    "\n",
    "So, first, let's learn how to compute the value function using the policy. \n",
    "\n",
    "\n",
    "## Computing value function using policy\n",
    "\n",
    "This step is exactly the same as how we computed the value function in the value iteration\n",
    "method but with a small difference. Here we compute the value function using the policy\n",
    "but in the value iteration method, we compute the value function by taking the maximum\n",
    "over Q values. Now, let's learn how to define a function that computes the value function\n",
    "using the given policy.\n",
    "\n",
    "\n",
    "Let's define a function called `compute_value_function` which takes the policy as a\n",
    "parameter:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_value_function(policy):\n",
    "    \n",
    "    #now, let's define the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #define the threshold value\n",
    "    threshold = 1e-20\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "    \n",
    "    #now, we will initialize the value table, with the value of all states to zero\n",
    "    value_table = np.zeros(env.observation_space.n)\n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        \n",
    "        #update the value table, that is, we learned that on every iteration, we use the updated value\n",
    "        #table (state values) from the previous iteration\n",
    "        updated_value_table = np.copy(value_table)\n",
    "        \n",
    "        \n",
    "\n",
    "        #thus, for each state, we select the action according to the given policy and then we update the\n",
    "        #value of the state using the selected action as shown below\n",
    "        \n",
    "        #for each state\n",
    "        for s in range(env.observation_space.n):\n",
    "            \n",
    "            #select the action in the state according to the policy\n",
    "            a = policy[s]\n",
    "            \n",
    "            #compute the value of the state using the selected action\n",
    "            value_table[s] = sum([prob * (r + gamma * updated_value_table[s_]) \n",
    "                                        for prob, s_, r, _ in env.P[s][a]])\n",
    "            \n",
    "        #after computing the value table, that is, value of all the states, we check whether the\n",
    "        #difference between value table obtained in the current iteration and previous iteration is\n",
    "        #less than or equal to a threshold value if it is less then we break the loop and return the\n",
    "        #value table as an accurate value function of the given policy\n",
    "\n",
    "        if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):\n",
    "            break\n",
    "            \n",
    "    return value_table"
   ]
  },
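  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For instance (a quick illustration, not part of the algorithm itself), we can evaluate the trivial policy that selects action 0 in every state. This is exactly the policy that policy iteration starts from in the next steps, and its value function will be poor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#evaluate the policy that selects action 0 in every state\n",
    "initial_policy = np.zeros(env.observation_space.n)\n",
    "print(compute_value_function(initial_policy))"
   ]
  },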
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now that we have computed the value function of the policy, let's see how to extract the\n",
    "policy from the value function. \n",
    "\n",
    "## Extracting policy from the value function\n",
    "\n",
    "This step is exactly the same as how we extracted policy from the value function in the\n",
    "value iteration method. Thus, similar to what we learned in the value iteration method, we\n",
    "define a function called `extract_policy` to extract a policy given the value function as\n",
    "shown below:\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_policy(value_table):\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "     \n",
    "    #first, we initialize the policy with zeros, that is, first, we set the actions for all the states to\n",
    "    #be zero\n",
    "    policy = np.zeros(env.observation_space.n) \n",
    "    \n",
    "    #now, we compute the Q function using the optimal value function obtained from the\n",
    "    #previous step. After computing the Q function, we can extract policy by selecting action which has\n",
    "    #maximum Q value. Since we are computing the Q function using the optimal value\n",
    "    #function, the policy extracted from the Q function will be the optimal policy. \n",
    "    \n",
    "    #As shown below, for each state, we compute the Q values for all the actions in the state and\n",
    "    #then we extract policy by selecting the action which has maximum Q value.\n",
    "    \n",
    "    #for each state\n",
    "    for s in range(env.observation_space.n):\n",
    "        \n",
    "        #compute the Q value of all the actions in the state\n",
    "        Q_values = [sum([prob*(r + gamma * value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                \n",
    "        #extract policy by selecting the action which has maximum Q value\n",
    "        policy[s] = np.argmax(np.array(Q_values))        \n",
    "    \n",
    "    return policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Putting it all together\n",
    "\n",
    "First, let's define a function called `policy_iteration` which takes the environment as a\n",
    "parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy_iteration(env):\n",
    "    \n",
    "    #set the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #we learned that in the policy iteration method, we begin by initializing a random policy.\n",
    "    #so, we will initialize the random policy which selects the action 0 in all the states\n",
    "    policy = np.zeros(env.observation_space.n)  \n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        #compute the value function using the policy\n",
    "        value_function = compute_value_function(policy)\n",
    "        \n",
    "        #extract the new policy from the computed value function\n",
    "        new_policy = extract_policy(value_function)\n",
    "           \n",
    "        #if the policy and new_policy are same then break the loop\n",
    "        if (np.all(policy == new_policy)):\n",
    "            break\n",
    "        \n",
    "        #else, update the current policy to new_policy\n",
    "        policy = new_policy\n",
    "        \n",
    "    return policy\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, let's learn how to perform policy iteration and find the optimal policy in the frozen\n",
    "lake environment. \n",
    "\n",
    "So, we just feed the frozen lake environment to our `policy_iteration`\n",
    "function as shown below and get the optimal policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_policy = policy_iteration(env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print the optimal policy: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(optimal_policy)"
   ]
  },
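  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional check (a minimal sketch, not part of the policy iteration algorithm), we can roll out the learned policy for several episodes and measure how often the agent reaches the goal. Since FrozenLake-v0 is slippery, that is, the transitions are stochastic, even the optimal policy will not reach the goal on every episode:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#roll out the learned policy and count how often the agent reaches the goal\n",
    "num_episodes = 1000\n",
    "num_success = 0\n",
    "\n",
    "for episode in range(num_episodes):\n",
    "    state = env.reset()\n",
    "    done = False\n",
    "    while not done:\n",
    "        state, reward, done, info = env.step(int(optimal_policy[state]))\n",
    "    #the reward on the final step is 1 only when the agent reaches the goal\n",
    "    num_success += reward\n",
    "\n",
    "print(num_success / num_episodes)"
   ]
  },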
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, our optimal policy tells us to perform the correct action in each\n",
    "state. Thus, we learned how to perform the policy iteration method to compute the optimal\n",
    "policy. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 03. Bellman Equation and Dynamic Programming/README.md
================================================
# 3. Bellman Equation and Dynamic Programming
* 3.1. The Bellman Equation
   * 3.1.1. Bellman Equation of the Value Function
   * 3.1.2. Bellman Equation of the Q Function
* 3.2. Bellman Optimality Equation
* 3.3. Relation Between Value and Q Function
* 3.4. Dynamic Programming
* 3.5. Value Iteration
   * 3.5.1. Algorithm - Value Iteration
* 3.6. Solving the Frozen Lake Problem with Value Iteration
* 3.7. Policy Iteration
   * 3.7.1. Algorithm - Policy Iteration
* 3.8. Solving the Frozen Lake Problem with Policy Iteration
* 3.9. Is DP Applicable to all Environments?

================================================
FILE: 04. Monte Carlo Methods/.ipynb_checkpoints/4.01. Understanding the Monte Carlo Method-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Understanding the Monte Carlo method\n",
    "\n",
    "Before understanding how the Monte Carlo method is useful in reinforcement learning, first, let's understand what is Monte Carlo method and how does it work. The Monte Carlo method is a statistical technique used to find an approximate solution through sampling. \n",
    "\n",
    "For instance, the Monte Carlo method approximates the expectation of a random variable by sampling and when the sample size is greater the approximation will be better. Let's suppose we have a random variable X and say we need to compute the expected value of X, that is E[X], then we can compute it by taking the sum of values of X multiplied by their respective probabilities as shown below:\n",
    "\n",
    "$$ E(X) = \\sum_{i=1}^N x_i p(x_i) $$\n",
    "\n",
    "But instead of computing the expectation like this, can we approximate them with the Monte Carlo method? Yes! We can estimate the expected value of X by just sampling the values of X for some N times and compute the average value of X as the expected value of X as shown below:\n",
    "\n",
    "$$ \\mathbb{E}_{x \\sim p(x)}[X]  \\approx \\frac{1}{N} \\sum_i x_i $$\n",
    "\n",
    "\n",
    "When N is larger our approximation will be better. Thus, with the Monte Carlo method, we can approximate the solution through sampling and our approximation will be better when the sample size is large.\n",
    "\n",
    "In the upcoming sections, we will learn how exactly the Monte Carlo method is used in reinforcement learning. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 04. Monte Carlo Methods/.ipynb_checkpoints/4.02.  Prediction and control tasks-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prediction and control tasks\n",
    "\n",
    "In reinforcement learning, we perform two important tasks, and they are:\n",
    "* The prediction task\n",
    "* The control task\n",
    "\n",
    "## Prediction task\n",
    "In the prediction task, a policy π is given as an input and we try to predict the value\n",
    "function or Q function using the given policy. But what is the use of doing this? Our\n",
    "goal is to evaluate the given policy.That is, we need to determine whether the given policy is good or bad.  How can we determine that? If the agent obtains\n",
    "a good return using the given policy then we can say that our policy is good. Thus,\n",
    "to evaluate the given policy, we need to understand what the return the agent would\n",
    "obtain if it uses the given policy. To obtain the return, we predict the value function\n",
    "or Q function using the given policy.\n",
    "\n",
    "That is, we learned that the value function or value of a state denotes the expected\n",
    "return an agent would obtain starting from that state following some policy π. Thus,\n",
    "by predicting the value function using the given policy π, we can understand what\n",
    "the expected return the agent would obtain in each state if it uses the given\n",
    "policy π. If the return is good then we can say that the given policy is good.\n",
    "\n",
    "Similarly, we learned that the Q function or Q value denotes the expected return the\n",
    "agent would obtain starting from the state s and an action a following the policy π .\n",
    "Thus, predicting the Q function using the given policy π, we can understand what the\n",
    "expected return the agent would obtain in each state-action pair if it uses the given\n",
    "policy. If the return is good then we can say that the given policy is good.\n",
    "\n",
    "Thus, we can evaluate the given policy π by computing the value and Q functions.\n",
    "Note that, in the prediction task, we don't make any change to the given input policy.\n",
    "We keep the given policy as fixed and predict the value function or Q function using\n",
    "the given policy and obtain the expected return. Based on the expected return, we\n",
    "evaluate the given policy.\n",
    "\n",
    "\n",
    "## Control task\n",
    "\n",
    "Unlike the prediction task, in the control task, we will not be given any policy as\n",
    "an input. In the control task, our goal is to find the optimal policy. So, we will start\n",
    "off by initializing a random policy and we try to find the optimal policy iteratively.\n",
    "That is, we try to find an optimal policy that gives the maximum return.\n",
    "\n",
    "Thus, in a nutshell, in the prediction task, we evaluate the given input policy by\n",
    "predicting the value function or Q function, which helps us to understand the\n",
    "expected return an agent would get if it uses the given policy, while in the control\n",
    "task our goal is to find the optimal policy and we will not be given any policy as\n",
    "input; so we will start off by initializing a random policy and we try to find the\n",
    "optimal policy iteratively.\n",
    "\n",
    "Now that we have understood what prediction and control tasks are, in the next\n",
    "section, we will learn how to use the Monte Carlo method for performing the\n",
    "prediction and control tasks."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 04. Monte Carlo Methods/.ipynb_checkpoints/4.05. Every-visit MC Prediction with Blackjack Game-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Every-visit MC prediction with blackjack game\n",
    "\n",
    "To understand this section clearly, you can recap every visit Monte Carlo method we\n",
    "learned earlier. Let's now understand how to implement the every-visit MC prediction with\n",
    "the blackjack game step by step:\n",
    "\n",
    "Import the necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import pandas as pd\n",
    "from collections import defaultdict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a blackjack environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('Blackjack-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining a policy\n",
    "\n",
    "We learned that in the prediction method, we will be given an input policy and we predict\n",
    "the value function of the given input policy. So, now, we first define a policy function\n",
    "which acts as an input policy. That is, we define the input policy whose value function will\n",
    "be predicted in the upcoming steps.\n",
    "\n",
    "As shown below, our policy function takes the state as an input and if the `state[0]`, sum of\n",
    "our cards value is greater than `19`, then it will return action 0 (stand) else it will return\n",
    "action 1 (hit):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy(state):\n",
    "    return 0 if state[0] > 19 else 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We defined an optimal policy, that is, it makes more sense to perform an action 0 (stand)\n",
    "when our sum value is already greater than 19. That is, when the sum value is greater than\n",
    "19 we don't have to perform 1 (hit) action and receive a new card which may cause us to\n",
    "lose the game or burst.\n",
    "\n",
    "For example, let's generate an initial state by resetting the environment as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(20, 7, False)\n"
     ]
    }
   ],
   "source": [
    "state = env.reset()\n",
    "print(state)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can notice, `state[0] = 20`, that is our sum of cards value is 20, so in this case, our\n",
    "policy will return the action 0 (stand) as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n"
     ]
    }
   ],
   "source": [
    "print(policy(state))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, that we have defined the policy, in the next section, we will predict the value\n",
    "function (state values) of this policy. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating an episode\n",
    "Next, we generate an episode using the given policy, so, we, define a function\n",
    "called `generate_episode` which takes the policy as an input and generates the episode\n",
    "using the given policy.\n",
    "\n",
    "First, let's set the number of time steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_timestep = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_episode(policy):\n",
    "    \n",
    "    #let's define a list called episode for storing the episode\n",
    "    episode = []\n",
    "    \n",
    "    #initialize the state by resetting the environment\n",
    "    state = env.reset()\n",
    "    \n",
    "    #then for each time step\n",
    "    for i in range(num_timestep):\n",
    "        \n",
    "        #select the action according to the given policy\n",
    "        action = policy(state)\n",
    "        \n",
    "        #perform the action and store the next state information\n",
    "        next_state, reward, done, info = env.step(action)\n",
    "        \n",
    "        #store the state, action, reward into our episode list\n",
    "        episode.append((state, action, reward))\n",
    "        \n",
    "        #If the next state is a final state then break the loop else update the next state to the current state\n",
    "        if done:\n",
    "            break\n",
    "            \n",
    "        state = next_state\n",
    "\n",
    "    return episode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at how the output of our `generate_episode` function looks like. Note\n",
    "that we generate episode using the policy we defined earlier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[((12, 10, False), 1, 0), ((15, 10, False), 1, -1)]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_episode(policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe our output is in the form of `[(state, action, reward)]`. As shown above,\n",
    "we have two states in our episode. We performed action 1 (hit) in the state `(10, 2,\n",
    "False)` and received a 0 reward and the action 0 (stand) in the state `(20, 2, False)` and\n",
    "received 1.0 reward.\n",
    "\n",
    "Now that we have learned how to generate an episode using the given policy, next, we will\n",
    "look at how to compute the value of the state (value function) using every visit-MC\n",
    "method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing the value function\n",
    "\n",
    "We learned that in order to predict the value function, we generate several episodes using\n",
    "the given policy and compute the value of the state as an average return across several\n",
    "episodes. Let's see how to do implement that.\n",
    "\n",
    "First, we define the `total_return` and `N` as a dictionary for storing the total return and the\n",
    "number of times the state is visited across episodes respectively. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_return = defaultdict(float)\n",
    "N = defaultdict(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the number of iterations, that is, the number of episodes, we want to generate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_iterations = 500000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "#then, for every iteration\n",
    "for i in range(num_iterations):\n",
    "    \n",
    "    #generate the episode using the given policy, that is, generate an episode using the policy\n",
    "    #function we defined earlier\n",
    "    episode = generate_episode(policy)\n",
    "    \n",
    "    #store all the states, actions, rewards obtained from the episode\n",
    "    states, actions, rewards = zip(*episode)\n",
    "    \n",
    "    #then for each step in the episode \n",
    "    for t, state in enumerate(states):\n",
    "        \n",
    "            #compute the return R of the state as the sum of reward\n",
    "            R = (sum(rewards[t:]))\n",
    "            \n",
    "            #update the total_return of the state\n",
    "            total_return[state] =  total_return[state] + R\n",
    "            \n",
    "            #update the number of times the state is visited in the episode\n",
    "            N[state] =  N[state] + 1"
   ]
  },
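  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point the value estimates can already be read directly off the two dictionaries. Here is a minimal sketch of that dictionary-only alternative (the pandas version in the next steps is just for easier inspection):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#the value of a state is just its total return divided by its visit count\n",
    "value = {state: total_return[state] / N[state] for state in total_return}\n",
    "\n",
    "#for example, look at the value estimate of one state\n",
    "print(value.get((21, 9, False)))"
   ]
  },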
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After computing the `total_return` and `N` We can just convert them into a pandas data\n",
    "frame for a better understanding. [Note that this is just to give a clear understanding of the\n",
    "algorithm, we don't necessarily have to convert to the pandas data frame, we can also\n",
    "implement this efficiently just using the dictionary]\n",
    "\n",
    "\n",
    "Convert `total_return` dictionary to a data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_return = pd.DataFrame(total_return.items(),columns=['state', 'total_return'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert the counter `N` dictionary to a data frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = pd.DataFrame(N.items(),columns=['state', 'N'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Merge the two data frames on states:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.merge(total_return, N, on=\"state\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at the first few rows of the data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>state</th>\n",
       "      <th>total_return</th>\n",
       "      <th>N</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>(7, 7, False)</td>\n",
       "      <td>-4.0</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>(11, 7, False)</td>\n",
       "      <td>19.0</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>(16, 7, False)</td>\n",
       "      <td>-38.0</td>\n",
       "      <td>104</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>(19, 7, False)</td>\n",
       "      <td>55.0</td>\n",
       "      <td>113</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>(20, 8, False)</td>\n",
       "      <td>96.0</td>\n",
       "      <td>129</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>(20, 2, False)</td>\n",
       "      <td>94.0</td>\n",
       "      <td>142</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>(15, 5, False)</td>\n",
       "      <td>-42.0</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>(20, 5, False)</td>\n",
       "      <td>62.0</td>\n",
       "      <td>115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>(12, 3, False)</td>\n",
       "      <td>-55.0</td>\n",
       "      <td>91</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>(15, 3, False)</td>\n",
       "      <td>-36.0</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            state  total_return    N\n",
       "0   (7, 7, False)          -4.0   16\n",
       "1  (11, 7, False)          19.0   43\n",
       "2  (16, 7, False)         -38.0  104\n",
       "3  (19, 7, False)          55.0  113\n",
       "4  (20, 8, False)          96.0  129\n",
       "5  (20, 2, False)          94.0  142\n",
       "6  (15, 5, False)         -42.0   93\n",
       "7  (20, 5, False)          62.0  115\n",
       "8  (12, 3, False)         -55.0   91\n",
       "9  (15, 3, False)         -36.0   96"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe from above, we have the total return and\n",
    "the number of times the state is visited.\n",
    "\n",
    "Next, we can compute the value of the state as the average return, thus, we can write:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['value'] = df['total_return']/df['N']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first few rows of the data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>state</th>\n",
       "      <th>total_return</th>\n",
       "      <th>N</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>(7, 7, False)</td>\n",
       "      <td>-4.0</td>\n",
       "      <td>16</td>\n",
       "      <td>-0.250000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>(11, 7, False)</td>\n",
       "      <td>19.0</td>\n",
       "      <td>43</td>\n",
       "      <td>0.441860</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>(16, 7, False)</td>\n",
       "      <td>-38.0</td>\n",
       "      <td>104</td>\n",
       "      <td>-0.365385</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>(19, 7, False)</td>\n",
       "      <td>55.0</td>\n",
       "      <td>113</td>\n",
       "      <td>0.486726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>(20, 8, False)</td>\n",
       "      <td>96.0</td>\n",
       "      <td>129</td>\n",
       "      <td>0.744186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>(20, 2, False)</td>\n",
       "      <td>94.0</td>\n",
       "      <td>142</td>\n",
       "      <td>0.661972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>(15, 5, False)</td>\n",
       "      <td>-42.0</td>\n",
       "      <td>93</td>\n",
       "      <td>-0.451613</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>(20, 5, False)</td>\n",
       "      <td>62.0</td>\n",
       "      <td>115</td>\n",
       "      <td>0.539130</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>(12, 3, False)</td>\n",
       "      <td>-55.0</td>\n",
       "      <td>91</td>\n",
       "      <td>-0.604396</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>(15, 3, False)</td>\n",
       "      <td>-36.0</td>\n",
       "      <td>96</td>\n",
       "      <td>-0.375000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            state  total_return    N     value\n",
       "0   (7, 7, False)          -4.0   16 -0.250000\n",
       "1  (11, 7, False)          19.0   43  0.441860\n",
       "2  (16, 7, False)         -38.0  104 -0.365385\n",
       "3  (19, 7, False)          55.0  113  0.486726\n",
       "4  (20, 8, False)          96.0  129  0.744186\n",
       "5  (20, 2, False)          94.0  142  0.661972\n",
       "6  (15, 5, False)         -42.0   93 -0.451613\n",
       "7  (20, 5, False)          62.0  115  0.539130\n",
       "8  (12, 3, False)         -55.0   91 -0.604396\n",
       "9  (15, 3, False)         -36.0   96 -0.375000"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "scrolled": false
   },
   "source": [
    "As we can observe we now have the value of the state which is just the average of a return\n",
    "of the state across several episodes. Thus, we have successfully predicted the value function\n",
    "of the given policy using the every-visit MC method.\n",
    "\n",
    "Okay, let's check the value of some states and understand how accurately our value\n",
    "function is estimated according to the given policy. Recall that when we started off, to\n",
    "generate episodes, we used the optimal policy which selects action 0 (stand) when the sum\n",
    "value is greater than 19 and action 1 (hit) when the sum value is less than 19.\n",
    "\n",
    "Let's evaluate the value of the state `(21,9,False)`, as we can observe, our sum of cards\n",
    "value is already 21 and so this is a good state and should have a high value. Let's see what's\n",
    "our estimated value of the state:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.90163934])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df['state']==(21,9,False)]['value'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe above our value of the state is high.\n",
    "Now, let's check the value of the state `(5,8,False)` as we can notice, our sum of cards\n",
    "value is just 5 and even the one dealer's single card has a high value, 8, then, in this case,\n",
    "the value of the state should be less. Let's see what's our estimated value of the state:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.08333333])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df['state']==(5,8,False)]['value'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can notice, the value of the state is less.\n",
    "Thus, we learned how to predict the value function of the given policy using the every-visit\n",
    "MC prediction method, in the next section, we will look at how to compute the value of the\n",
    "state using the first-visit mC method. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 04. Monte Carlo Methods/.ipynb_checkpoints/4.06. First-visit MC Prediction with Blackjack Game-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# First-visit MC prediction with blackjack game\n",
    "\n",
    "To understand this section clearly, you can recap first visit Monte Carlo method we\n",
    "learned earlier. Let's now understand how to implement the first-visit MC prediction with\n",
    "the blackjack game step by step:\n",
    "\n",
    "Import the necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import pandas as pd\n",
    "from collections import defaultdict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a blackjack environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('Blackjack-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining a policy\n",
    "\n",
    "We learned that in the prediction method, we will be given an input policy and we predict\n",
    "the value function of the given input policy. So, now, we first define a policy function\n",
    "which acts as an input policy. That is, we define the input policy whose value function will\n",
    "be predicted in the upcoming steps.\n",
    "\n",
    "As shown below, our policy function takes the state as an input and if the `state[0]`, sum of\n",
    "our cards value is greater than 19, then it will return action 0 (stand) else it will return\n",
    "action 1 (hit):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy(state):\n",
    "    return 0 if state[0] > 19 else 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We defined an optimal policy, that is, it makes more sense to perform an action 0 (stand)\n",
    "when our sum value is already greater than 19. That is, when the sum value is greater than\n",
    "19 we don't have to perform 1 (hit) action and receive a new card which may cause us to\n",
    "lose the game or burst.\n",
    "\n",
    "For example, let's generate an initial state by resetting the environment as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(11, 6, False)\n"
     ]
    }
   ],
   "source": [
    "state = env.reset()\n",
    "print(state)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can notice, `state[0] = 11`, that is our sum of cards value is 11, so in this case, our\n",
    "policy will return the action 1 (hit) as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    }
   ],
   "source": [
    "print(policy(state))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, that we have defined the policy, in the next section, we will predict the value\n",
    "function (state values) of this policy. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating an episode\n",
    "Next, we generate an episode using the given policy, so, we, define a function\n",
    "called `generate_episode` which takes the policy as an input and generates the episode\n",
    "using the given policy.\n",
    "\n",
    "First, let's set the number of time steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_timestep = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_episode(policy):\n",
    "    \n",
    "    #let's define a list called episode for storing the episode\n",
    "    episode = []\n",
    "    \n",
    "    #initialize the state by resetting the environment\n",
    "    state = env.reset()\n",
    "    \n",
    "    #then for each time step\n",
    "    for i in range(num_timestep):\n",
    "        \n",
    "        #select the action according to the given policy\n",
    "        action = policy(state)\n",
    "        \n",
    "        #perform the action and store the next state information\n",
    "        next_state, reward, done, info = env.step(action)\n",
    "        \n",
    "        #store the state, action, reward into our episode list\n",
    "        episode.append((state, action, reward))\n",
    "        \n",
    "        #If the next state is a final state then break the loop else update the next state to the current state\n",
    "        if done:\n",
    "            break\n",
    "            \n",
    "        state = next_state\n",
    "\n",
    "    return episode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at how the output of our `generate_episode` function looks like. Note\n",
    "that we generate episode using the policy we defined earlier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[((15, 10, False), 1, -1)]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_episode(policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe our output is in the form of `[(state, action, reward)]`. As shown above,\n",
    "we have two states in our episode. We performed action 1 (hit) in the state `(10, 2, False)` and received a 0 reward and the action 0 (stand) in the state `(20, 2, False)` and\n",
    "received 1.0 reward.\n",
    "\n",
    "Now that we have learned how to generate an episode using the given policy, next, we will\n",
    "look at how to compute the value of the state (value function) using first visit-MC\n",
    "method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing the value function\n",
    "\n",
    "We learned that in order to predict the value function, we generate several episodes using\n",
    "the given policy and compute the value of the state as an average return across several\n",
    "episodes. Let's see how to do implement that.\n",
    "\n",
    "First, we define the `total_return` and `N` as a dictionary for storing the total return and the\n",
    "number of times the state is visited across episodes respectively. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_return = defaultdict(float)\n",
    "N = defaultdict(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the number of iterations, that is, the number of episodes, we want to generate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_iterations = 10000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "#then, for every iteration\n",
    "for i in range(num_iterations):\n",
    "    \n",
    "    #generate the episode using the given policy, that is, generate an episode using the policy\n",
    "    #function we defined earlier\n",
    "    episode = generate_episode(policy)\n",
    "    \n",
    "    #store all the states, actions, rewards obtained from the episode\n",
    "    states, actions, rewards = zip(*episode)\n",
    "    \n",
    "    #then, for each step in the episode\n",
    "    for t, state in enumerate(states):\n",
    "        \n",
    "        #if the state is not visited already\n",
    "        if state not in states[0:t]:\n",
    "                \n",
    "            #compute the return R of the state as the sum of reward\n",
    "            R = (sum(rewards[t:]))\n",
    "            \n",
    "            #update the total_return of the state\n",
    "            total_return[state] =  total_return[state] + R\n",
    "            \n",
    "            #update the number of times the state is visited in the episode\n",
    "            N[state] =  N[state] + 1"
   ]
  },
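  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the first-visit check does, here is a tiny illustration (a minimal sketch with a made-up episode) showing which time steps survive the `state not in states[0:t]` filter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#a made-up toy episode where the state 's1' repeats\n",
    "toy_states = ['s1', 's2', 's1', 's3']\n",
    "\n",
    "#only the first occurrence of each state passes the first-visit check,\n",
    "#so the repeated visit to 's1' at time step 2 is skipped\n",
    "for t, state in enumerate(toy_states):\n",
    "    if state not in toy_states[0:t]:\n",
    "        print(t, state)"
   ]
  },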
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After computing the `total_return` and `N` We can just convert them into a pandas data\n",
    "frame for a better understanding. [Note that this is just to give a clear understanding of the\n",
    "algorithm, we don't necessarily have to convert to the pandas data frame, we can also\n",
    "implement this efficiently just using the dictionary]\n",
    "\n",
    "\n",
    "Convert `total_returns` dictionary to a data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_return = pd.DataFrame(total_return.items(),columns=['state', 'total_return'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert the counter `N` dictionary to a data frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = pd.DataFrame(N.items(),columns=['state', 'N'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Merge the two data frames on states:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.merge(total_return, N, on=\"state\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at the first few rows of the data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>state</th>\n",
       "      <th>total_return</th>\n",
       "      <th>N</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>(16, 3, False)</td>\n",
       "      <td>-53.0</td>\n",
       "      <td>98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>(11, 9, False)</td>\n",
       "      <td>6.0</td>\n",
       "      <td>49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>(21, 9, False)</td>\n",
       "      <td>67.0</td>\n",
       "      <td>68</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>(9, 8, False)</td>\n",
       "      <td>-3.0</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>(16, 8, False)</td>\n",
       "      <td>-60.0</td>\n",
       "      <td>95</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>(17, 8, False)</td>\n",
       "      <td>-69.0</td>\n",
       "      <td>117</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>(12, 5, False)</td>\n",
       "      <td>-38.0</td>\n",
       "      <td>91</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>(20, 5, False)</td>\n",
       "      <td>95.0</td>\n",
       "      <td>122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>(9, 6, False)</td>\n",
       "      <td>2.0</td>\n",
       "      <td>39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>(14, 6, False)</td>\n",
       "      <td>-53.0</td>\n",
       "      <td>115</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            state  total_return    N\n",
       "0  (16, 3, False)         -53.0   98\n",
       "1  (11, 9, False)           6.0   49\n",
       "2  (21, 9, False)          67.0   68\n",
       "3   (9, 8, False)          -3.0   27\n",
       "4  (16, 8, False)         -60.0   95\n",
       "5  (17, 8, False)         -69.0  117\n",
       "6  (12, 5, False)         -38.0   91\n",
       "7  (20, 5, False)          95.0  122\n",
       "8   (9, 6, False)           2.0   39\n",
       "9  (14, 6, False)         -53.0  115"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe from above, we have the total return and\n",
    "the number of times the state is visited.\n",
    "\n",
    "Next, we can compute the value of the state as the average return, thus, we can write:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['value'] = df['total_return']/df['N']"
   ]
  },
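  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That is, for each state `s`, the estimated value is the sample mean of its observed returns:\n",
    "`V(s) = total_return(s) / N(s)`."
   ]
  },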
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first few rows of the data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>state</th>\n",
       "      <th>total_return</th>\n",
       "      <th>N</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>(16, 3, False)</td>\n",
       "      <td>-53.0</td>\n",
       "      <td>98</td>\n",
       "      <td>-0.540816</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>(11, 9, False)</td>\n",
       "      <td>6.0</td>\n",
       "      <td>49</td>\n",
       "      <td>0.122449</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>(21, 9, False)</td>\n",
       "      <td>67.0</td>\n",
       "      <td>68</td>\n",
       "      <td>0.985294</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>(9, 8, False)</td>\n",
       "      <td>-3.0</td>\n",
       "      <td>27</td>\n",
       "      <td>-0.111111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>(16, 8, False)</td>\n",
       "      <td>-60.0</td>\n",
       "      <td>95</td>\n",
       "      <td>-0.631579</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>(17, 8, False)</td>\n",
       "      <td>-69.0</td>\n",
       "      <td>117</td>\n",
       "      <td>-0.589744</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>(12, 5, False)</td>\n",
       "      <td>-38.0</td>\n",
       "      <td>91</td>\n",
       "      <td>-0.417582</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>(20, 5, False)</td>\n",
       "      <td>95.0</td>\n",
       "      <td>122</td>\n",
       "      <td>0.778689</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>(9, 6, False)</td>\n",
       "      <td>2.0</td>\n",
       "      <td>39</td>\n",
       "      <td>0.051282</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>(14, 6, False)</td>\n",
       "      <td>-53.0</td>\n",
       "      <td>115</td>\n",
       "      <td>-0.460870</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            state  total_return    N     value\n",
       "0  (16, 3, False)         -53.0   98 -0.540816\n",
       "1  (11, 9, False)           6.0   49  0.122449\n",
       "2  (21, 9, False)          67.0   68  0.985294\n",
       "3   (9, 8, False)          -3.0   27 -0.111111\n",
       "4  (16, 8, False)         -60.0   95 -0.631579\n",
       "5  (17, 8, False)         -69.0  117 -0.589744\n",
       "6  (12, 5, False)         -38.0   91 -0.417582\n",
       "7  (20, 5, False)          95.0  122  0.778689\n",
       "8   (9, 6, False)           2.0   39  0.051282\n",
       "9  (14, 6, False)         -53.0  115 -0.460870"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "scrolled": false
   },
   "source": [
    "As we can observe we now have the value of the state which is just the average of a return\n",
    "of the state across several episodes. Thus, we have successfully predicted the value function\n",
    "of the given policy using the first-visit MC method.\n",
    "\n",
    "Okay, let's check the value of some states and understand how accurately our value\n",
    "function is estimated according to the given policy. Recall that when we started off, to\n",
    "generate episodes, we used the optimal policy which selects action 0 (stand) when the sum\n",
    "value is greater than 19 and action 1 (hit) when the sum value is less than 19.\n",
    "\n",
    "Let's evaluate the value of the state `(21,9,False)`, as we can observe, our sum of cards\n",
    "value is already 21 and so this is a good state and should have a high value. Let's see what's\n",
    "our estimated value of the state:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.98529412])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df['state']==(21,9,False)]['value'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe above our value of the state is high.\n",
    "Now, let's check the value of the state `(5,8,False)` as we can notice, our sum of cards\n",
    "value is just 5 and even the one dealer's single card has a high value, 8, then, in this case,\n",
    "the value of the state should be less. Let's see what's our estimated value of the state:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.55555556])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df['state']==(5,8,False)]['value'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can notice, the value of the state is less.\n",
    "Thus, we learned how to predict the value function of the given policy using the first-visit\n",
    "MC prediction method. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 04. Monte Carlo Methods/.ipynb_checkpoints/4.13. Implementing On-Policy MC Control-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Implementing On-policy MC control\n",
    "\n",
    "Now, let's learn how to implement the MC control method with epsilon-greedy policy for playing the blackjack game, that is, we will see how can we use the MC control method for\n",
    "finding the optimal policy in the blackjack game:\n",
    "\n",
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import pandas as pd\n",
    "from collections import defaultdict\n",
    "import random"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a blackjack environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('Blackjack-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the dictionary for storing the Q values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "Q = defaultdict(float)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the dictionary for storing the total return of the state-action pair:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_return = defaultdict(float)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the dictionary for storing the count of the number of times a state-action pair is\n",
    "visited:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = defaultdict(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define the epsilon-greedy policy\n",
    "\n",
    "We learned that we select actions based on the epsilon-greedy policy, so we define a\n",
    "function called `epsilon_greedy_policy` which takes the state and Q value as an input\n",
    "and returns the action to be performed in the given state:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def epsilon_greedy_policy(state,Q):\n",
    "    \n",
    "    #set the epsilon value to 0.5\n",
    "    epsilon = 0.5\n",
    "    \n",
    "    #sample a random value from the uniform distribution, if the sampled value is less than\n",
    "    #epsilon then we select a random action else we select the best action which has maximum Q\n",
    "    #value as shown below\n",
    "    \n",
    "    if random.uniform(0,1) < epsilon:\n",
    "        return env.action_space.sample()\n",
    "    else:\n",
    "        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])"
   ]
  },
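  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we keep epsilon fixed at 0.5 throughout training here. A common refinement, not\n",
    "used in this notebook, is to decay epsilon over iterations so that we explore heavily at first\n",
    "and exploit more later. A minimal sketch, where the extra argument `i` and the `min_epsilon`\n",
    "and `decay` parameters are our own hypothetical additions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def decayed_epsilon_greedy_policy(state, Q, i, min_epsilon=0.01, decay=0.9999):\n",
    "    \n",
    "    #anneal epsilon from 0.5 toward min_epsilon as the iteration index i grows\n",
    "    epsilon = max(min_epsilon, 0.5 * (decay ** i))\n",
    "    \n",
    "    #explore with probability epsilon, otherwise act greedily with respect to Q\n",
    "    if random.uniform(0,1) < epsilon:\n",
    "        return env.action_space.sample()\n",
    "    else:\n",
    "        return max(range(env.action_space.n), key=lambda x: Q[(state,x)])\n",
    "\n",
    "#for example, at iteration 0 epsilon is 0.5; by iteration 10000 it has shrunk to about 0.18"
   ]
  },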
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating an episode\n",
    "\n",
    "Now, let's generate an episode using the epsilon-greedy policy. We define a function called\n",
    "`generate_episode` which takes the Q value as an input and returns the episode.\n",
    "\n",
    "First, let's set the number of time steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_timesteps = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_episode(Q):\n",
    "    \n",
    "    #initialize a list for storing the episode\n",
    "    episode = []\n",
    "    \n",
    "    #initialize the state using the reset function\n",
    "    state = env.reset()\n",
    "    \n",
    "    #then for each time step\n",
    "    for t in range(num_timesteps):\n",
    "        \n",
    "        #select the action according to the epsilon-greedy policy\n",
    "        action = epsilon_greedy_policy(state,Q)\n",
    "        \n",
    "        #perform the selected action and store the next state information\n",
    "        next_state, reward, done, info = env.step(action)\n",
    "        \n",
    "        #store the state, action, reward in the episode list\n",
    "        episode.append((state, action, reward))\n",
    "        \n",
    "        #if the next state is a final state then break the loop else update the next state to the current\n",
    "        #state\n",
    "        if done:\n",
    "            break\n",
    "            \n",
    "        state = next_state\n",
    "\n",
    "    return episode"
   ]
  },
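  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, we can call the episode generator once before training; since `Q` is\n",
    "still empty and actions are epsilon-greedy, the exact episode will vary from run to run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#generate a single episode with the (so far untrained) Q function\n",
    "generate_episode(Q)"
   ]
  },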
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing the optimal policy\n",
    "\n",
    "Now, let's learn how to compute the optimal policy. First, let's set the number of iterations, that is, the number of episodes, we want to generate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_iterations = 50000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We learned that in the on-policy control method, we will not be given any policy as an\n",
    "input. So, we initialize a random policy in the first iteration and improve the policy\n",
    "iteratively by computing Q value. Since we extract the policy from the Q function, we don't\n",
    "have to explicitly define the policy. As the Q value improves the policy also improves\n",
    "implicitly. That is, in the first iteration we generate episode by extracting the policy\n",
    "(epsilon-greedy) from the initialized Q function. Over a series of iterations, we will find the\n",
    "optimal Q function and hence we also find the optimal policy. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "#for each iteration\n",
    "for i in range(num_iterations):\n",
    "    \n",
    "    #so, here we pass our initialized Q function to generate an episode\n",
    "    episode = generate_episode(Q)\n",
    "    \n",
    "    #get all the state-action pairs in the episode\n",
    "    all_state_action_pairs = [(s, a) for (s,a,r) in episode]\n",
    "    \n",
    "    #store all the rewards obtained in the episode in the rewards list\n",
    "    rewards = [r for (s,a,r) in episode]\n",
    "\n",
    "    #for each step in the episode \n",
    "    for t, (state, action, reward) in enumerate(episode):\n",
    "\n",
    "        #if the state-action pair is occurring for the first time in the episode\n",
    "        if not (state, action) in all_state_action_pairs[0:t]:\n",
    "            \n",
    "            #compute the return R of the state-action pair as the sum of rewards\n",
    "            R = sum(rewards[t:])\n",
    "            \n",
    "            #update total return of the state-action pair\n",
    "            total_return[(state,action)] = total_return[(state,action)] + R\n",
    "            \n",
    "            #update the number of times the state-action pair is visited\n",
    "            N[(state, action)] += 1\n",
    "\n",
    "            #compute the Q value by just taking the average\n",
    "            Q[(state,action)] = total_return[(state, action)] / N[(state, action)]"
   ]
  },
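  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the average `total_return / N` can equivalently be maintained incrementally,\n",
    "without storing the total return at all. A minimal standalone sketch of this equivalence,\n",
    "using a made-up list of returns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#incremental mean: after each new return R, update Q <- Q + (R - Q) / n\n",
    "returns = [1, -1, 0, 1]\n",
    "Q_est, n = 0.0, 0\n",
    "for R in returns:\n",
    "    n += 1\n",
    "    Q_est += (R - Q_est) / n\n",
    "\n",
    "#matches the batch average sum(returns) / len(returns) = 0.25\n",
    "print(Q_est)"
   ]
  },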
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus on every iteration, the Q value improves and so does policy.\n",
    "After all the iterations, we can have a look at the Q value of each state-action in the pandas\n",
    "data frame for more clarity.\n",
    "\n",
    "First, let's convert the Q value dictionary to a pandas data\n",
    "frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(Q.items(),columns=['state_action pair','value'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first few rows of the data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>state_action pair</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>((14, 10, False), 0)</td>\n",
       "      <td>-0.641944</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>((14, 10, False), 1)</td>\n",
       "      <td>-0.617698</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>((11, 10, False), 1)</td>\n",
       "      <td>-0.170015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>((12, 3, False), 0)</td>\n",
       "      <td>-0.180328</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>((12, 3, False), 1)</td>\n",
       "      <td>-0.320388</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>((13, 1, False), 0)</td>\n",
       "      <td>-0.752381</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>((11, 6, False), 1)</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>((17, 6, False), 0)</td>\n",
       "      <td>-0.118644</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>((10, 9, False), 0)</td>\n",
       "      <td>-0.714286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>((10, 9, False), 1)</td>\n",
       "      <td>-0.041322</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>10</td>\n",
       "      <td>((14, 4, False), 0)</td>\n",
       "      <td>-0.148289</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       state_action pair     value\n",
       "0   ((14, 10, False), 0) -0.641944\n",
       "1   ((14, 10, False), 1) -0.617698\n",
       "2   ((11, 10, False), 1) -0.170015\n",
       "3    ((12, 3, False), 0) -0.180328\n",
       "4    ((12, 3, False), 1) -0.320388\n",
       "5    ((13, 1, False), 0) -0.752381\n",
       "6    ((11, 6, False), 1)  0.000000\n",
       "7    ((17, 6, False), 0) -0.118644\n",
       "8    ((10, 9, False), 0) -0.714286\n",
       "9    ((10, 9, False), 1) -0.041322\n",
       "10   ((14, 4, False), 0) -0.148289"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(11)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, we have the Q values for all the state-action pairs. Now we can extract\n",
    "the policy by selecting the action which has maximum Q value in each state. "
   ]
  },
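  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of that extraction step; `policy` below is a hypothetical dictionary mapping\n",
    "each state seen during training to its greedy action:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "policy = {}\n",
    "\n",
    "#for every state that appears in some (state, action) key of Q\n",
    "for state in set(s for (s, a) in Q.keys()):\n",
    "    \n",
    "    #pick the action with the maximum Q value in that state\n",
    "    policy[state] = max(range(env.action_space.n), key=lambda a: Q[(state, a)])"
   ]
  },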
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To learn more how to select action based on this Q value, check the book under the section, implementing on-policy control."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 04. Monte Carlo Methods/4.05. Every-visit MC Prediction with Blackjack Game.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Every-visit MC prediction with blackjack game\n",
    "\n",
    "To understand this section clearly, you can recap every visit Monte Carlo method we\n",
    "learned earlier. Let's now understand how to implement the every-visit MC prediction with\n",
    "the blackjack game step by step:\n",
    "\n",
    "Import the necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import pandas as pd\n",
    "from collections import defaultdict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a blackjack environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('Blackjack-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining a policy\n",
    "\n",
    "We learned that in the prediction method, we will be given an input policy and we predict\n",
    "the value function of the given input policy. So, now, we first define a policy function\n",
    "which acts as an input policy. That is, we define the input policy whose value function will\n",
    "be predicted in the upcoming steps.\n",
    "\n",
    "As shown below, our policy function takes the state as an input and if the `state[0]`, sum of\n",
    "our cards value is greater than `19`, then it will return action 0 (stand) else it will return\n",
    "action 1 (hit):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy(state):\n",
    "    return 0 if state[0] > 19 else 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We defined an optimal policy, that is, it makes more sense to perform an action 0 (stand)\n",
    "when our sum value is already greater than 19. That is, when the sum value is greater than\n",
    "19 we don't have to perform 1 (hit) action and receive a new card which may cause us to\n",
    "lose the game or burst.\n",
    "\n",
    "For example, let's generate an initial state by resetting the environment as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(20, 7, False)\n"
     ]
    }
   ],
   "source": [
    "state = env.reset()\n",
    "print(state)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can notice, `state[0] = 20`, that is our sum of cards value is 20, so in this case, our\n",
    "policy will return the action 0 (stand) as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n"
     ]
    }
   ],
   "source": [
    "print(policy(state))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, that we have defined the policy, in the next section, we will predict the value\n",
    "function (state values) of this policy. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating an episode\n",
    "Next, we generate an episode using the given policy, so, we, define a function\n",
    "called `generate_episode` which takes the policy as an input and generates the episode\n",
    "using the given policy.\n",
    "\n",
    "First, let's set the number of time steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_timestep = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_episode(policy):\n",
    "    \n",
    "    #let's define a list called episode for storing the episode\n",
    "    episode = []\n",
    "    \n",
    "    #initialize the state by resetting the environment\n",
    "    state = env.reset()\n",
    "    \n",
    "    #then for each time step\n",
    "    for i in range(num_timestep):\n",
    "        \n",
    "        #select the action according to the given policy\n",
    "        action = policy(state)\n",
    "        \n",
    "        #perform the action and store the next state information\n",
    "        next_state, reward, done, info = env.step(action)\n",
    "        \n",
    "        #store the state, action, reward into our episode list\n",
    "        episode.append((state, action, reward))\n",
    "        \n",
    "        #If the next state is a final state then break the loop else update the next state to the current state\n",
    "        if done:\n",
    "            break\n",
    "            \n",
    "        state = next_state\n",
    "\n",
    "    return episode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at how the output of our `generate_episode` function looks like. Note\n",
    "that we generate episode using the policy we defined earlier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[((12, 10, False), 1, 0), ((15, 10, False), 1, -1)]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_episode(policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe our output is in the form of `[(state, action, reward)]`. As shown above,\n",
    "we have two states in our episode. We performed action 1 (hit) in the state `(10, 2,\n",
    "False)` and received a 0 reward and the action 0 (stand) in the state `(20, 2, False)` and\n",
    "received 1.0 reward.\n",
    "\n",
    "Now that we have learned how to generate an episode using the given policy, next, we will\n",
    "look at how to compute the value of the state (value function) using every visit-MC\n",
    "method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing the value function\n",
    "\n",
    "We learned that in order to predict the value function, we generate several episodes using\n",
    "the given policy and compute the value of the state as an average return across several\n",
    "episodes. Let's see how to do implement that.\n",
    "\n",
    "First, we define the `total_return` and `N` as a dictionary for storing the total return and the\n",
    "number of times the state is visited across episodes respectively. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_return = defaultdict(float)\n",
    "N = defaultdict(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the number of iterations, that is, the number of episodes, we want to generate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_iterations = 500000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "#then, for every iteration\n",
    "for i in range(num_iterations):\n",
    "    \n",
    "    #generate the episode using the given policy, that is, generate an episode using the policy\n",
    "    #function we defined earlier\n",
    "    episode = generate_episode(policy)\n",
    "    \n",
    "    #store all the states, actions, rewards obtained from the episode\n",
    "    states, actions, rewards = zip(*episode)\n",
    "    \n",
    "    #then for each step in the episode \n",
    "    for t, state in enumerate(states):\n",
    "        \n",
    "            #compute the return R of the state as the sum of reward\n",
    "            R = (sum(rewards[t:]))\n",
    "            \n",
    "            #update the total_return of the state\n",
    "            total_return[state] =  total_return[state] + R\n",
    "            \n",
    "            #update the number of times the state is visited in the episode\n",
    "            N[state] =  N[state] + 1"
   ]
  },
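  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that, unlike in first-visit MC, there is no check here for whether the state was already\n",
    "visited earlier in the episode: every occurrence contributes a return. A minimal standalone\n",
    "sketch with a made-up toy episode that visits the same state twice:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#toy episode: state 's' occurs at steps 0 and 2\n",
    "toy_states = ['s', 'x', 's']\n",
    "toy_rewards = [1, 0, 2]\n",
    "\n",
    "#every-visit: both occurrences of 's' contribute a return, giving [3, 2]\n",
    "print([sum(toy_rewards[t:]) for t, s in enumerate(toy_states) if s == 's'])\n",
    "\n",
    "#first-visit: only the first occurrence of 's' contributes, giving [3]\n",
    "print([sum(toy_rewards[t:]) for t, s in enumerate(toy_states)\n",
    "       if s == 's' and s not in toy_states[:t]])"
   ]
  },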
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After computing the `total_return` and `N` We can just convert them into a pandas data\n",
    "frame for a better understanding. [Note that this is just to give a clear understanding of the\n",
    "algorithm, we don't necessarily have to convert to the pandas data frame, we can also\n",
    "implement this efficiently just using the dictionary]\n",
    "\n",
    "\n",
    "Convert `total_return` dictionary to a data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_return = pd.DataFrame(total_return.items(),columns=['state', 'total_return'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert the counter `N` dictionary to a data frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = pd.DataFrame(N.items(),columns=['state', 'N'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Merge the two data frames on states:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.merge(total_return, N, on=\"state\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at the first few rows of the data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>state</th>\n",
       "      <th>total_return</th>\n",
       "      <th>N</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>(7, 7, False)</td>\n",
       "      <td>-4.0</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>(11, 7, False)</td>\n",
       "      <td>19.0</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>(16, 7, False)</td>\n",
       "      <td>-38.0</td>\n",
       "      <td>104</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>(19, 7, False)</td>\n",
       "      <td>55.0</td>\n",
       "      <td>113</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>(20, 8, False)</td>\n",
       "      <td>96.0</td>\n",
       "      <td>129</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>(20, 2, False)</td>\n",
       "      <td>94.0</td>\n",
       "      <td>142</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>(15, 5, False)</td>\n",
       "      <td>-42.0</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>(20, 5, False)</td>\n",
       "      <td>62.0</td>\n",
       "      <td>115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>(12, 3, False)</td>\n",
       "      <td>-55.0</td>\n",
       "      <td>91</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>(15, 3, False)</td>\n",
       "      <td>-36.0</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            state  total_return    N\n",
       "0   (7, 7, False)          -4.0   16\n",
       "1  (11, 7, False)          19.0   43\n",
       "2  (16, 7, False)         -38.0  104\n",
       "3  (19, 7, False)          55.0  113\n",
       "4  (20, 8, False)          96.0  129\n",
       "5  (20, 2, False)          94.0  142\n",
       "6  (15, 5, False)         -42.0   93\n",
       "7  (20, 5, False)          62.0  115\n",
       "8  (12, 3, False)         -55.0   91\n",
       "9  (15, 3, False)         -36.0   96"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe from above, we have the total return and\n",
    "the number of times the state is visited.\n",
    "\n",
    "Next, we can compute the value of the state as the average return, thus, we can write:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['value'] = df['total_return']/df['N']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first few rows of the data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>state</th>\n",
       "      <th>total_return</th>\n",
       "      <th>N</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>(7, 7, False)</td>\n",
       "    
  },
  {
    "path": "07. Deep learning foundations/.ipynb_checkpoints/7.05 Building Neural Network from scratch-checkpoint.ipynb",
    "chars": 22669,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Building Neural Network from Scra"
  },
  {
    "path": "07. Deep learning foundations/7.05 Building Neural Network from scratch.ipynb",
    "chars": 22669,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Building Neural Network from Scra"
  },
  {
    "path": "07. Deep learning foundations/README.md",
    "chars": 842,
    "preview": "# [Chapter 7. Deep Learning Foundations](#)\n\n* 7.1. Biological and artifical neurons\n* 7.2. ANN and its layers \n* 7.3. E"
  },
  {
    "path": "08. A primer on TensorFlow/.ipynb_checkpoints/8.05 Handwritten digits classification using TensorFlow-checkpoint.ipynb",
    "chars": 23951,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Handwritten digits classification"
  },
  {
    "path": "08. A primer on TensorFlow/.ipynb_checkpoints/8.10 MNIST digits classification in TensorFlow 2.0-checkpoint.ipynb",
    "chars": 6836,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# MNIST digit classification in Ten"
  },
  {
    "path": "08. A primer on TensorFlow/8.05 Handwritten digits classification using TensorFlow.ipynb",
    "chars": 23951,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Handwritten digits classification"
  },
  {
    "path": "08. A primer on TensorFlow/8.08 Math operations in TensorFlow.ipynb",
    "chars": 21364,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Math operations in TensorFlow\\n\","
  },
  {
    "path": "08. A primer on TensorFlow/8.10 MNIST digits classification in TensorFlow 2.0.ipynb",
    "chars": 6836,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# MNIST digit classification in Ten"
  },
  {
    "path": "08. A primer on TensorFlow/README.md",
    "chars": 802,
    "preview": "\n\n# [Chapter 8. Getting to Know TensorFlow](#)\n\n* 8.1. What is TensorFlow?\n* 8.2. Understanding Computational Graphs and"
  },
  {
    "path": "09.  Deep Q Network and its Variants/.ipynb_checkpoints/7.03. Playing Atari Games using DQN-Copy1-checkpoint.ipynb",
    "chars": 14609,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\""
  },
  {
    "path": "09.  Deep Q Network and its Variants/.ipynb_checkpoints/7.03. Playing Atari Games using DQN-checkpoint.ipynb",
    "chars": 15570,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using DQN \\n\""
  },
  {
    "path": "09.  Deep Q Network and its Variants/.ipynb_checkpoints/9.03. Playing Atari Games using DQN-checkpoint.ipynb",
    "chars": 15913,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using DQN \\n\""
  },
  {
    "path": "09.  Deep Q Network and its Variants/9.03. Playing Atari Games using DQN.ipynb",
    "chars": 15913,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using DQN \\n\""
  },
  {
    "path": "09.  Deep Q Network and its Variants/READEME.md",
    "chars": 675,
    "preview": "# 9. Deep Q Network and its Variants\n\n* 9.1. What is Deep Q Network?\n* 9.2. Understanding DQN\n   * 9.2.1. Replay Buffer\n"
  },
  {
    "path": "10. Policy Gradient Method/.ipynb_checkpoints/10.01. Why Policy based Methods-checkpoint.ipynb",
    "chars": 6984,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Why policy-based methods?\\n\",\n  "
  },
  {
    "path": "10. Policy Gradient Method/.ipynb_checkpoints/10.02. Policy Gradient Intuition-checkpoint.ipynb",
    "chars": 4603,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Policy gradient intuition\\n\",\n   "
  },
  {
    "path": "10. Policy Gradient Method/.ipynb_checkpoints/10.07. Cart Pole Balancing with Policy Gradient-checkpoint.ipynb",
    "chars": 12439,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Cart pole balancing with policy g"
  },
  {
    "path": "10. Policy Gradient Method/.ipynb_checkpoints/8.07. Cart Pole Balancing with Policy Gradient-checkpoint.ipynb",
    "chars": 12968,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Cart pole balancing with policy g"
  },
  {
    "path": "10. Policy Gradient Method/10.07. Cart Pole Balancing with Policy Gradient.ipynb",
    "chars": 12439,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Cart pole balancing with policy g"
  },
  {
    "path": "10. Policy Gradient Method/README.md",
    "chars": 480,
    "preview": "# 10. Policy Gradient Method\n* 10.1. Why Policy Based Methods?\n* 10.2. Policy Gradient Intuition\n* 10.3. Understanding t"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/.ipynb_checkpoints/11.01. Overview of actor critic method-checkpoint.ipynb",
    "chars": 3482,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Overview of actor critic method\\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/.ipynb_checkpoints/11.05. Mountain Car Climbing using A3C-checkpoint.ipynb",
    "chars": 25577,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Mountain car climbing using A3C\\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/.ipynb_checkpoints/9.05. Mountain Car Climbing using A3C-checkpoint.ipynb",
    "chars": 27575,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Mountain car climbing using A3C\\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/11.05. Mountain Car Climbing using A3C.ipynb",
    "chars": 25577,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Mountain car climbing using A3C\\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/README.md",
    "chars": 364,
    "preview": "# 11. Actor Critic Methods - A2C and A3C\n* 11.1. Overview of Actor Critic Method\n* 11.2. Understanding the Actor Critic "
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/10.02. Swinging Up the Pendulum using DDPG -checkpoint.ipynb",
    "chars": 19730,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DD"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/12.01. DDPG-checkpoint.ipynb",
    "chars": 6416,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Deep deterministic policy gradien"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/12.02. Swinging Up the Pendulum using DDPG -checkpoint.ipynb",
    "chars": 19118,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DD"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/12.03. Twin delayed DDPG-checkpoint.ipynb",
    "chars": 3910,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Twin delayed DDPG\\n\",\n    \"\\n\",\n "
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/Swinging up the pendulum using DDPG -checkpoint.ipynb",
    "chars": 18451,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DD"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/12.05. Swinging Up the Pendulum using DDPG .ipynb",
    "chars": 19118,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DD"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/README.md",
    "chars": 878,
    "preview": "# 12. Learning DDPG, TD3 and SAC\n* 12.1. Deep Deterministic Policy Gradient\n   * 12.1.1. An Overview of DDPG\n* 12.2. Com"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/ Implementing PPO-clipped method-checkpoint.ipynb",
    "chars": 15287,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Impleme"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/11.09. Implementing PPO-Clipped Method-checkpoint.ipynb",
    "chars": 17519,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Impleme"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/13.01. Trust Region Policy Optimization-checkpoint.ipynb",
    "chars": 5034,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Trust Region Policy Optimization\\"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/13.09. Implementing PPO-Clipped Method-checkpoint.ipynb",
    "chars": 15945,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Impleme"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/13.09. Implementing PPO-Clipped Method.ipynb",
    "chars": 15945,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Impleme"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/README.md",
    "chars": 1159,
    "preview": "# 13. TRPO, PPO and ACKTR Methods\n* 13.1 Trust Region Policy Optimization\n* 13.2. Math Essentials\n   * 13.2.1. Taylor se"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/12.03. Playing Atari games using Categorical DQN-checkpoint.ipynb",
    "chars": 20580,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categor"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/14.03. Playing Atari games using Categorical DQN-checkpoint.ipynb",
    "chars": 19658,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categor"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/Playing Atari games using Categorical DQN-checkpoint.ipynb",
    "chars": 19345,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categor"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/c51 done-Copy1-checkpoint.ipynb",
    "chars": 16792,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": "
  },
  {
    "path": "14. Distributional Reinforcement Learning/14.03. Playing Atari games using Categorical DQN.ipynb",
    "chars": 19658,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categor"
  },
  {
    "path": "14. Distributional Reinforcement Learning/README.md",
    "chars": 755,
    "preview": "# 14. Distributional Reinforcement Learning\n* 14.1. Why Distributional Reinforcement Learning?\n* 14.2. Categorical DQN\n "
  },
  {
    "path": "15. Imitation Learning and Inverse RL/.ipynb_checkpoints/13.01. Supervised Imitation Learning -checkpoint.ipynb",
    "chars": 3309,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Supervised Imitation Learning \\n\""
  },
  {
    "path": "15. Imitation Learning and Inverse RL/.ipynb_checkpoints/13.02. DAgger-checkpoint.ipynb",
    "chars": 3266,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# DAgger\\n\",\n    \"\\n\",\n    \"\\n\",\n  "
  },
  {
    "path": "15. Imitation Learning and Inverse RL/README.md",
    "chars": 582,
    "preview": "# 15. Imitation Learning and Inverse RL\n* 15.1. Supervised Imitation Learning\n* 15.2. DAgger\n   * 15.2. Understanding DA"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.01. Creating our First Agent with Baseline-checkpoint.ipynb",
    "chars": 5155,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating our first agent with ba"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.04. Playing Atari games with DQN and its variants-checkpoint.ipynb",
    "chars": 3416,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN\\n\",\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.05. Implementing DQN variants-checkpoint.ipynb",
    "chars": 4045,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n   "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.06. Lunar Lander using A2C-checkpoint.ipynb",
    "chars": 4057,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.07. Creating a custom network-checkpoint.ipynb",
    "chars": 4567,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n  "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.08. Swinging up a pendulum using DDPG-checkpoint.ipynb",
    "chars": 5977,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDP"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.01. Creating our First Agent with Stable Baseline-checkpoint.ipynb",
    "chars": 5457,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Creating our first agent with Sta"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.04. Playing Atari games with DQN and its variants-checkpoint.ipynb",
    "chars": 3420,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN\\n\",\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.05. Implementing DQN variants-checkpoint.ipynb",
    "chars": 4049,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n   "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.06. Lunar Lander using A2C-checkpoint.ipynb",
    "chars": 4056,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.07. Creating a custom network-checkpoint.ipynb",
    "chars": 4570,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n  "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.08. Swinging up a pendulum using DDPG-checkpoint.ipynb",
    "chars": 5977,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDP"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.09. Training an agent to walk using TRPO-checkpoint.ipynb",
    "chars": 5776,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training an agent to walk using T"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.10. Training cheetah bot to run using PPO-checkpoint.ipynb",
    "chars": 5308,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training cheetah bot to run using"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Creating a custom network-checkpoint.ipynb",
    "chars": 4300,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n  "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Implementing DQN variants-checkpoint.ipynb",
    "chars": 3846,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n   "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Lunar Lander using A2C-checkpoint.ipynb",
    "chars": 3857,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Playing Atari games with DQN and its variants-checkpoint.ipynb",
    "chars": 3563,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN and "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Swinging up a pendulum using DDPG-checkpoint.ipynb",
    "chars": 5143,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDP"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Training an agent to walk using TRPO-checkpoint.ipynb",
    "chars": 5780,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training an agent to walk using T"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Training cheetah bot to run using PPO-checkpoint.ipynb",
    "chars": 5297,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training cheetah bot to run using"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Untitled-checkpoint.ipynb",
    "chars": 72,
    "preview": "{\n \"cells\": [],\n \"metadata\": {},\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.01. Creating our First Agent with Stable Baseline.ipynb",
    "chars": 5457,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Creating our first agent with Sta"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.04. Playing Atari games with DQN and its variants.ipynb",
    "chars": 3420,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN\\n\",\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.05. Implementing DQN variants.ipynb",
    "chars": 4049,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n   "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.06. Lunar Lander using A2C.ipynb",
    "chars": 4056,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.07. Creating a custom network.ipynb",
    "chars": 4570,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n  "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.08. Swinging up a pendulum using DDPG.ipynb",
    "chars": 5977,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDP"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.09. Training an agent to walk using TRPO.ipynb",
    "chars": 5776,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training an agent to walk using T"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.10. Training cheetah bot to run using PPO.ipynb",
    "chars": 5308,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training cheetah bot to run using"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/README.md",
    "chars": 961,
    "preview": "\n# 16. Deep Reinforcement Learning with Stable Baselines\n\n\n* 16.1. Creating our First Agent with Baseline\n   * 16.1.1. E"
  },
  {
    "path": "17. Reinforcement Learning Frontiers/.ipynb_checkpoints/15.01. Meta Reinforcement Learning-checkpoint.ipynb",
    "chars": 2647,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Meta Reinforcement Learning \\n\",\n"
  },
  {
    "path": "17. Reinforcement Learning Frontiers/README.md",
    "chars": 461,
    "preview": "# 17. Reinforcement Learning Frontiers\n* 17.1. Meta Reinforcement Learning\n* 17.2. Model Agnostic Meta Learning\n* 17.3. "
  },
  {
    "path": "README.md",
    "chars": 4286,
    "preview": "\n\n\n# [Deep Reinforcement Learning With Python](https://www.amazon.com/gp/product/B08HSHV72N/ref=dbs_a_def_rwt_bibl_vppi_"
  }
]

// ... and 3 more files

