Full Code of sudharsan13296/Deep-Reinforcement-Learning-With-Python for AI

Repository: sudharsan13296/Deep-Reinforcement-Learning-With-Python
Branch: master
Commit: c2a179457f96
Files: 141
Total size: 1.1 MB

Directory structure:
gitextract_jqc__b22/

├── 01. Fundamentals of Reinforcement Learning/
│   ├── .ipynb_checkpoints/
│   │   ├── 1.01. Basic Idea of Reinforcement Learning -checkpoint.ipynb
│   │   ├── 1.02. Key Elements of Reinforcement Learning -checkpoint.ipynb
│   │   ├── 1.03. Reinforcement Learning Algorithm-checkpoint.ipynb
│   │   ├── 1.04. RL agent in the Grid World -checkpoint.ipynb
│   │   ├── 1.05. How RL differs from other ML paradigms?-checkpoint.ipynb
│   │   ├── 1.06. Markov Decision Processes-checkpoint.ipynb
│   │   └── 1.07. Action space, Policy, Episode and Horizon-checkpoint.ipynb
│   ├── 1.01. Key Elements of Reinforcement Learning .ipynb
│   ├── 1.02. Basic Idea of Reinforcement Learning.ipynb
│   ├── 1.03. Reinforcement Learning Algorithm.ipynb
│   ├── 1.04. RL agent in the Grid World .ipynb
│   ├── 1.05. How RL differs from other ML paradigms?.ipynb
│   ├── 1.06. Markov Decision Processes.ipynb
│   ├── 1.07. Action space, Policy, Episode and Horizon.ipynb
│   ├── 1.08.  Return, Discount Factor and Math Essentials.ipynb
│   ├── 1.09 Value function and Q function.ipynb
│   ├── 1.10. Model-Based and Model-Free Learning .ipynb
│   ├── 1.11. Different Types of Environments.ipynb
│   ├── 1.12. Applications of Reinforcement Learning.ipynb
│   └── 1.13. Reinforcement Learning Glossary.ipynb
├── 02. A Guide to the Gym Toolkit/
│   ├── 2.02.  Creating our First Gym Environment.ipynb
│   ├── 2.05. Cart Pole Balancing with Random Policy.ipynb
│   └── README.md
├── 03. Bellman Equation and Dynamic Programming/
│   ├── .ipynb_checkpoints/
│   │   ├── 3.06. Solving the Frozen Lake Problem with Value Iteration-checkpoint.ipynb
│   │   └── 3.08. Solving the Frozen Lake Problem with Policy Iteration-checkpoint.ipynb
│   ├── 3.06. Solving the Frozen Lake Problem with Value Iteration.ipynb
│   ├── 3.08. Solving the Frozen Lake Problem with Policy Iteration.ipynb
│   └── README.md
├── 04. Monte Carlo Methods/
│   ├── .ipynb_checkpoints/
│   │   ├── 4.01. Understanding the Monte Carlo Method-checkpoint.ipynb
│   │   ├── 4.02.  Prediction and control tasks-checkpoint.ipynb
│   │   ├── 4.05. Every-visit MC Prediction with Blackjack Game-checkpoint.ipynb
│   │   ├── 4.06. First-visit MC Prediction with Blackjack Game-checkpoint.ipynb
│   │   └── 4.13. Implementing On-Policy MC Control-checkpoint.ipynb
│   ├── 4.13. Implementing On-Policy MC Control.ipynb
│   └── README.md
├── 05. Understanding Temporal Difference Learning/
│   ├── .ipynb_checkpoints/
│   │   ├── 5.03. Predicting the Value of States in a Frozen Lake Environment-checkpoint.ipynb
│   │   ├── 5.06. Computing Optimal Policy using SARSA-checkpoint.ipynb
│   │   └── 5.08. Computing the Optimal Policy using Q Learning-checkpoint.ipynb
│   ├── 5.03. Predicting the Value of States in a Frozen Lake Environment.ipynb
│   ├── 5.06. Computing Optimal Policy using SARSA.ipynb
│   ├── 5.08. Computing the Optimal Policy using Q Learning.ipynb
│   └── README.md
├── 06. Case Study: The MAB Problem/
│   ├── .ipynb_checkpoints/
│   │   ├── 6.01 .The MAB Problem-checkpoint.ipynb
│   │   ├── 6.04. Implementing epsilon-greedy -checkpoint.ipynb
│   │   ├── 6.06. Implementing Softmax Exploration-checkpoint.ipynb
│   │   ├── 6.08. Implementing UCB-checkpoint.ipynb
│   │   ├── 6.1-checkpoint.ipynb
│   │   ├── 6.10. Implementing Thompson Sampling-checkpoint.ipynb
│   │   └── 6.12. Finding the Best Advertisement Banner using Bandits-checkpoint.ipynb
│   ├── 6.01 .The MAB Problem.ipynb
│   ├── 6.03. Epsilon-Greedy.ipynb
│   ├── 6.04. Implementing epsilon-greedy .ipynb
│   ├── 6.06. Implementing Softmax Exploration.ipynb
│   ├── 6.08. Implementing UCB.ipynb
│   ├── 6.10. Implementing Thompson Sampling.ipynb
│   ├── 6.12. Finding the Best Advertisement Banner using Bandits.ipynb
│   └── README.md
├── 07. Deep learning foundations/
│   ├── .ipynb_checkpoints/
│   │   └── 7.05 Building Neural Network from scratch-checkpoint.ipynb
│   ├── 7.05 Building Neural Network from scratch.ipynb
│   └── README.md
├── 08. A primer on TensorFlow/
│   ├── .ipynb_checkpoints/
│   │   ├── 8.05 Handwritten digits classification using TensorFlow-checkpoint.ipynb
│   │   └── 8.10 MNIST digits classification in TensorFlow 2.0-checkpoint.ipynb
│   ├── 8.05 Handwritten digits classification using TensorFlow.ipynb
│   ├── 8.08 Math operations in TensorFlow.ipynb
│   ├── 8.10 MNIST digits classification in TensorFlow 2.0.ipynb
│   ├── README.md
│   └── graphs/
│       └── events.out.tfevents.1559122983.ml-dev
├── 09.  Deep Q Network and its Variants/
│   ├── .ipynb_checkpoints/
│   │   ├── 7.03. Playing Atari Games using DQN-Copy1-checkpoint.ipynb
│   │   ├── 7.03. Playing Atari Games using DQN-checkpoint.ipynb
│   │   └── 9.03. Playing Atari Games using DQN-checkpoint.ipynb
│   ├── 9.03. Playing Atari Games using DQN.ipynb
│   └── READEME.md
├── 10. Policy Gradient Method/
│   ├── .ipynb_checkpoints/
│   │   ├── 10.01. Why Policy based Methods-checkpoint.ipynb
│   │   ├── 10.02. Policy Gradient Intuition-checkpoint.ipynb
│   │   ├── 10.07. Cart Pole Balancing with Policy Gradient-checkpoint.ipynb
│   │   └── 8.07. Cart Pole Balancing with Policy Gradient-checkpoint.ipynb
│   ├── 10.07. Cart Pole Balancing with Policy Gradient.ipynb
│   └── README.md
├── 11. Actor Critic Methods - A2C and A3C/
│   ├── .ipynb_checkpoints/
│   │   ├── 11.01. Overview of actor critic method-checkpoint.ipynb
│   │   ├── 11.05. Mountain Car Climbing using A3C-checkpoint.ipynb
│   │   └── 9.05. Mountain Car Climbing using A3C-checkpoint.ipynb
│   ├── 11.05. Mountain Car Climbing using A3C.ipynb
│   ├── README.md
│   └── logs/
│       └── events.out.tfevents.1596718791.Sudharsan
├── 12. Learning DDPG, TD3 and SAC/
│   ├── .ipynb_checkpoints/
│   │   ├── 10.02. Swinging Up the Pendulum using DDPG -checkpoint.ipynb
│   │   ├── 12.01. DDPG-checkpoint.ipynb
│   │   ├── 12.02. Swinging Up the Pendulum using DDPG -checkpoint.ipynb
│   │   ├── 12.03. Twin delayed DDPG-checkpoint.ipynb
│   │   └── Swinging up the pendulum using DDPG -checkpoint.ipynb
│   ├── 12.05. Swinging Up the Pendulum using DDPG .ipynb
│   └── README.md
├── 13. TRPO, PPO and ACKTR Methods/
│   ├── .ipynb_checkpoints/
│   │   ├──  Implementing PPO-clipped method-checkpoint.ipynb
│   │   ├── 11.09. Implementing PPO-Clipped Method-checkpoint.ipynb
│   │   ├── 13.01. Trust Region Policy Optimization-checkpoint.ipynb
│   │   └── 13.09. Implementing PPO-Clipped Method-checkpoint.ipynb
│   ├── 13.09. Implementing PPO-Clipped Method.ipynb
│   └── README.md
├── 14. Distributional Reinforcement Learning/
│   ├── .ipynb_checkpoints/
│   │   ├── 12.03. Playing Atari games using Categorical DQN-checkpoint.ipynb
│   │   ├── 14.03. Playing Atari games using Categorical DQN-checkpoint.ipynb
│   │   ├── Playing Atari games using Categorical DQN-checkpoint.ipynb
│   │   └── c51 done-Copy1-checkpoint.ipynb
│   ├── 14.03. Playing Atari games using Categorical DQN.ipynb
│   └── README.md
├── 15. Imitation Learning and Inverse RL/
│   ├── .ipynb_checkpoints/
│   │   ├── 13.01. Supervised Imitation Learning -checkpoint.ipynb
│   │   └── 13.02. DAgger-checkpoint.ipynb
│   ├── 15.02. DAgger.ipynb
│   └── README.md
├── 16. Deep Reinforcement Learning with Stable Baselines/
│   ├── .ipynb_checkpoints/
│   │   ├── 14.01. Creating our First Agent with Baseline-checkpoint.ipynb
│   │   ├── 14.04. Playing Atari games with DQN and its variants-checkpoint.ipynb
│   │   ├── 14.05. Implementing DQN variants-checkpoint.ipynb
│   │   ├── 14.06. Lunar Lander using A2C-checkpoint.ipynb
│   │   ├── 14.07. Creating a custom network-checkpoint.ipynb
│   │   ├── 14.08. Swinging up a pendulum using DDPG-checkpoint.ipynb
│   │   ├── 16.01. Creating our First Agent with Stable Baseline-checkpoint.ipynb
│   │   ├── 16.04. Playing Atari games with DQN and its variants-checkpoint.ipynb
│   │   ├── 16.05. Implementing DQN variants-checkpoint.ipynb
│   │   ├── 16.06. Lunar Lander using A2C-checkpoint.ipynb
│   │   ├── 16.07. Creating a custom network-checkpoint.ipynb
│   │   ├── 16.08. Swinging up a pendulum using DDPG-checkpoint.ipynb
│   │   ├── 16.09. Training an agent to walk using TRPO-checkpoint.ipynb
│   │   ├── 16.10. Training cheetah bot to run using PPO-checkpoint.ipynb
│   │   ├── Creating a custom network-checkpoint.ipynb
│   │   ├── Implementing DQN variants-checkpoint.ipynb
│   │   ├── Lunar Lander using A2C-checkpoint.ipynb
│   │   ├── Playing Atari games with DQN and its variants-checkpoint.ipynb
│   │   ├── Swinging up a pendulum using DDPG-checkpoint.ipynb
│   │   ├── Training an agent to walk using TRPO-checkpoint.ipynb
│   │   ├── Training cheetah bot to run using PPO-checkpoint.ipynb
│   │   └── Untitled-checkpoint.ipynb
│   ├── 16.04. Playing Atari games with DQN and its variants.ipynb
│   ├── 16.05. Implementing DQN variants.ipynb
│   ├── 16.06. Lunar Lander using A2C.ipynb
│   ├── 16.07. Creating a custom network.ipynb
│   ├── 16.08. Swinging up a pendulum using DDPG.ipynb
│   ├── 16.09. Training an agent to walk using TRPO.ipynb
│   ├── 16.10. Training cheetah bot to run using PPO.ipynb
│   ├── README.md
│   └── logs/
│       └── DDPG_1/
│           └── events.out.tfevents.1582974711.Sudharsan
├── 17. Reinforcement Learning Frontiers/
│   ├── .ipynb_checkpoints/
│   │   └── 15.01. Meta Reinforcement Learning-checkpoint.ipynb
│   └── README.md
└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.01. Basic Idea of Reinforcement Learning -checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "\n",
    "Reinforcement Learning (RL) is one of the areas of Machine Learning (ML). Unlike\n",
    "other ML paradigms, such as supervised and unsupervised learning, RL works in a\n",
    "trial and error fashion by interacting with its environment.\n",
    "\n",
    "RL is one of the most active areas of research in artificial intelligence, and it is\n",
    "believed that RL will take us a step closer towards achieving artificial general\n",
    "intelligence. RL has evolved rapidly in the past few years with a wide variety of\n",
    "applications ranging from building a recommendation system to self-driving cars.\n",
    "The major reason for this evolution is the advent of deep reinforcement learning,\n",
    "which is a combination of deep learning and RL. With the emergence of new RL\n",
    "algorithms and libraries, RL is clearly one of the most promising areas of ML.\n",
    "\n",
    "In this chapter, we will build a strong foundation in RL by exploring several\n",
    "important and fundamental concepts involved in RL. In this chapter, we will learn about the following topics:\n",
    "\n",
    "* Key elements of RL\n",
    "* The basic idea of RL\n",
    "* The RL algorithm\n",
    "* How RL differs from other ML paradigms\n",
    "* The Markov Decision Processes\n",
    "* Fundamental concepts of RL\n",
    "* Applications of RL\n",
    "* RL glossary\n",
    "\n",
    "We will begin the chapter by understanding Key elements of RL. This will help us understand the\n",
    "basic idea of RL."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Key Elements of Reinforcement Learning \n",
    "\n",
    "Let's begin by understanding some key elements of RL.\n",
    "\n",
    "## Agent \n",
    "\n",
    "An agent is a software program that learns to make intelligent decisions. We can\n",
    "say that an agent is a learner in the RL setting. For instance, a chess player can be\n",
    "considered an agent since the player learns to make the best moves (decisions) to win\n",
    "the game. Similarly, Mario in a Super Mario Bros video game can be considered an\n",
    "agent since Mario explores the game and learns to make the best moves in the game.\n",
    "\n",
    "\n",
    "## Environment \n",
    "The environment is the world of the agent. The agent stays within the environment.\n",
    "For instance, coming back to our chess game, a chessboard is called the environment\n",
    "since the chess player (agent) learns to play the game of chess within the chessboard\n",
    "(environment). Similarly, in Super Mario Bros, the world of Mario is called the\n",
    "environment.\n",
    "\n",
    "## State and action\n",
    "A state is a position or a moment in the environment that the agent can be in. We\n",
    "learned that the agent stays within the environment, and there can be many positions\n",
    "in the environment that the agent can stay in, and those positions are called states.\n",
    "For instance, in our chess game example, each position on the chessboard is called\n",
    "the state. The state is usually denoted by s.\n",
    "\n",
    "The agent interacts with the environment and moves from one state to another\n",
    "by performing an action. In the chess game environment, the action is the move\n",
    "performed by the player (agent). The action is usually denoted by a.\n",
    "\n",
    "\n",
    "## Reward\n",
    "\n",
    "We learned that the agent interacts with an environment by performing an action\n",
    "and moves from one state to another. Based on the action, the agent receives a\n",
    "reward. A reward is nothing but a numerical value, say, +1 for a good action and -1\n",
    "for a bad action. How do we decide if an action is good or bad?\n",
    "In our chess game example, if the agent makes a move in which it takes one of the\n",
    "opponent's chess pieces, then it is considered a good action and the agent receives\n",
    "a positive reward. Similarly, if the agent makes a move that leads to the opponent\n",
    "taking the agent's chess piece, then it is considered a bad action and the agent\n",
    "receives a negative reward. The reward is denoted by r.\n",
    "\n",
    "\n",
    "In the next section, let us explore basic idea of reinforcement learning. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.02. Key Elements of Reinforcement Learning -checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic Idea of Reinforcement Learning \n",
    "\n",
    "\n",
    "\n",
    "Let's begin with an analogy. Let's suppose we are teaching a dog (agent) to catch a\n",
    "ball. Instead of teaching the dog explicitly to catch a ball, we just throw a ball and\n",
    "every time the dog catches the ball, we give the dog a cookie (reward). If the dog\n",
    "fails to catch the ball, then we do not give it a cookie. So, the dog will figure out\n",
    "what action caused it to receive a cookie and repeat that action. Thus, the dog will\n",
    "understand that catching the ball caused it to receive a cookie and will attempt to\n",
    "repeat catching the ball. Thus, in this way, the dog will learn to catch a ball while\n",
    "aiming to maximize the cookies it can receive.\n",
    "\n",
    "Similarly, in an RL setting, we will not teach the agent what to do or how to do it;\n",
    "instead, we will give a reward to the agent for every action it does. We will give\n",
    "a positive reward to the agent when it performs a good action and we will give a\n",
    "negative reward to the agent when it performs a bad action. The agent begins by\n",
    "performing a random action and if the action is good, we then give the agent a\n",
    "positive reward so that the agent understands it has performed a good action and it\n",
    "will repeat that action. If the action performed by the agent is bad, then we will give\n",
    "the agent a negative reward so that the agent will understand it has performed a bad\n",
    "action and it will not repeat that action.\n",
    "\n",
    "Thus, RL can be viewed as a trial and error learning process where the agent tries out\n",
    "different actions and learns the good action, which gives a positive reward.\n",
    "\n",
    "In the dog analogy, the dog represents the agent, and giving a cookie to the dog\n",
    "upon it catching the ball is a positive reward and not giving a cookie is a negative\n",
    "reward. So, the dog (agent) explores different actions, which are catching the ball\n",
    "and not catching the ball, and understands that catching the ball is a good action as it\n",
    "brings the dog a positive reward (getting a cookie).\n",
    "\n",
    "\n",
    "Let's further explore the idea of RL with one more simple example. Let's suppose we\n",
    "want to teach a robot (agent) to walk without hitting a mountain, as the following figure shows: \n",
    "\n",
    "![title](Images/1.png)\n",
    "\n",
    "We will not teach the robot explicitly to not go in the direction of the mountain.\n",
    "Instead, if the robot hits the mountain and gets stuck, we give the robot a negative\n",
    "reward, say -1. So, the robot will understand that hitting the mountain is the wrong\n",
    "action, and it will not repeat that action:\n",
    "\n",
    "\n",
    "![title](Images/2.png)\n",
    "\n",
    "Similarly, when the robot walks in the right direction without hitting the mountain,\n",
    "we give the robot a positive reward, say +1. So, the robot will understand that not\n",
    "hitting the mountain is a good action, and it will repeat that action:\n",
    "\n",
    "![title](Images/3.png)\n",
    "\n",
    "Thus, in the RL setting, the agent explores different actions and learns the best action\n",
    "based on the reward it gets.\n",
    "Now that we have a basic idea of how RL works, in the upcoming sections, we will\n",
    "go into more detail and also learn the important concepts involved in RL."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.03. Reinforcement Learning Algorithm-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reinforcement Learning algorithm\n",
    "\n",
    "The steps involved in a typical RL algorithm are as follows:\n",
    "\n",
    "1. First, the agent interacts with the environment by performing an action.\n",
    "2. By performing an action, the agent moves from one state to another.\n",
    "3. Then the agent will receive a reward based on the action it performed.\n",
    "4. Based on the reward, the agent will understand whether the action is good or bad.\n",
    "5. If the action was good, that is, if the agent received a positive reward, then the agent will prefer performing that action, else the agent will try performing other actions in search of a positive reward.\n",
    "\n",
    "RL is basically a trial and error learning process. Now, let's revisit our chess game\n",
    "example. The agent (software program) is the chess player. So, the agent interacts\n",
    "with the environment (chessboard) by performing an action (moves). If the agent\n",
    "gets a positive reward for an action, then it will prefer performing that action; else it\n",
    "will find a different action that gives a positive reward.\n",
    "\n",
    "Ultimately, the goal of the agent is to maximize the reward it gets. If the agent\n",
    "receives a good reward, then it means it has performed a good action. If the agent\n",
    "performs a good action, then it implies that it can win the game. Thus, the agent\n",
    "learns to win the game by maximizing the reward."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.04. RL agent in the Grid World -checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RL agent in the Grid World \n",
    "\n",
    "Let's strengthen our understanding of RL by looking at another simple example.\n",
    "Consider the following grid world environment:\n",
    "\n",
    "![title](Images/4.png)\n",
    "\n",
    "The positions A to I in the environment are called the states of the environment.\n",
    "The goal of the agent is to reach state I by starting from state A without visiting\n",
    "the shaded states (B, C, G, and H). Thus, in order to achieve the goal, whenever\n",
    "our agent visits a shaded state, we will give a negative reward (say -1) and when it\n",
    "visits an unshaded state, we will give a positive reward (say +1). The actions in the\n",
    "environment are moving up, down, right and left. The agent can perform any of these\n",
    "four actions to reach state I from state A.\n",
    "\n",
    "The first time the agent interacts with the environment (the first iteration), the agent\n",
    "is unlikely to perform the correct action in each state, and thus it receives a negative\n",
    "reward. That is, in the first iteration, the agent performs a random action in each\n",
    "state, and this may lead the agent to receive a negative reward. But over a series of\n",
    "iterations, the agent learns to perform the correct action in each state through the\n",
    "reward it obtains, helping it achieve the goal. Let us explore this in detail.\n",
    "\n",
    "## Iteration 1:\n",
    "\n",
    "As we learned, in the first iteration, the agent performs a random action in each state.\n",
    "For instance, look at the following figure. In the first iteration, the agent moves right\n",
    "from state A and reaches the new state B. But since B is the shaded state, the agent\n",
    "will receive a negative reward and so the agent will understand that moving right is\n",
    "not a good action in state A. When it visits state A next time, it will try out a different\n",
    "action instead of moving right:\n",
    "\n",
    "![title](Images/5.PNG)\n",
    "\n",
    "As the avove figure shows, from state B, the agent moves down and reaches the new state\n",
    "E. Since E is an unshaded state, the agent will receive a positive reward, so the agent\n",
    "will understand that moving down from state B is a good action.\n",
    "\n",
    "From state E, the agent moves right and reaches state F. Since F is an unshaded state,\n",
    "the agent receives a positive reward, and it will understand that moving right from\n",
    "state E is a good action. From state F, the agent moves down and reaches the goal\n",
    "state I and receives a positive reward, so the agent will understand that moving\n",
    "down from state F is a good action.\n",
    "\n",
    "\n",
    "## Iteration 2:\n",
    "\n",
    "In the second iteration, from state A, instead of moving right, the agent tries out a\n",
    "different action as the agent learned in the previous iteration that moving right is not\n",
    "a good action in state A.\n",
    "\n",
    "Thus, as the following figure shows, in this iteration the agent moves down from state A and\n",
    "reaches state D. Since D is an unshaded state, the agent receives a positive reward\n",
    "and now the agent will understand that moving down is a good action in state A:\n",
    "\n",
    "\n",
    "![title](Images/6.PNG)\n",
    "\n",
    "As shown in the preceding figure, from state D, the agent moves down and reaches\n",
    "state G. But since G is a shaded state, the agent will receive a negative reward and\n",
    "so the agent will understand that moving down is not a good action in state D, and\n",
    "when it visits state D next time, it will try out a different action instead of moving\n",
    "down.\n",
    "\n",
    "From G, the agent moves right and reaches state H. Since H is a shaded state, it will\n",
    "receive a negative reward and understand that moving right is not a good action in\n",
    "state G.\n",
    "\n",
    "From H it moves right and reaches the goal state I and receives a positive reward, so\n",
    "the agent will understand that moving right from state H is a good action.\n",
    "\n",
    "\n",
    "## Iteration 3:\n",
    "\n",
    "In the third iteration, the agent moves down from state A since, in the second\n",
    "iteration, our agent learned that moving down is a good action in state A. So, the\n",
    "agent moves down from state A and reaches the next state, D, as the following figure shows:\n",
    "\n",
    "![title](Images/7.PNG)\n",
    "\n",
    "Now, from state D, the agent tries a different action instead of moving down since in\n",
    "the second iteration our agent learned that moving down is not a good action in state\n",
    "D. So, in this iteration, the agent moves right from state D and reaches state E.\n",
    "\n",
    "From state E, the agent moves right as the agent already learned in the first iteration\n",
    "that moving right from state E is a good action and reaches state F.\n",
    "\n",
    "Now, from state F, the agent moves down since the agent learned in the first iteration\n",
    "that moving down is a good action in state F, and reaches the goal state I.\n",
    "\n",
    "The following figure shows the result of the third iteration:\n",
    "![title](Images/7.PNG)\n",
    "\n",
    "As we can see, our agent has successfully learned to reach the goal state I from state\n",
    "A without visiting the shaded states based on the rewards.\n",
    "\n",
    "In this way, the agent will try out different actions in each state and understand\n",
    "whether an action is good or bad based on the reward it obtains. The goal of the\n",
    "agent is to maximize rewards. So, the agent will always try to perform good actions\n",
    "that give a positive reward, and when the agent performs good actions in each state,\n",
    "then it ultimately leads the agent to achieve the goal.\n",
    "\n",
    "Note that these iterations are called episodes in RL terminology. We will learn more\n",
    "about episodes later in the chapter."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.05. How RL differs from other ML paradigms?-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How RL differs from other ML paradigms?\n",
    "\n",
    "We can categorize ML into three types:\n",
    "* Supervised learning\n",
    "* Unsupervised learning\n",
    "* Reinforcement learning\n",
    "\n",
    "In supervised learning, the machine learns from training data. The training data\n",
    "consists of a labeled pair of inputs and outputs. So, we train the model (agent)\n",
    "using the training data in such a way that the model can generalize its learning to\n",
    "new unseen data. It is called supervised learning because the training data acts as a\n",
    "supervisor, since it has a labeled pair of inputs and outputs, and it guides the model\n",
    "in learning the given task.\n",
    "\n",
    "Now, let's understand the difference between supervised and reinforcement learning\n",
    "with an example. Consider the dog analogy we discussed earlier in the chapter. In\n",
    "supervised learning, to teach the dog to catch a ball, we will teach it explicitly by\n",
    "specifying turn left, go right, move forward seven steps, catch the ball, and so on\n",
    "in the form of training data. But in RL, we just throw a ball, and every time the dog\n",
    "catches the ball, we give it a cookie (reward). So, the dog will learn to catch the ball\n",
    "while trying to maximize the cookies (reward) it can get.\n",
    "\n",
    "Let's consider one more example. Say we want to train the model to play chess using\n",
    "supervised learning. In this case, we will have training data that includes all the\n",
    "moves a player can make in each state, along with labels indicating whether it is a\n",
    "good move or not. Then, we train the model to learn from this training data, whereas\n",
    "in the case of RL, our agent will not be given any sort of training data; instead, we\n",
    "just give a reward to the agent for each action it performs. Then, the agent will learn\n",
    "by interacting with the environment and, based on the reward it gets, it will choose\n",
    "its actions.\n",
    "\n",
    "Similar to supervised learning, in unsupervised learning, we train the model (agent)\n",
    "based on the training data. But in the case of unsupervised learning, the training data\n",
    "does not contain any labels; that is, it consists of only inputs and not outputs. The\n",
    "goal of unsupervised learning is to determine hidden patterns in the input. There is\n",
    "a common misconception that RL is a kind of unsupervised learning, but it is not. In\n",
    "unsupervised learning, the model learns the hidden structure, whereas, in RL, the\n",
    "model learns by maximizing the reward.\n",
    "\n",
    "For instance, consider a movie recommendation system. Say we want to recommend\n",
    "a new movie to the user. With unsupervised learning, the model (agent) will find\n",
    "movies similar to the movies the user (or users with a profile similar to the user) has\n",
    "viewed before and recommend new movies to the user.\n",
    "\n",
    "With RL, the agent constantly receives feedback from the user. This feedback\n",
    "represents rewards (a reward could be ratings the user has given for a movie they\n",
    "have watched, time spent watching a movie, time spent watching trailers, and so on).\n",
    "Based on the rewards, an RL agent will understand the movie preference of the user\n",
    "and then suggest new movies accordingly.\n",
    "\n",
    "Since the RL agent is learning with the aid of rewards, it can understand if the user's\n",
    "movie preference changes and suggest new movies according to the user's changed\n",
    "movie preference dynamically.\n",
    "\n",
    "Thus, we can say that in both supervised and unsupervised learning the model\n",
    "(agent) learns based on the given training dataset, whereas in RL the agent learns\n",
    "by directly interacting with the environment. Thus, RL is essentially an interaction\n",
    "between the agent and its environment.\n",
    "\n",
    "Before moving on to the fundamental concepts of RL, we will introduce a popular\n",
    "process to aid decision-making in an RL environment.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.06. Markov Decision Processes-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Markov Decision Processes \n",
    "\n",
    "The Markov Decision Process (MDP) provides a mathematical framework for\n",
    "solving the RL problem. Almost all RL problems can be modeled as an MDP. MDPs\n",
    "are widely used for solving various optimization problems. In this section, we will\n",
    "understand what an MDP is and how it is used in RL.\n",
    "\n",
    "To understand an MDP, first, we need to learn about the Markov property and\n",
    "Markov chain.\n",
    "\n",
    "\n",
    "## Markov Property and Markov Chain \n",
    "\n",
    "The Markov property states that the future depends only on the present and not\n",
    "on the past. The Markov chain, also known as the Markov process, consists of a\n",
    "sequence of states that strictly obey the Markov property; that is, the Markov chain\n",
    "is the probabilistic model that solely depends on the current state to predict the next\n",
    "state and not the previous states, that is, the future is conditionally independent of\n",
    "the past.\n",
    "\n",
    "For example, if we want to predict the weather and we know that the current state is\n",
    "cloudy, we can predict that the next state could be rainy. We concluded that the next\n",
    "state is likely to be rainy only by considering the current state (cloudy) and not the\n",
    "previous states, which might have been sunny, windy, and so on.\n",
    "However, the Markov property does not hold for all processes. For instance,\n",
    "throwing a dice (the next state) has no dependency on the previous number that\n",
    "showed up on the dice (the current state).\n",
    "\n",
    "Moving from one state to another is called a transition, and its probability is called\n",
    "a transition probability. We denote the transition probability by $P(s'|s) $. It indicates\n",
    "the probability of moving from the state $s$ to the next state $s'$.\n",
    "\n",
    "Say we have three states (cloudy, rainy, and windy) in our Markov chain. Then we can represent the\n",
    "probability of transitioning from one state to another using a table called a Markov\n",
    "table, as shown in the following table:\n",
    "\n",
    "![title](Images/8.PNG)\n",
    "\n",
    "From the above table, we can observe that:\n",
    "\n",
    "* From the state cloudy, we transition to the state rainy with 70% probability and to the state windy with 30% probability.\n",
    "\n",
    "* From the state rainy, we transition to the same state rainy with 80% probability and to the state cloudy with 20% probability.\n",
    "\n",
    "* From the state windy, we transition to the state rainy with 100% probability.\n",
    "\n",
    "We can also represent this transition information of the Markov chain in the form of\n",
    "a state diagram, as shown below:\n",
    "\n",
    "\n",
    "![title](Images/9.png)\n",
    "We can also formulate the transition probabilities into a matrix called the transition\n",
    "matrix, as shown below:\n",
    "\n",
    "![title](Images/10.PNG)\n",
    "\n",
    "Thus, to conclude, we can say that the Markov chain or Markov process consists of a\n",
    "set of states along with their transition probabilities.\n",
    "\n",
    "## Markov Reward Process\n",
    "\n",
    "The Markov Reward Process (MRP) is an extension of the Markov chain with the\n",
    "reward function. That is, we learned that the Markov chain consists of states and a\n",
    "transition probability. The MRP consists of states, a transition probability, and also a\n",
    "reward function.\n",
    "\n",
    "A reward function tells us the reward we obtain in each state. For instance, based on\n",
    "our previous weather example, the reward function tells us the reward we obtain\n",
    "in the state cloudy, the reward we obtain in the state windy, and so on. The reward\n",
    "function is usually denoted by $R(s)$.\n",
    "\n",
    "Thus, the MRP consists of states $s$, a transition probability $P(s|s')$\n",
    "function $R(s)$. \n",
    "\n",
    "## Markov Decision Process\n",
    "\n",
    "The Markov Decision Process (MDP) is an extension of the MRP with actions. That\n",
    "is, we learned that the MRP consists of states, a transition probability, and a reward\n",
    "function. The MDP consists of states, a transition probability, a reward function,\n",
    "and also actions. We learned that the Markov property states that the next state is\n",
    "dependent only on the current state and is not based on the previous state. Is the\n",
    "Markov property applicable to the RL setting? Yes! In the RL environment, the agent\n",
    "makes decisions only based on the current state and not based on the past states. So,\n",
    "we can model an RL environment as an MDP.\n",
    "\n",
    "Let's understand this with an example. Given any environment, we can formulate\n",
    "the environment using an MDP. For instance, let's consider the same grid world\n",
    "environment we learned earlier. The following figure shows the grid world environment,\n",
    "and the goal of the agent is to reach state I from state A without visiting the shaded\n",
    "state\n",
    "\n",
    "\n",
    "![title](Images/11.png)\n",
    "\n",
    "An agent makes a decision (action) in the environment only based on the current\n",
    "state the agent is in and not based on the past state. So, we can formulate our\n",
    "environment as an MDP. We learned that the MDP consists of states, actions,\n",
    "transition probabilities, and a reward function. Now, let's learn how this relates to\n",
    "our RL environment:\n",
    "\n",
    "__States__ – A set of states present in the environment. Thus, in the grid world\n",
    "environment, we have states A to I.\n",
    "\n",
    "__Actions__ – A set of actions that our agent can perform in each state. An agent\n",
    "performs an action and moves from one state to another. Thus, in the grid world\n",
    "environment, the set of actions is up, down, left, and right.\n",
    "\n",
    "__Transition probability__ – The transition probability is denoted by $ P(s'|s,a) $. It\n",
    "implies the probability of moving from a state $s$ to the next state $s'$ while performing\n",
    "an action $a$. If you observe, in the MRP, the transition probability is just $ P(s'|s,a) $ that\n",
    "is, the probability of going from state $s$ to state $s'$ and it doesn't include actions. But in MDP we include the actions, thus the transition probability is denoted by $ P(s'|s,a) $. \n",
    "\n",
    "For example, in our grid world environment, say, the transition probability of moving from state A to state B while performing an action right is 100% then it can be expressed as: $P( B |A , \\text{right}) = 1.0 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/12.png)\n",
    "\n",
    "Suppose, our agent is in state C and the transition probability of moving from state C to the state F while performing an action down is 90% then it can be expressed as: $P( F |C , \\text{down}) = 0.9 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/13.png)\n",
    "\n",
    "__Reward function__ -  The reward function is denoted by $R(s,a,s') $. It implies the reward our agent obtains while transitioning from a state $s$ to the state $s'$ while performing an action $a$. \n",
    "\n",
    "Say, the reward we obtain while transitioning from the state A to the state B while performing an action right is -1, then it can be expressed as $R(A, \\text{right}, B) = -1 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/14.png)\n",
    "\n",
    "Suppose, our agent is in state C and say, the reward we obtain while transitioning from the state C to the state F while performing an action down is  +1, then it can be expressed as $R(C, \\text{down}, F) = +1 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/15.png)\n",
    "\n",
    "\n",
    "Thus, an RL environment can be represented as an MDP with states, actions,\n",
    "transition probability, and the reward function. But wait! What is the use of\n",
    "representing the RL environment using the MDP? We can solve the RL problem easily\n",
    "once we model our environment as the MDP. For instance, once we model our grid\n",
    "world environment using the MDP, then we can easily find how to reach the goal\n",
    "state I from state A without visiting the shaded states. We will learn more about this\n",
    "in the upcoming chapters. Next, we will go through more essential concepts of RL.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.07. Action space, Policy, Episode and Horizon-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Action space, Policy, Episode, Horizon\n",
    "\n",
    "In this section, we will learn about the several important fundamental concepts that are involved in reinforcement learning. \n",
    "\n",
    "## Action space\n",
    "Consider the grid world environment shown below:\n",
    "\n",
    "![title](Images/16.png)\n",
    "\n",
    "\n",
    "In the preceding grid world environment, the goal of the agent is to reach state I\n",
    "starting from state A without visiting the shaded states. In each of the states, the\n",
    "agent can perform any of the four actions—up, down, left, and right—to achieve the\n",
    "goal. The set of all possible actions in the environment is called the action space.\n",
    "Thus, for this grid world environment, the action space will be [up, down, left, right].\n",
    "We can categorize action spaces into two types:\n",
    "\n",
    "* Discrete action space \n",
    "* Continuous action space\n",
    "\n",
    "__Discrete action space__ -When our action space consists of actions that are discrete,\n",
    "then it is called a discrete action space. For instance, in the grid world environment,\n",
    "our action space consists of four discrete actions, which are up, down, left, right, and\n",
    "so it is called a discrete action space.\n",
    "\n",
    "__Continuous action space__ - When our action space consists of actions that are\n",
    "continuous, then it is called a continuous action space. For instance, let's suppose\n",
    "we are training an agent to drive a car, then our action space will consist of several\n",
    "actions that have continuous values, such as the speed at which we need to drive the\n",
    "car, the number of degrees we need to rotate the wheel, and so on. In cases where\n",
    "our action space consists of actions that are continuous, it is called a continuous\n",
    "action space.\n",
    "\n",
    "## Policy\n",
    "\n",
    "A policy defines the agent's behavior in an environment. The policy tells the agent\n",
    "what action to perform in each state. For instance, in the grid world environment, we\n",
    "have states A to I and four possible actions. The policy may tell the agent to move\n",
    "down in state A, move right in state D, and so on.\n",
    "\n",
    "To interact with the environment for the first time, we initialize a random policy, that\n",
    "is, the random policy tells the agent to perform a random action in each state. Thus,\n",
    "in an initial iteration, the agent performs a random action in each state and tries to\n",
    "learn whether the action is good or bad based on the reward it obtains. Over a series\n",
    "of iterations, an agent will learn to perform good actions in each state, which gives a\n",
    "positive reward. Thus, we can say that over a series of iterations, the agent will learn\n",
    "a good policy that gives a positive reward.\n",
    "\n",
    "The optimal policy is shown in the following figure. As we can observe, the agent selects the\n",
    "action in each state based on the optimal policy and reaches the terminal state I from\n",
    "the starting state A without visiting the shaded states:\n",
    "\n",
    "![title](Images/17.png)\n",
    "\n",
    "Thus, the optimal policy tells the agent to perform the correct action in each state so\n",
    "that the agent can receive a good reward.\n",
    "\n",
    "A policy can be classified into two:\n",
    "\n",
    "* Deterministic Policy\n",
    "* Stochastic Policy\n",
    "\n",
    "### Deterministic Policy\n",
    "The policy which we just learned above is called deterministic policy. That is, deterministic policy tells the agent to perform a one particular action in a state. Thus, the deterministic policy maps the state to one particular action and is often denoted by $\\mu$. Given a state $s$ at a time $t$, a deterministic policy tells the agent to perform a one particular action $a$. It can be expressed as:\n",
    "\n",
    "$$a_t = \\mu(s_t) $$\n",
    "\n",
    "For instance, consider our grid world example, given a state A, the deterministic policy $\\mu$ tells the agent to perform an action down and it can be expressed as:\n",
    "\n",
    "$$\\mu (A) = \\text{Down} $$\n",
    "\n",
    "Thus, according to the deterministic policy, whenever the agent visits state A, it performs the action down. \n",
    "\n",
    "### Stochastic Policy\n",
    "\n",
    "Unlike deterministic policy, the stochastic policy does not map the state directly to one particular action, instead, it maps the state to a probability distribution over an action space. \n",
    "\n",
    "That is, we learned that given a state, the deterministic policy will tell the agent to perform one particular action in the given state, so, whenever the agent visits the state it always performs the same particular action. But with stochastic policy, given a state, the stochastic policy will return a probability distribution over an action space so instead of performing the same action every time the agent visits the state, the agent performs different actions each time based on a probability distribution returned by the stochastic policy. \n",
    "\n",
    "Let's understand this with an example, we know that our grid world environment's action space consists of 4 actions which are [up, down, left, right]. Given a state A, the stochastic policy returns the probability distribution over the action space as [0.10,0.70,0.10,0.10]. Now, whenever the agent visits the state A, instead of selecting the same particular action every time, the agent selects the action up 10% of the time, action down 70% of the time, action left 10% of time and action right 10% of the time. \n",
    "\n",
    "The difference between the deterministic policy and stochastic policy is shown below, as we can observe the deterministic policy maps the state to one particular action whereas the stochastic policy maps the state to the probability distribution over an action space:\n",
    "\n",
    "![title](Images/18.png)\n",
    "\n",
    "Thus, stochastic policy maps the state to a probability distribution over action space and it is often denoted by $\\pi$.  Say, we have a state $s$ and action $a$ at a time $t$, then we can express the stochastic policy as:\n",
    "\n",
    "\n",
    "$$a_t \\sim \\pi(s_t) $$\n",
    "\n",
    "Or it can also be expressed as $\\pi(a_t |s_t) $. \n",
    "\n",
    "We can categorize the stochastic policy into two:\n",
    "\n",
    "* Categorical policy\n",
    "* Gaussian policy\n",
    "\n",
    "### Categorical policy \n",
    "A stochastic policy is called a categorical policy when the action space is discrete. That is, the stochastic policy uses categorical probability distribution over action space to select actions when the action space is discrete. For instance, in the grid world environment, we have just seen above, we select actions based on categorical probability distribution (discrete distribution) as the action space of the environment is discrete. As shown below, given a state A, we select an action based on the categorical probability distribution over the action space:\n",
    "\n",
    "\n",
    "\n",
    "![title](Images/19.png)\n",
    "### Gaussian policy \n",
    "A stochastic policy is called a gaussian policy when our action space is continuous. That is, the stochastic policy uses Gaussian probability distribution over action space to select actions when the action space is continuous. Let's understand this with a small example. Suppose we training an agent to drive a car and say we have one continuous action in our action space. Let the action be the speed of the car and the value of the speed of the car ranges from 0 to 150 kmph. Then, the stochastic policy uses the Gaussian distribution over the action space to select action as shown below:\n",
    "\n",
    "![title](Images/20.png)\n",
    "\n",
    "\n",
    "We will learn more about the gaussian policy in the upcoming chapters.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Episode \n",
    "\n",
    "The agent interacts with the environment by performing some action starting from the initial state and reach the final state. This agent-environment interaction starting from the initial state until the final state is called an episode. For instance, in the car racing video game, the agent plays the game by starting from the initial state (starting point of the race) and reach the final state (endpoint of the race). This is considered an episode. An episode is also often called trajectory (path taken by the agent) and it is denoted by $\\tau$. \n",
    "\n",
    "An agent can play the game for any number of episodes and each episode is independent of each other. What is the use of playing the game for multiple numbers of episodes? In order to learn the optimal policy, that is, the policy which tells the agent to perform correct action in each state, the agent plays the game for many episodes. \n",
    "\n",
    "For example, let's say we are playing a car racing game, for the first time, we may not win the game and we play the game several times to understand more about the game and discover some good strategies for winning the game. Similarly, in the first episode, the agent may not win the game and it plays the game for several episodes to understand more about the game environment and good strategies to win the game. \n",
    "\n",
    "\n",
    "\n",
    "Say, we begin the game from an initial state at a time step t=0 and reach the final state at a time step T then the episode information consists of the agent environment interaction such as state, action, and reward starting from the initial state till the final state, that is, $(s_0, a_0,r_0,s_1,a_1,r_1,\\dots,s_T) $\n",
    "\n",
    "An episode (or) trajectory is shown below:\n",
    "\n",
    "![title](Images/21.png)\n",
    "\n",
    "\n",
    "Let's strengthen our understanding of the episode and optimal policy with the grid world environment. We learned that in the grid world environment, the goal of our agent is to reach the final state I starting from the initial state A without visiting the shaded states. An agent receives +1 reward when it visits the unshaded states and -1 reward when it visits the shaded states.\n",
    "\n",
    "When we say, generate an episode it means going from initial state to the final state. The agent generates the first episode using a random policy and explores the environment and over several episodes, it will learn the optimal policy. \n",
    "\n",
    "### Episode 1:\n",
    "\n",
    "As shown below, in the first episode, the agent uses random policy and selects random action in each state starting from the initial state until the final state and observe the reward:\n",
    "\n",
    "\n",
    "![title](Images/22.png)\n",
    "\n",
    "\n",
    "### Episode 2:\n",
    "\n",
    "In the second episode, the agent tries a different policy to avoid negative rewards which it had received in the previous episode. For instance, as we can observe in the previous episode, the agent selected an action right in the state A and received a negative reward, so in this episode, instead of selecting action right in the state A, it tries a different action say, down as shown below:\n",
    "\n",
    "\n",
    "![title](Images/23.png)\n",
    "\n",
    "### Episode n:\n",
    "\n",
    "Thus, over a series of the episodes, the agent learns the optimal policy, that is, the policy which takes the agent to the final state I from the state A without visiting the shaded states as shown below:\n",
    "\n",
    "\n",
    "![title](Images/24.png)\n",
    "\n",
    "# Episodic and Continuous tasks \n",
    "A reinforcement learning task can be categorized into two:\n",
    "* Episodic task\n",
    "* Continuous task\n",
    "\n",
    "__Episodic task__ - As the name suggests episodic task is the one that has the terminal state. That is, episodic tasks are basically tasks made up of episodes and thus they have a terminal state. Example: Car racing game. \n",
    "\n",
    "__Continuous task__ - Unlike episodic tasks, continuous tasks do not contain any episodes and so they don't have any terminal state. For example, a personal assistance robot does not have a terminal state. \n",
    "\n",
    "\n",
    "# Horizon\n",
    "Horizon is the time step until which the agent interacts with the environment. We can classify the horizon into two:\n",
    "\n",
    "* Finite horizon\n",
    "* Infinite horizon\n",
    "\n",
    "__Finite horizon__ - If the agent environment interaction stops at a particular time step then it is called finite Horizon. For instance, in the episodic tasks agent interacts with the environment starting from the initial state at time step  t =0 and reach the final state at a time step T.  Since the agent environment interaction stops at the time step T, it is considered a finite horizon. \n",
    "\n",
    "__Infinite horizon__ - If the agent environment interaction never stops then it is called an infinite horizon. For instance, we learned that the continuous task does not have any terminal states, so the agent environment interaction will never stop in the continuous task and so it is considered an infinite horizon. \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.01. Key Elements of Reinforcement Learning .ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "\n",
    "Reinforcement Learning (RL) is one of the areas of Machine Learning (ML). Unlike\n",
    "other ML paradigms, such as supervised and unsupervised learning, RL works in a\n",
    "trial and error fashion by interacting with its environment.\n",
    "\n",
    "RL is one of the most active areas of research in artificial intelligence, and it is\n",
    "believed that RL will take us a step closer towards achieving artificial general\n",
    "intelligence. RL has evolved rapidly in the past few years with a wide variety of\n",
    "applications ranging from building a recommendation system to self-driving cars.\n",
    "The major reason for this evolution is the advent of deep reinforcement learning,\n",
    "which is a combination of deep learning and RL. With the emergence of new RL\n",
    "algorithms and libraries, RL is clearly one of the most promising areas of ML.\n",
    "\n",
    "In this chapter, we will build a strong foundation in RL by exploring several\n",
    "important and fundamental concepts involved in RL. In this chapter, we will learn about the following topics:\n",
    "\n",
    "* Key elements of RL\n",
    "* The basic idea of RL\n",
    "* The RL algorithm\n",
    "* How RL differs from other ML paradigms\n",
    "* The Markov Decision Processes\n",
    "* Fundamental concepts of RL\n",
    "* Applications of RL\n",
    "* RL glossary\n",
    "\n",
    "We will begin the chapter by understanding Key elements of RL. This will help us understand the\n",
    "basic idea of RL."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Key Elements of Reinforcement Learning \n",
    "\n",
    "Let's begin by understanding some key elements of RL.\n",
    "\n",
    "## Agent \n",
    "\n",
    "An agent is a software program that learns to make intelligent decisions. We can\n",
    "say that an agent is a learner in the RL setting. For instance, a chess player can be\n",
    "considered an agent since the player learns to make the best moves (decisions) to win\n",
    "the game. Similarly, Mario in a Super Mario Bros video game can be considered an\n",
    "agent since Mario explores the game and learns to make the best moves in the game.\n",
    "\n",
    "\n",
    "## Environment \n",
    "The environment is the world of the agent. The agent stays within the environment.\n",
    "For instance, coming back to our chess game, a chessboard is called the environment\n",
    "since the chess player (agent) learns to play the game of chess within the chessboard\n",
    "(environment). Similarly, in Super Mario Bros, the world of Mario is called the\n",
    "environment.\n",
    "\n",
    "## State and action\n",
    "A state is a position or a moment in the environment that the agent can be in. We\n",
    "learned that the agent stays within the environment, and there can be many positions\n",
    "in the environment that the agent can stay in, and those positions are called states.\n",
    "For instance, in our chess game example, each position on the chessboard is called\n",
    "the state. The state is usually denoted by s.\n",
    "\n",
    "The agent interacts with the environment and moves from one state to another\n",
    "by performing an action. In the chess game environment, the action is the move\n",
    "performed by the player (agent). The action is usually denoted by a.\n",
    "\n",
    "\n",
    "## Reward\n",
    "\n",
    "We learned that the agent interacts with an environment by performing an action\n",
    "and moves from one state to another. Based on the action, the agent receives a\n",
    "reward. A reward is nothing but a numerical value, say, +1 for a good action and -1\n",
    "for a bad action. How do we decide if an action is good or bad?\n",
    "In our chess game example, if the agent makes a move in which it takes one of the\n",
    "opponent's chess pieces, then it is considered a good action and the agent receives\n",
    "a positive reward. Similarly, if the agent makes a move that leads to the opponent\n",
    "taking the agent's chess piece, then it is considered a bad action and the agent\n",
    "receives a negative reward. The reward is denoted by r.\n",
    "\n",
    "\n",
    "In the next section, let us explore basic idea of reinforcement learning. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.02. Basic Idea of Reinforcement Learning.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic Idea of Reinforcement Learning \n",
    "\n",
    "\n",
    "\n",
    "Let's begin with an analogy. Let's suppose we are teaching a dog (agent) to catch a\n",
    "ball. Instead of teaching the dog explicitly to catch a ball, we just throw a ball and\n",
    "every time the dog catches the ball, we give the dog a cookie (reward). If the dog\n",
    "fails to catch the ball, then we do not give it a cookie. So, the dog will figure out\n",
    "what action caused it to receive a cookie and repeat that action. Thus, the dog will\n",
    "understand that catching the ball caused it to receive a cookie and will attempt to\n",
    "repeat catching the ball. Thus, in this way, the dog will learn to catch a ball while\n",
    "aiming to maximize the cookies it can receive.\n",
    "\n",
    "Similarly, in an RL setting, we will not teach the agent what to do or how to do it;\n",
    "instead, we will give a reward to the agent for every action it does. We will give\n",
    "a positive reward to the agent when it performs a good action and we will give a\n",
    "negative reward to the agent when it performs a bad action. The agent begins by\n",
    "performing a random action and if the action is good, we then give the agent a\n",
    "positive reward so that the agent understands it has performed a good action and it\n",
    "will repeat that action. If the action performed by the agent is bad, then we will give\n",
    "the agent a negative reward so that the agent will understand it has performed a bad\n",
    "action and it will not repeat that action.\n",
    "\n",
    "Thus, RL can be viewed as a trial and error learning process where the agent tries out\n",
    "different actions and learns the good action, which gives a positive reward.\n",
    "\n",
    "In the dog analogy, the dog represents the agent, and giving a cookie to the dog\n",
    "upon it catching the ball is a positive reward and not giving a cookie is a negative\n",
    "reward. So, the dog (agent) explores different actions, which are catching the ball\n",
    "and not catching the ball, and understands that catching the ball is a good action as it\n",
    "brings the dog a positive reward (getting a cookie).\n",
    "\n",
    "\n",
    "Let's further explore the idea of RL with one more simple example. Let's suppose we\n",
    "want to teach a robot (agent) to walk without hitting a mountain, as the following figure shows: \n",
    "\n",
    "![title](Images/1.png)\n",
    "\n",
    "We will not teach the robot explicitly to not go in the direction of the mountain.\n",
    "Instead, if the robot hits the mountain and gets stuck, we give the robot a negative\n",
    "reward, say -1. So, the robot will understand that hitting the mountain is the wrong\n",
    "action, and it will not repeat that action:\n",
    "\n",
    "\n",
    "![title](Images/2.png)\n",
    "\n",
    "Similarly, when the robot walks in the right direction without hitting the mountain,\n",
    "we give the robot a positive reward, say +1. So, the robot will understand that not\n",
    "hitting the mountain is a good action, and it will repeat that action:\n",
    "\n",
    "![title](Images/3.png)\n",
    "\n",
    "Thus, in the RL setting, the agent explores different actions and learns the best action\n",
    "based on the reward it gets.\n",
    "Now that we have a basic idea of how RL works, in the upcoming sections, we will\n",
    "go into more detail and also learn the important concepts involved in RL."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.03. Reinforcement Learning Algorithm.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reinforcement Learning algorithm\n",
    "\n",
    "The steps involved in a typical RL algorithm are as follows:\n",
    "\n",
    "1. First, the agent interacts with the environment by performing an action.\n",
    "2. By performing an action, the agent moves from one state to another.\n",
    "3. Then the agent will receive a reward based on the action it performed.\n",
    "4. Based on the reward, the agent will understand whether the action is good or bad.\n",
    "5. If the action was good, that is, if the agent received a positive reward, then the agent will prefer performing that action, else the agent will try performing other actions in search of a positive reward.\n",
    "\n",
    "RL is basically a trial and error learning process. Now, let's revisit our chess game\n",
    "example. The agent (software program) is the chess player. So, the agent interacts\n",
    "with the environment (chessboard) by performing an action (moves). If the agent\n",
    "gets a positive reward for an action, then it will prefer performing that action; else it\n",
    "will find a different action that gives a positive reward.\n",
    "\n",
    "Ultimately, the goal of the agent is to maximize the reward it gets. If the agent\n",
    "receives a good reward, then it means it has performed a good action. If the agent\n",
    "performs a good action, then it implies that it can win the game. Thus, the agent\n",
    "learns to win the game by maximizing the reward."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.04. RL agent in the Grid World .ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RL agent in the Grid World \n",
    "\n",
    "Let's strengthen our understanding of RL by looking at another simple example.\n",
    "Consider the following grid world environment:\n",
    "\n",
    "![title](Images/4.png)\n",
    "\n",
    "The positions A to I in the environment are called the states of the environment.\n",
    "The goal of the agent is to reach state I by starting from state A without visiting\n",
    "the shaded states (B, C, G, and H). Thus, in order to achieve the goal, whenever\n",
    "our agent visits a shaded state, we will give a negative reward (say -1) and when it\n",
    "visits an unshaded state, we will give a positive reward (say +1). The actions in the\n",
    "environment are moving up, down, right and left. The agent can perform any of these\n",
    "four actions to reach state I from state A.\n",
    "\n",
    "The first time the agent interacts with the environment (the first iteration), the agent\n",
    "is unlikely to perform the correct action in each state, and thus it receives a negative\n",
    "reward. That is, in the first iteration, the agent performs a random action in each\n",
    "state, and this may lead the agent to receive a negative reward. But over a series of\n",
    "iterations, the agent learns to perform the correct action in each state through the\n",
    "reward it obtains, helping it achieve the goal. Let us explore this in detail.\n",
    "\n",
    "## Iteration 1:\n",
    "\n",
    "As we learned, in the first iteration, the agent performs a random action in each state.\n",
    "For instance, look at the following figure. In the first iteration, the agent moves right\n",
    "from state A and reaches the new state B. But since B is the shaded state, the agent\n",
    "will receive a negative reward and so the agent will understand that moving right is\n",
    "not a good action in state A. When it visits state A next time, it will try out a different\n",
    "action instead of moving right:\n",
    "\n",
    "![title](Images/5.PNG)\n",
    "\n",
    "As the avove figure shows, from state B, the agent moves down and reaches the new state\n",
    "E. Since E is an unshaded state, the agent will receive a positive reward, so the agent\n",
    "will understand that moving down from state B is a good action.\n",
    "\n",
    "From state E, the agent moves right and reaches state F. Since F is an unshaded state,\n",
    "the agent receives a positive reward, and it will understand that moving right from\n",
    "state E is a good action. From state F, the agent moves down and reaches the goal\n",
    "state I and receives a positive reward, so the agent will understand that moving\n",
    "down from state F is a good action.\n",
    "\n",
    "\n",
    "## Iteration 2:\n",
    "\n",
    "In the second iteration, from state A, instead of moving right, the agent tries out a\n",
    "different action as the agent learned in the previous iteration that moving right is not\n",
    "a good action in state A.\n",
    "\n",
    "Thus, as the following figure shows, in this iteration the agent moves down from state A and\n",
    "reaches state D. Since D is an unshaded state, the agent receives a positive reward\n",
    "and now the agent will understand that moving down is a good action in state A:\n",
    "\n",
    "\n",
    "![title](Images/6.PNG)\n",
    "\n",
    "As shown in the preceding figure, from state D, the agent moves down and reaches\n",
    "state G. But since G is a shaded state, the agent will receive a negative reward and\n",
    "so the agent will understand that moving down is not a good action in state D, and\n",
    "when it visits state D next time, it will try out a different action instead of moving\n",
    "down.\n",
    "\n",
    "From G, the agent moves right and reaches state H. Since H is a shaded state, it will\n",
    "receive a negative reward and understand that moving right is not a good action in\n",
    "state G.\n",
    "\n",
    "From H it moves right and reaches the goal state I and receives a positive reward, so\n",
    "the agent will understand that moving right from state H is a good action.\n",
    "\n",
    "\n",
    "## Iteration 3:\n",
    "\n",
    "In the third iteration, the agent moves down from state A since, in the second\n",
    "iteration, our agent learned that moving down is a good action in state A. So, the\n",
    "agent moves down from state A and reaches the next state, D, as the following figure shows:\n",
    "\n",
    "![title](Images/7.PNG)\n",
    "\n",
    "Now, from state D, the agent tries a different action instead of moving down since in\n",
    "the second iteration our agent learned that moving down is not a good action in state\n",
    "D. So, in this iteration, the agent moves right from state D and reaches state E.\n",
    "\n",
    "From state E, the agent moves right as the agent already learned in the first iteration\n",
    "that moving right from state E is a good action and reaches state F.\n",
    "\n",
    "Now, from state F, the agent moves down since the agent learned in the first iteration\n",
    "that moving down is a good action in state F, and reaches the goal state I.\n",
    "\n",
    "The following figure shows the result of the third iteration:\n",
    "![title](Images/7.PNG)\n",
    "\n",
    "As we can see, our agent has successfully learned to reach the goal state I from state\n",
    "A without visiting the shaded states based on the rewards.\n",
    "\n",
    "In this way, the agent will try out different actions in each state and understand\n",
    "whether an action is good or bad based on the reward it obtains. The goal of the\n",
    "agent is to maximize rewards. So, the agent will always try to perform good actions\n",
    "that give a positive reward, and when the agent performs good actions in each state,\n",
    "then it ultimately leads the agent to achieve the goal.\n",
    "\n",
    "Note that these iterations are called episodes in RL terminology. We will learn more\n",
    "about episodes later in the chapter."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.05. How RL differs from other ML paradigms?.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How RL differs from other ML paradigms?\n",
    "\n",
    "We can categorize ML into three types:\n",
    "* Supervised learning\n",
    "* Unsupervised learning\n",
    "* Reinforcement learning\n",
    "\n",
    "In supervised learning, the machine learns from training data. The training data\n",
    "consists of a labeled pair of inputs and outputs. So, we train the model (agent)\n",
    "using the training data in such a way that the model can generalize its learning to\n",
    "new unseen data. It is called supervised learning because the training data acts as a\n",
    "supervisor, since it has a labeled pair of inputs and outputs, and it guides the model\n",
    "in learning the given task.\n",
    "\n",
    "Now, let's understand the difference between supervised and reinforcement learning\n",
    "with an example. Consider the dog analogy we discussed earlier in the chapter. In\n",
    "supervised learning, to teach the dog to catch a ball, we will teach it explicitly by\n",
    "specifying turn left, go right, move forward seven steps, catch the ball, and so on\n",
    "in the form of training data. But in RL, we just throw a ball, and every time the dog\n",
    "catches the ball, we give it a cookie (reward). So, the dog will learn to catch the ball\n",
    "while trying to maximize the cookies (reward) it can get.\n",
    "\n",
    "Let's consider one more example. Say we want to train the model to play chess using\n",
    "supervised learning. In this case, we will have training data that includes all the\n",
    "moves a player can make in each state, along with labels indicating whether it is a\n",
    "good move or not. Then, we train the model to learn from this training data, whereas\n",
    "in the case of RL, our agent will not be given any sort of training data; instead, we\n",
    "just give a reward to the agent for each action it performs. Then, the agent will learn\n",
    "by interacting with the environment and, based on the reward it gets, it will choose\n",
    "its actions.\n",
    "\n",
    "Similar to supervised learning, in unsupervised learning, we train the model (agent)\n",
    "based on the training data. But in the case of unsupervised learning, the training data\n",
    "does not contain any labels; that is, it consists of only inputs and not outputs. The\n",
    "goal of unsupervised learning is to determine hidden patterns in the input. There is\n",
    "a common misconception that RL is a kind of unsupervised learning, but it is not. In\n",
    "unsupervised learning, the model learns the hidden structure, whereas, in RL, the\n",
    "model learns by maximizing the reward.\n",
    "\n",
    "For instance, consider a movie recommendation system. Say we want to recommend\n",
    "a new movie to the user. With unsupervised learning, the model (agent) will find\n",
    "movies similar to the movies the user (or users with a profile similar to the user) has\n",
    "viewed before and recommend new movies to the user.\n",
    "\n",
    "With RL, the agent constantly receives feedback from the user. This feedback\n",
    "represents rewards (a reward could be ratings the user has given for a movie they\n",
    "have watched, time spent watching a movie, time spent watching trailers, and so on).\n",
    "Based on the rewards, an RL agent will understand the movie preference of the user\n",
    "and then suggest new movies accordingly.\n",
    "\n",
    "Since the RL agent is learning with the aid of rewards, it can understand if the user's\n",
    "movie preference changes and suggest new movies according to the user's changed\n",
    "movie preference dynamically.\n",
    "\n",
    "Thus, we can say that in both supervised and unsupervised learning the model\n",
    "(agent) learns based on the given training dataset, whereas in RL the agent learns\n",
    "by directly interacting with the environment. Thus, RL is essentially an interaction\n",
    "between the agent and its environment.\n",
    "\n",
    "Before moving on to the fundamental concepts of RL, we will introduce a popular\n",
    "process to aid decision-making in an RL environment.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.06. Markov Decision Processes.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Markov Decision Processes \n",
    "\n",
    "The Markov Decision Process (MDP) provides a mathematical framework for\n",
    "solving the RL problem. Almost all RL problems can be modeled as an MDP. MDPs\n",
    "are widely used for solving various optimization problems. In this section, we will\n",
    "understand what an MDP is and how it is used in RL.\n",
    "\n",
    "To understand an MDP, first, we need to learn about the Markov property and\n",
    "Markov chain.\n",
    "\n",
    "\n",
    "## Markov Property and Markov Chain \n",
    "\n",
    "The Markov property states that the future depends only on the present and not\n",
    "on the past. The Markov chain, also known as the Markov process, consists of a\n",
    "sequence of states that strictly obey the Markov property; that is, the Markov chain\n",
    "is the probabilistic model that solely depends on the current state to predict the next\n",
    "state and not the previous states, that is, the future is conditionally independent of\n",
    "the past.\n",
    "\n",
    "For example, if we want to predict the weather and we know that the current state is\n",
    "cloudy, we can predict that the next state could be rainy. We concluded that the next\n",
    "state is likely to be rainy only by considering the current state (cloudy) and not the\n",
    "previous states, which might have been sunny, windy, and so on.\n",
    "However, the Markov property does not hold for all processes. For instance,\n",
    "throwing a dice (the next state) has no dependency on the previous number that\n",
    "showed up on the dice (the current state).\n",
    "\n",
    "Moving from one state to another is called a transition, and its probability is called\n",
    "a transition probability. We denote the transition probability by $P(s'|s) $. It indicates\n",
    "the probability of moving from the state $s$ to the next state $s'$.\n",
    "\n",
    "Say we have three states (cloudy, rainy, and windy) in our Markov chain. Then we can represent the\n",
    "probability of transitioning from one state to another using a table called a Markov\n",
    "table, as shown in the following table:\n",
    "\n",
    "![title](Images/8.PNG)\n",
    "\n",
    "From the above table, we can observe that:\n",
    "\n",
    "* From the state cloudy, we transition to the state rainy with 70% probability and to the state windy with 30% probability.\n",
    "\n",
    "* From the state rainy, we transition to the same state rainy with 80% probability and to the state cloudy with 20% probability.\n",
    "\n",
    "* From the state windy, we transition to the state rainy with 100% probability.\n",
    "\n",
    "We can also represent this transition information of the Markov chain in the form of\n",
    "a state diagram, as shown below:\n",
    "\n",
    "\n",
    "![title](Images/9.png)\n",
    "We can also formulate the transition probabilities into a matrix called the transition\n",
    "matrix, as shown below:\n",
    "\n",
    "![title](Images/10.PNG)\n",
    "\n",
    "Thus, to conclude, we can say that the Markov chain or Markov process consists of a\n",
    "set of states along with their transition probabilities.\n",
    "\n",
    "## Markov Reward Process\n",
    "\n",
    "The Markov Reward Process (MRP) is an extension of the Markov chain with the\n",
    "reward function. That is, we learned that the Markov chain consists of states and a\n",
    "transition probability. The MRP consists of states, a transition probability, and also a\n",
    "reward function.\n",
    "\n",
    "A reward function tells us the reward we obtain in each state. For instance, based on\n",
    "our previous weather example, the reward function tells us the reward we obtain\n",
    "in the state cloudy, the reward we obtain in the state windy, and so on. The reward\n",
    "function is usually denoted by $R(s)$.\n",
    "\n",
    "Thus, the MRP consists of states $s$, a transition probability $P(s|s')$\n",
    "function $R(s)$. \n",
    "\n",
    "## Markov Decision Process\n",
    "\n",
    "The Markov Decision Process (MDP) is an extension of the MRP with actions. That\n",
    "is, we learned that the MRP consists of states, a transition probability, and a reward\n",
    "function. The MDP consists of states, a transition probability, a reward function,\n",
    "and also actions. We learned that the Markov property states that the next state is\n",
    "dependent only on the current state and is not based on the previous state. Is the\n",
    "Markov property applicable to the RL setting? Yes! In the RL environment, the agent\n",
    "makes decisions only based on the current state and not based on the past states. So,\n",
    "we can model an RL environment as an MDP.\n",
    "\n",
    "Let's understand this with an example. Given any environment, we can formulate\n",
    "the environment using an MDP. For instance, let's consider the same grid world\n",
    "environment we learned earlier. The following figure shows the grid world environment,\n",
    "and the goal of the agent is to reach state I from state A without visiting the shaded\n",
    "state\n",
    "\n",
    "\n",
    "![title](Images/11.png)\n",
    "\n",
    "An agent makes a decision (action) in the environment only based on the current\n",
    "state the agent is in and not based on the past state. So, we can formulate our\n",
    "environment as an MDP. We learned that the MDP consists of states, actions,\n",
    "transition probabilities, and a reward function. Now, let's learn how this relates to\n",
    "our RL environment:\n",
    "\n",
    "__States__ – A set of states present in the environment. Thus, in the grid world\n",
    "environment, we have states A to I.\n",
    "\n",
    "__Actions__ – A set of actions that our agent can perform in each state. An agent\n",
    "performs an action and moves from one state to another. Thus, in the grid world\n",
    "environment, the set of actions is up, down, left, and right.\n",
    "\n",
    "__Transition probability__ – The transition probability is denoted by $ P(s'|s,a) $. It\n",
    "implies the probability of moving from a state $s$ to the next state $s'$ while performing\n",
    "an action $a$. If you observe, in the MRP, the transition probability is just $ P(s'|s,a) $ that\n",
    "is, the probability of going from state $s$ to state $s'$ and it doesn't include actions. But in MDP we include the actions, thus the transition probability is denoted by $ P(s'|s,a) $. \n",
    "\n",
    "For example, in our grid world environment, say, the transition probability of moving from state A to state B while performing an action right is 100% then it can be expressed as: $P( B |A , \\text{right}) = 1.0 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/12.png)\n",
    "\n",
    "Suppose, our agent is in state C and the transition probability of moving from state C to the state F while performing an action down is 90% then it can be expressed as: $P( F |C , \\text{down}) = 0.9 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/13.png)\n",
    "\n",
    "__Reward function__ -  The reward function is denoted by $R(s,a,s') $. It implies the reward our agent obtains while transitioning from a state $s$ to the state $s'$ while performing an action $a$. \n",
    "\n",
    "Say, the reward we obtain while transitioning from the state A to the state B while performing an action right is -1, then it can be expressed as $R(A, \\text{right}, B) = -1 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/14.png)\n",
    "\n",
    "Suppose, our agent is in state C and say, the reward we obtain while transitioning from the state C to the state F while performing an action down is  +1, then it can be expressed as $R(C, \\text{down}, F) = +1 $. We can also view this in the state diagram as shown below:\n",
    "\n",
    "\n",
    "![title](Images/15.png)\n",
    "\n",
    "\n",
    "Thus, an RL environment can be represented as an MDP with states, actions,\n",
    "transition probability, and the reward function. But wait! What is the use of\n",
    "representing the RL environment using the MDP? We can solve the RL problem easily\n",
    "once we model our environment as the MDP. For instance, once we model our grid\n",
    "world environment using the MDP, then we can easily find how to reach the goal\n",
    "state I from state A without visiting the shaded states. We will learn more about this\n",
    "in the upcoming chapters. Next, we will go through more essential concepts of RL.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.07. Action space, Policy, Episode and Horizon.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Action space, Policy, Episode, Horizon\n",
    "\n",
    "In this section, we will learn about the several important fundamental concepts that are involved in reinforcement learning. \n",
    "\n",
    "## Action space\n",
    "Consider the grid world environment shown below:\n",
    "\n",
    "![title](Images/16.png)\n",
    "\n",
    "\n",
    "In the preceding grid world environment, the goal of the agent is to reach state I\n",
    "starting from state A without visiting the shaded states. In each of the states, the\n",
    "agent can perform any of the four actions—up, down, left, and right—to achieve the\n",
    "goal. The set of all possible actions in the environment is called the action space.\n",
    "Thus, for this grid world environment, the action space will be [up, down, left, right].\n",
    "We can categorize action spaces into two types:\n",
    "\n",
    "* Discrete action space \n",
    "* Continuous action space\n",
    "\n",
    "__Discrete action space__ -When our action space consists of actions that are discrete,\n",
    "then it is called a discrete action space. For instance, in the grid world environment,\n",
    "our action space consists of four discrete actions, which are up, down, left, right, and\n",
    "so it is called a discrete action space.\n",
    "\n",
    "__Continuous action space__ - When our action space consists of actions that are\n",
    "continuous, then it is called a continuous action space. For instance, let's suppose\n",
    "we are training an agent to drive a car, then our action space will consist of several\n",
    "actions that have continuous values, such as the speed at which we need to drive the\n",
    "car, the number of degrees we need to rotate the wheel, and so on. In cases where\n",
    "our action space consists of actions that are continuous, it is called a continuous\n",
    "action space.\n",
    "\n",
    "## Policy\n",
    "\n",
    "A policy defines the agent's behavior in an environment. The policy tells the agent\n",
    "what action to perform in each state. For instance, in the grid world environment, we\n",
    "have states A to I and four possible actions. The policy may tell the agent to move\n",
    "down in state A, move right in state D, and so on.\n",
    "\n",
    "To interact with the environment for the first time, we initialize a random policy, that\n",
    "is, the random policy tells the agent to perform a random action in each state. Thus,\n",
    "in an initial iteration, the agent performs a random action in each state and tries to\n",
    "learn whether the action is good or bad based on the reward it obtains. Over a series\n",
    "of iterations, an agent will learn to perform good actions in each state, which gives a\n",
    "positive reward. Thus, we can say that over a series of iterations, the agent will learn\n",
    "a good policy that gives a positive reward.\n",
    "\n",
    "The optimal policy is shown in the following figure. As we can observe, the agent selects the\n",
    "action in each state based on the optimal policy and reaches the terminal state I from\n",
    "the starting state A without visiting the shaded states:\n",
    "\n",
    "![title](Images/17.png)\n",
    "\n",
    "Thus, the optimal policy tells the agent to perform the correct action in each state so\n",
    "that the agent can receive a good reward.\n",
    "\n",
    "A policy can be classified into two:\n",
    "\n",
    "* Deterministic Policy\n",
    "* Stochastic Policy\n",
    "\n",
    "### Deterministic Policy\n",
    "The policy which we just learned above is called deterministic policy. That is, deterministic policy tells the agent to perform a one particular action in a state. Thus, the deterministic policy maps the state to one particular action and is often denoted by $\\mu$. Given a state $s$ at a time $t$, a deterministic policy tells the agent to perform a one particular action $a$. It can be expressed as:\n",
    "\n",
    "$$a_t = \\mu(s_t) $$\n",
    "\n",
    "For instance, consider our grid world example, given a state A, the deterministic policy $\\mu$ tells the agent to perform an action down and it can be expressed as:\n",
    "\n",
    "$$\\mu (A) = \\text{Down} $$\n",
    "\n",
    "Thus, according to the deterministic policy, whenever the agent visits state A, it performs the action down. \n",
    "\n",
    "### Stochastic Policy\n",
    "\n",
    "Unlike deterministic policy, the stochastic policy does not map the state directly to one particular action, instead, it maps the state to a probability distribution over an action space. \n",
    "\n",
    "That is, we learned that given a state, the deterministic policy will tell the agent to perform one particular action in the given state, so, whenever the agent visits the state it always performs the same particular action. But with stochastic policy, given a state, the stochastic policy will return a probability distribution over an action space so instead of performing the same action every time the agent visits the state, the agent performs different actions each time based on a probability distribution returned by the stochastic policy. \n",
    "\n",
    "Let's understand this with an example, we know that our grid world environment's action space consists of 4 actions which are [up, down, left, right]. Given a state A, the stochastic policy returns the probability distribution over the action space as [0.10,0.70,0.10,0.10]. Now, whenever the agent visits the state A, instead of selecting the same particular action every time, the agent selects the action up 10% of the time, action down 70% of the time, action left 10% of time and action right 10% of the time. \n",
    "\n",
    "The difference between the deterministic policy and stochastic policy is shown below, as we can observe the deterministic policy maps the state to one particular action whereas the stochastic policy maps the state to the probability distribution over an action space:\n",
    "\n",
    "![title](Images/18.png)\n",
    "\n",
    "Thus, stochastic policy maps the state to a probability distribution over action space and it is often denoted by $\\pi$.  Say, we have a state $s$ and action $a$ at a time $t$, then we can express the stochastic policy as:\n",
    "\n",
    "\n",
    "$$a_t \\sim \\pi(s_t) $$\n",
    "\n",
    "Or it can also be expressed as $\\pi(a_t |s_t) $. \n",
    "\n",
    "We can categorize the stochastic policy into two:\n",
    "\n",
    "* Categorical policy\n",
    "* Gaussian policy\n",
    "\n",
    "### Categorical policy \n",
    "A stochastic policy is called a categorical policy when the action space is discrete. That is, the stochastic policy uses categorical probability distribution over action space to select actions when the action space is discrete. For instance, in the grid world environment, we have just seen above, we select actions based on categorical probability distribution (discrete distribution) as the action space of the environment is discrete. As shown below, given a state A, we select an action based on the categorical probability distribution over the action space:\n",
    "\n",
    "\n",
    "\n",
    "![title](Images/19.png)\n",
    "### Gaussian policy \n",
    "A stochastic policy is called a gaussian policy when our action space is continuous. That is, the stochastic policy uses Gaussian probability distribution over action space to select actions when the action space is continuous. Let's understand this with a small example. Suppose we training an agent to drive a car and say we have one continuous action in our action space. Let the action be the speed of the car and the value of the speed of the car ranges from 0 to 150 kmph. Then, the stochastic policy uses the Gaussian distribution over the action space to select action as shown below:\n",
    "\n",
    "![title](Images/20.png)\n",
    "\n",
    "\n",
    "We will learn more about the gaussian policy in the upcoming chapters.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Episode \n",
    "\n",
    "The agent interacts with the environment by performing some action starting from the initial state and reach the final state. This agent-environment interaction starting from the initial state until the final state is called an episode. For instance, in the car racing video game, the agent plays the game by starting from the initial state (starting point of the race) and reach the final state (endpoint of the race). This is considered an episode. An episode is also often called trajectory (path taken by the agent) and it is denoted by $\\tau$. \n",
    "\n",
    "An agent can play the game for any number of episodes and each episode is independent of each other. What is the use of playing the game for multiple numbers of episodes? In order to learn the optimal policy, that is, the policy which tells the agent to perform correct action in each state, the agent plays the game for many episodes. \n",
    "\n",
    "For example, let's say we are playing a car racing game, for the first time, we may not win the game and we play the game several times to understand more about the game and discover some good strategies for winning the game. Similarly, in the first episode, the agent may not win the game and it plays the game for several episodes to understand more about the game environment and good strategies to win the game. \n",
    "\n",
    "\n",
    "\n",
    "Say, we begin the game from an initial state at a time step t=0 and reach the final state at a time step T then the episode information consists of the agent environment interaction such as state, action, and reward starting from the initial state till the final state, that is, $(s_0, a_0,r_0,s_1,a_1,r_1,\\dots,s_T) $\n",
    "\n",
    "An episode (or) trajectory is shown below:\n",
    "\n",
    "![title](Images/21.png)\n",
    "\n",
    "\n",
    "Let's strengthen our understanding of the episode and optimal policy with the grid world environment. We learned that in the grid world environment, the goal of our agent is to reach the final state I starting from the initial state A without visiting the shaded states. An agent receives +1 reward when it visits the unshaded states and -1 reward when it visits the shaded states.\n",
    "\n",
    "When we say, generate an episode it means going from initial state to the final state. The agent generates the first episode using a random policy and explores the environment and over several episodes, it will learn the optimal policy. \n",
    "\n",
    "### Episode 1:\n",
    "\n",
    "As shown below, in the first episode, the agent uses random policy and selects random action in each state starting from the initial state until the final state and observe the reward:\n",
    "\n",
    "\n",
    "![title](Images/22.png)\n",
    "\n",
    "\n",
    "### Episode 2:\n",
    "\n",
    "In the second episode, the agent tries a different policy to avoid negative rewards which it had received in the previous episode. For instance, as we can observe in the previous episode, the agent selected an action right in the state A and received a negative reward, so in this episode, instead of selecting action right in the state A, it tries a different action say, down as shown below:\n",
    "\n",
    "\n",
    "![title](Images/23.png)\n",
    "\n",
    "### Episode n:\n",
    "\n",
    "Thus, over a series of the episodes, the agent learns the optimal policy, that is, the policy which takes the agent to the final state I from the state A without visiting the shaded states as shown below:\n",
    "\n",
    "\n",
    "![title](Images/24.png)\n",
    "\n",
    "# Episodic and Continuous tasks \n",
    "A reinforcement learning task can be categorized into two:\n",
    "* Episodic task\n",
    "* Continuous task\n",
    "\n",
    "__Episodic task__ - As the name suggests episodic task is the one that has the terminal state. That is, episodic tasks are basically tasks made up of episodes and thus they have a terminal state. Example: Car racing game. \n",
    "\n",
    "__Continuous task__ - Unlike episodic tasks, continuous tasks do not contain any episodes and so they don't have any terminal state. For example, a personal assistance robot does not have a terminal state. \n",
    "\n",
    "\n",
    "# Horizon\n",
    "Horizon is the time step until which the agent interacts with the environment. We can classify the horizon into two:\n",
    "\n",
    "* Finite horizon\n",
    "* Infinite horizon\n",
    "\n",
    "__Finite horizon__ - If the agent environment interaction stops at a particular time step then it is called finite Horizon. For instance, in the episodic tasks agent interacts with the environment starting from the initial state at time step  t =0 and reach the final state at a time step T.  Since the agent environment interaction stops at the time step T, it is considered a finite horizon. \n",
    "\n",
    "__Infinite horizon__ - If the agent environment interaction never stops then it is called an infinite horizon. For instance, we learned that the continuous task does not have any terminal states, so the agent environment interaction will never stop in the continuous task and so it is considered an infinite horizon. \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.08.  Return, Discount Factor and Math Essentials.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Return, Discount Factor and Math Essentials"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Return and discount factor\n",
    "A return can be defined as the sum of the rewards obtained by the agent in an episode. The return is often denoted by  $R$ or $G$. Say, the agent starts from the initial state at time step  $t=0$ and reaches the final state at a time step $T$, then the return obtained by the agent is given as:\n",
    "\n",
    "$$\\begin{aligned}R(\\tau) &= r_0 + r_1+r_2+\\dots+r_T \\\\ &\\\\\n",
    "R(\\tau) &= \\sum_{t=0}^{T-1} r_t \\end{aligned} $$\n",
    "\n",
    "Let's understand this with an example, consider the below trajectory $\\tau$:\n",
    "\n",
    "![title](Images/25.PNG)\n",
    "\n",
    "\n",
    "The return of the trajectory is the sum of the rewards, that is, $R(\\tau) = 2+ 2+1 +2 = 7$\n",
    "\n",
    "Thus, we can say that the goal of our agent is to maximize the return, that is, maximize the sum of rewards(cumulative rewards) obtained over the episode. How can we maximize the return? We can maximize the return if we perform correct action in each state. Okay, how can we perform correct action in each state? We can perform correct action in each state using the optimal policy. Thus, we can maximize the return using the optimal policy. \n",
    "\n",
    "Thus, we can redefine the optimal policy as the policy which gets our agent the maximum return (sum of rewards) by performing correct action in each state. \n",
    "\n",
    "Okay, how can we define the return for continuous tasks? We learned that in continuous tasks there are no terminal states, so we can define the return as a sum of rewards up to infinity:\n",
    "$$R(\\tau) = r_0 + r_1+r_2+\\dots+r_{\\infty} $$\n",
    "\n",
    "\n",
    "But how can we maximize the return which just sums to infinity? So, we introduce a notation of discount factor $\\gamma$ and rewrite our return as:\n",
    "\n",
    "$$\\begin{aligned}R(\\tau) &= \\gamma^0 r_0 + \\gamma^1 r_1+ \\gamma^2 r_2+\\dots+ \\gamma^nr_{\\infty} \\\\&\\\\R(\\tau) & = \\sum_{t=0}^{\\infty} \\gamma^t r_t \\end{aligned} $$\n",
    "\n",
    "Okay, but the question is how this discount factor  $\\gamma$ is going to help us?  It helps us in preventing the return reaching up to infinity by deciding how much importance we give to future rewards and immediate rewards. The value of the discount factor ranges from 0 to 1. When we set the discount factor to a small value (close to 0) then it implies that we give more importance to immediate reward than the future rewards and when we set the discount factor to a high value (close to 1) then it implies that we give more importance to future rewards than the immediate reward. Let us understand this with an example with different values of discount factor:\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Small discount factor\n",
    "\n",
    "Let's set the discount factor to a small value, say 0.2, that is, let's set $\\gamma = 0.2$, then we can write:\n",
    "$$ \\begin{aligned}R &= (\\gamma)^0 r_0 + (\\gamma)^1 r_1+ (\\gamma)^2 r_2 +  \\dots \\\\&\\\\&=(0.2)^0 r_0 + (0.2)^1 r_1+ (0.2)^2 r_2 +  \\dots\\\\&\\\\&= (1) r_0 + (0.2) r_1+ (0.04) r_2+ \\dots \\end{aligned} $$\n",
    "\n",
    "\n",
    "From the above equation, we can observe that the reward at each time step is weighted by a discount factor. As the time step increases the discount factor (weight) decreases and thus the importance of rewards at future time steps also decreases. That is, from the above equation, we can observe that:\n",
    "\n",
    "* At the time step 0, the reward $r_0$ is weighted by the discount factor 1\n",
    "* At the time step 1, the discount factor is heavily decreased and the reward  $r_1$ is weighted by the discount factor 0.2\n",
    "* At the time step 2, the discount factor is again decreased to 0.04 and the reward $r2$ is weighted by the discount 0.04\n",
    "\n",
    "As we can observe, the discount factor is heavily decreased for the subsequent time steps and more importance is given to the immediate reward $r_0$ than the rewards obtained at the future time steps. Thus, when we set the discount factor to a small value we give more importance to the immediate reward than the future rewards. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## High discount factor\n",
    "Let's set the discount factor to a high value, say 0.9, that is, let's set, $\\gamma = 0.9$ , then we can write:\n",
    "\n",
    "$$\\begin{aligned}R &=(\\gamma)^0 r_0 + (\\gamma)^1 r_1+ (\\gamma)^2 r_2 +  \\dots \\\\&\\\\ &=(0.9)^0 r_0 + (0.9)^1 r_1+ (0.9)^2 r_2 +  \\dots\\\\&\\\\&= (1) r_0 + (0.9) r_1+ (0.81) r_2+ \\dots  \\end{aligned} $$\n",
    "\n",
    "From the above equation, we can infer that as the time step increases the discount factor (weight) decreases, however, it is not decreasing heavily unlike the previous case since here we started off with $\\gamma=0.9$. So in this case, we can say that we give more importance to future rewards. That is, from the above equation, we can observe that:\n",
    "\n",
    "* At the time step 0, the reward  $r_0$ is weighted by the discount factor 1\n",
    "* At the time step 1, the discount factor is decreased but not heavily decreased and the reward $r_1$  is weighted by the discount factor 0.9\n",
    "* At the time step 2, the discount factor is decreased to 0.81 and the reward $r_2$ is weighted by the discount 0.81\n",
    "\n",
    "\n",
    "As we can observe the discount factor is decreased for the subsequent time steps but unlike the previous case, the discount factor is not decreased heavily. Thus, when we set the discount factor to high value we give more importance to future rewards than the immediate reward. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What happens when we set the discount factor to 0?\n",
    "\n",
    "When we set the discount factor to 0, that is $\\gamma=0$, then it implies that we consider only the immediate reward $r_0$ and not the reward obtained from the future time steps. Thus, when we set the discount factor to 0 then the agent will never learn considering only the immediate reward $r_0$ as shown below:\n",
    "\n",
    "$$\\begin{aligned}R &=(\\gamma)^0 r_0 + (\\gamma)^1 r_1+ (\\gamma)^2 r_2 + \\dots \\\\&\\\\ &=(0)^0 r_0 + (0)^1 r_1+ (0)^2 r_2 +  \\dots\\\\&\\\\& = r_0 \\end{aligned} $$\n",
    "\n",
    "As we can observe when we set $\\gamma=0$, then our return will be just the immediate reward $r_0$.\n",
    "\n",
    "\n",
    "\n",
    "## What happens when we set the discount factor to 1?\n",
    "\n",
    "When we set the discount factor to 1, that is $\\gamma=1$, then it implies that we consider all the future rewards. Thus, when we set the discount factor to 1 then the agent will learn forever looking for all  the future reward which may lead to infinity as shown below:\n",
    "\n",
    "$$\\begin{aligned}R &=(\\gamma)^0 r_0 + (\\gamma)^1 r_1+ (\\gamma)^2 r_2 + \\dots \\\\&\\\\&=(1)^0 r_0 + (1)^1 r_1+ (1)^2 r_2 +  \\dots\\\\&\\\\& = r_0 + r_1 + r_2 + \\dots \\end{aligned} $$\n",
    "\n",
    "As we can observe when we set $\\gamma=1$, then our return will be the sum of rewards up to infinity. \n",
    "\n",
    "Thus, we learned that we set discount factor to 0 then the agent will never learn considering only the immediate reward and when we set the discount factor to 1 the agent will learn forever looking for the future rewards which lead to infinity. So the optimal value of the discount factor lies between 0.2 to 0.8.\n",
    "\n",
    "But the question is why should we care about immediate and future rewards? we give importance to immediate reward and future rewards depending on the tasks. In some tasks, future rewards are more desirable than immediate reward and vice versa. In a chess game, the goal is to defeat the opponent's king. If we give more importance to the immediate reward, which is acquired by actions like our pawn defeating any opponent chessman and so on, then the agent will learn to perform this sub-goal instead of learning the actual goal. So, in this case, we give importance to future rewards than the immediate reward, whereas in some cases, we prefer immediate rewards over future rewards. Say, would you prefer chocolates if I gave you them today or 13 days later?\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Math Essentials \n",
    "\n",
    "\n",
    "Before looking into the next important concept in reinforcement learning, let's quickly recap expectation as we will be dealing with expectation throughout the book.\n",
    "\n",
    "\n",
    "## Expectation \n",
    "\n",
    "Let's say we have a variable X and it has the following values 1,2,3,4,5,6. To compute the average value of X, we can just sum all the values of X divided by the number of values of X. Thus, the average of X is (1+2+3+4+5+6)/6 = 3.5\n",
    "\n",
    "Now, let's suppose X is a random variable. The random variable takes value based on the random experiment such as throwing dice, tossing a coin and so on. The random variable takes different values with some probabilities. Let's suppose we are throwing a fair dice then the possible outcomes (X) are 1,2,3,4,5,6 and the probability of occurrence of each of these outcomes is 1/6 as shown below:\n",
    "\n",
    "\n",
    "![title](Images/26.PNG)\n",
    "\n",
    "How can we compute the average value of the random variable X? Since each value has a probability of an occurrence we can't just take the average. So, what we will do is that here we will compute the weighted average, that is, the sum of values of X multiplied by their respective probabilities and this is called expectation. The expectation of a random variable X can be defined as:\n",
    "\n",
    "$$E(X) = \\sum_{i=1}^N x_i p(x_i) $$\n",
    "\n",
    "\n",
    "\n",
    "Thus, the expectation of the random variable X is, E(X) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5 (1/6) + 6(1/6) = 3.5\n",
    "\n",
    "The expectation is also known as the expected value. Thus, the expected value of the random variable X is 3.5. Thus, when we say expectation or expected value of a random variable it basically means the weighted average.\n",
    "\n",
    "Now, we will look into the expectation of a function of a random variable. Let $f(x) = x^2$ then we can write:\n",
    "\n",
    "\n",
    "![title](Images/27.PNG)\n",
    "\n",
    "The expectation of a function of a random variable can be computed as:\n",
    "\n",
    "\n",
    "$$\\mathbb{E}_{x \\sim p(x)}[f(X)] = \\sum_{i=1}^N f(x_i) p(x_i) $$\n",
    "\n",
    "Thus the expected value of f(X) is given as E(f(X)) = 1(1/6) + 4(1/6) + 9(1/6) + 16(1/6) + 25(1/6) + 36(1/6) = 15.1\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.09 Value function and Q function.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Value function\n",
    "\n",
    "The value function also called the state value function denotes the value of the state. The value of a state is the return an agent would obtain starting from that state following the policy $\\pi$. The value of a state or value function is usually denoted by $V(s)$ and it can be expressed as: \n",
    "\n",
    "$$ V^{\\pi}(s) = [R(\\tau) | s_0 = s] $$\n",
    "\n",
    "where, $ s_0 = s $ implies that the starting state is $s_0$. The value of a state is called the state value. \n",
    "\n",
    "Let's understand the value function with an example. Let's suppose we generate the trajectory $\\tau$ following some policy $\\pi$ in our grid world environment as shown below:\n",
    "\n",
    "\n",
    "![title](Images/28.png)\n",
    "\n",
    "Now, how do we compute the value of all the states in our trajectory? We learned that the value of a state is the return(sum of reward) an agent would obtain starting from that state following the policy $\\pi$. The above trajectory is generated using the policy $\\pi$, thus we can say that the value of a state is the return(sum of rewards) of the trajectory starting from that state. \n",
    "\n",
    "* The value of the state A is the return of the trajectory starting from the state A. Thus, $V(A) = 1+1+-1+1 = 2 $\n",
    "* The value of the state D is the return of the trajectory starting from the state D. Thus, $V(D) = 1-1+1 = 1 $\n",
    "* The value of the state E is the return of the trajectory starting from the state E. Thus, $V(E) = -1+1 = 0 $\n",
    "* The value of the state H is the return of the trajectory starting from the state H. Thus, $V(H) = 1$\n",
    "\n",
    "\n",
    "What about the value of the final state I?  We learned the value of a state is return (sum of rewards) starting from that state. We know that we obtain a reward when we transition from one state to another. Since I is the final state, we don't make any transition from final state and so there will no reward and thus no value for the final state I. \n",
    "\n",
    "_In a nutshell, the value of a state is the return of the trajectory starting from that state._\n",
    "\n",
    "\n",
    "Wait! there is a small change here, instead of taking the return directly as a value of a state we will use the expected return. Thus, the value function or the value of the state $s$ can be defined as the expected return that the agent would obtain starting from the state $s$ following the policy $\\pi$ . It can be expressed as:\n",
    "\n",
    "\n",
    "$$V^{\\pi}(s) = \\underset{\\tau \\sim \\pi}{\\mathbb{E}}[R(\\tau) | s_0 = s] $$\n",
    "\n",
    "\n",
    "\n",
    "Let's understand this with a simple example. Let's suppose we have a stochastic policy $\\pi$.  We learned that, unlike deterministic policy which maps the state to action directly, stochastic policy maps the state to the probability distribution over action space. Thus, the stochastic policy selects actions based on a probability distribution.\n",
    "\n",
    "Let's suppose we are in state A and the stochastic policy returns the probability distribution over the action space as [0.0,0.80,0.00,0.20]. It implies that with the stochastic policy, in the state A, we perform action down for 80% of the time, that is, $\\pi(\\text{down} |A) = 0.80 $ and the action right for the 20 % of the time, that is $\\pi(\\text{right} |A) = 0.20 $ .\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Thus, in state A, our stochastic policy  $\\pi$ selects action down for 80% of time and action right for 20% of the time and say our stochastic policy selects action right in the state D and E and action down in the state B and F for the 100% of the time.\n",
    "\n",
    "First, we generate an episode $\\tau_1$ using our given stochastic policy $\\pi$ as shown below:\n",
    "\n",
    "\n",
    "![title](Images/29.PNG)\n",
    "\n",
    "For better understanding, let's focus only on the value of state A. The value of the state A is the return (sum of rewards) of the trajectory starting from the state A. Thus, $V(A) = R(\\tau_1) = 1+1+1+1 =4 $\n",
    "\n",
    "Say, we generate another episode $\\tau_2$ using the same given stochastic policy $\\pi$ as shown below:\n",
    "\n",
    "\n",
    "![title](Images/30.PNG)\n",
    "\n",
    "The value of the state, A is the return (sum of rewards) of the trajectory from the state A. Thus, $V(A) = R(\\tau_2) = -1+1+1+1 =2 $\n",
    "\n",
    "As you may observe, although we use the same policy, the value of state A in the trajectory $\\tau_1$ and $\\tau_2$ are different. This is because our policy is a stochastic policy and it performs an action down in state A for 80% of time and action right in state A for 20% of the time. So, when we generate a trajectory using the policy $\\pi$, the trajectory  $\\tau_1$ will occur for 80% of the time and the trajectory $\\tau_2$ will occur for 20% of the time. Thus, the return will be 4 for 80% of the time and 2 for 20% of the time. \n",
    "\n",
    "Thus, instead of taking the value of the state as a return directly we will take the expected return since the return takes different values with some probability. The expected return is basically the weighted average, that is, the sum of the return multiplied by their probability. Thus, we can write:\n",
    "\n",
    "\n",
    "$$ V^{\\pi}(s) = \\underset{\\tau \\sim \\pi}{\\mathbb{E}}[R(\\tau) | s_0 = s] $$\n",
    "\n",
    "\n",
    "\n",
    "The value of a state A can be obtained as:\n",
    "$$\\begin{align}V^{\\pi}(A) &= \\underset{\\tau \\sim \\pi}{\\mathbb{E}}[R(\\tau) | s_0 = A] \\\\ &= \\sum_i R(\\tau_i) \\pi(a_i|A) \\\\ &= R(\\tau_1) \\pi (\\text{down}|A) + R(\\tau_2) \\pi (\\text{right}|A)\\\\ &= 4(0.8) + 2(0.2) \\\\ &= 3.6 \n",
    "\\end{align} $$\n",
    "\n",
    "\n",
    "Thus, the value of a state is the expected return of the trajectory starting from that state.\n",
    "\n",
    "Note that the value function depends on the policy, that is, the value of the state varies based on the policy we choose. There can be many different value functions according to different policies. The optimal value function,  $ V^*(s) $ is the one which yields maximum value compared to all the other value functions. It can be expressed as:\n",
    "\n",
    "$$V^{*}(s) = \\max_{\\pi} V^{\\pi}(s) $$\n",
    "\n",
    "\n",
    "\n",
    "For example, let's say we have two policies $\\pi_1$ and $\\pi_2$. Let the value of a state  using the policy $\\pi_1$  be $V^{\\pi_1} (s) = 13 $ and the value of the state  using the policy $\\pi_2$ be $V^{\\pi_2} (s) = 11 $.  Then the optimal value of the state  will be  $V^*(s) = 13 $ as it is the maximum. The policy which gives the maximum state value is called optimal policy $\\pi^*$. Thus, in this case, $\\pi_1$ is the optimal policy as it gives the maximum state value. \n",
    "\n",
    "We can view the value function in a table and it is called a value table. Let us say we have two states $s_0$ and  $s_1$ then the value function can be represented as:\n",
    "\n",
    "\n",
    "![title](Images/31.PNG)\n",
    "\n",
    "From the above value table, we can tell that it is good to be in the state $s_1$ than the state $s_0$  as $s_1$ has high value. Thus, we can say that the state $s_1$ is an optimal state than the state $s_0$ . \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Q function\n",
    "\n",
    "\n",
    "A Q function also called the state-action value function denotes the value of a state-action pair. The value of a state-action pair is the return agent would obtain starting from the state  and an action  following the policy . The value of a state-action pair or Q function is usually denoted by  and it is called Q value or state-action value. It is expressed as: \n",
    "\n",
    "\n",
    "![title](Images/32.PNG)\n",
    "\n",
    "Note that the only difference between the value function and Q function is that in the value function we compute the value of a state whereas in the Q function we compute the value of a state-action pair. Let's understand the Q function with an example. Consider the below trajectory generated using some policy :\n",
    "\n",
    "\n",
    "\n",
    "We learned that Q function computes the value of a state-action pair. Say we need to compute the Q value of state action pair, A-down. That is the Q value of performing action down in the state A. Then the Q value will be just the return of our trajectory starting from the state A and performing action down as shown below:\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Let's suppose we need to compute the Q value of the state action pair, D-right. That is the Q value of performing action right in the state D. Then Q value will be just the return of our trajectory starting from the state D and performing action right as shown below:\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Similarly, we can compute the Q value for all the state-action pairs. Similar to what we learned in the value function, instead of taking the return directly as the Q value of a state-action pair we will use the expected return because the return is the random variable and it takes different values with some probability. So we can redefine our Q function as:\n",
    "\n",
    "\n",
    "\n",
    "It implies that the Q value is the expected return agent would obtain starting from the state  and action  following the policy \n",
    "\n",
    "Similar to value function, the Q function depends on the policy, that is, the Q value varies based on the policy we choose. There can be many different Q functions according to different policies. Optimal Q function is the one which has the maximum Q value over other Q functions and it can be expressed as:\n",
    "\n",
    "\n",
    "\n",
    "The optimal policy  is the policy which gives the maximum Q value. \n",
    "\n",
    "Like value function, Q function can be viewed in a table. It is called a Q table. Let us say we have two states  and  and two actions 0 and 1, then the Q function can be represented as follows:\n",
    "\n",
    "\n",
    "![title](Images/33.PNG)\n",
    "\n",
    "As we can observe, the above Q table represents the Q-value of all possible state-action pairs. We can extract the policy from the Q function by just selecting the action which has the maximum Q value in each state:\n",
    "\n",
    "\n",
    "\n",
    "Thus, our policy will select action 1 in the state  and action 0 in the state  since they have a maximum Q value as shown below:\n",
    "\n",
    "![title](Images/34.PNG)\n",
    "\n",
    "\n",
    "Thus, we can extract the policy by computing the Q function. \n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.10. Model-Based and Model-Free Learning .ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model-Based and Model-free learning\n",
    "\n",
    "**Model-based learning** - In model-based learning, an agent will have a complete description of the environment. That is, we learned that the transition probability tells the probability of moving from a state  to the next state  by performing an action  and the reward function tells the reward we would obtain while moving from a state  to the next state  by performing an action . When the agent knows the model dynamics of the environment, that is, when the agent knows the transition probability of the environment then it is called model-based learning. Thus, in model-based learning the agent uses the model dynamics for finding the optimal policy. \n",
    "\n",
    "**Model-free learning** - When the agent does not know the model dynamics of the environment then it is called model-free learning. That is, In model-free learning, an agent tries to find the optimal policy without the model dynamics. \n",
    "\n",
    "Thus, to summarize, in a model-free setting, the agent learns optimal policy without the model dynamics of the environment whereas, in a model-based setting, the agent learns the optimal policy with the model dynamics of the environment."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.11. Different Types of Environments.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Different types of environments\n",
    "\n",
    "We learned that the environment is the world of the agent and the agent lives/stays within the environment. We can categorize the environment into different types as follows:\n",
    "\n",
    "## Deterministic and Stochastic environment\n",
    "\n",
    "__Deterministic environment__ - In a deterministic environment, we can be sure that when an agent performs an action $a$ in the state $s$ then it always reaches the state $s'$ exactly. For example, let's consider our grid world environment. Say the agent is in state A when we perform action down in the state A we always reach the state D and so it is called deterministic environment:\n",
    "\n",
    "\n",
    "\n",
    "![title](Images/35.png)\n",
    "\n",
    "__Stochastic environment__ - In a stochastic environment, we cannot say that by performing some action $a$ in the state $s$ the agent always reaches the state $s'$ exactly because there will be some randomness associated with the stochastic environment. For example, let's suppose our grid world environment is a stochastic environment. Say our agent is in state A, now if we perform action down in the state A then the agent doesn't always reach the state D  instead it reaches the state D for 70% of the time and the state B for 30 % of the time. That is, if we perform action down in the state A then the agent reaches the state D with 70% probability and the state B with 30% probability as shown below:\n",
    "\n",
    "\n",
    "![title](Images/36.png)\n",
    "\n",
    "## Discrete and continuous environment \n",
    "\n",
    "__Discrete Environment__ - When the action space of the environment is discrete then our environment is called a discrete environment. For instance, in the grid world environment, we have discrete action space which includes actions such as [up, down, left, right] and thus our grid world environment is called the discrete environment. \n",
    "\n",
    "__Continuous environment__ - When the action space of the environment is continuous then our environment is called a continuous environment. For instance, suppose, we are training an agent to drive a car then our action space will be continuous with several continuous actions such as speed in which we need to drive the car, the number of degrees we need to rotate the wheel and so on. In such a case where our action space of the environment is continuous, it is called continuous environment. \n",
    "\n",
    "## Episodic and non-episodic environment \n",
    "\n",
    "__Episodic environment__ - In an episodic environment, an agent's current action will not affect future action and thus the episodic environment is also called the non-sequential environment. \n",
    "\n",
    "__Non-episodic environment__ - In a non-episodic environment, an agent's current action will affect future action and thus the non-episodic environment is also called the sequential environment. Example: The chessboard is a sequential environment since the agent's current action will affect future action in a chess game.\n",
    "\n",
    "## Single and multi-agent environment\n",
    "\n",
    "__Single-agent environment__ - When our environment consists of only a single agent then it is called a single-agent environment. \n",
    "\n",
    "__Multi-agent environment__ - When our environment consists of multiple agents then it is called a multi-agent environment. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.12. Applications of Reinforcement Learning.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Applications of Reinforcement Learning\n",
    "\n",
    "Reinforcement learning has evolved rapidly over the past couple of years with a wide range of applications ranging from playing games to self-driving cars. One of the major reasons for this evolution is due to deep reinforcement learning (DRL) which is a combination of reinforcement learning and deep learning. We will learn about the various state of the art deep reinforcement learning algorithms in the upcoming chapters so be excited. In this section, we will look into some of the real-life applications of reinforcement learning.\n",
    "\n",
    "__Manufacturing__ - In manufacturing, intelligent robots are trained using reinforcement learning to place objects in the right position. The use of intelligent robots reduces labor costs and increases productivity. \n",
    "\n",
    "Dynamic Pricing - One of the popular applications of reinforcement learning includes dynamic pricing. Dynamic pricing implies that we change the price of the products based on demand and supply. We can train the RL agent for the dynamic pricing of the products with the goal of maximizing the revenue.\n",
    "\n",
    "__Inventory management__ - Reinforcement learning is extensively used in inventory management which is a crucial business activity. Some of these activities include supply chain management, demand forecasting, and handling several warehouse operations (such as placing products in warehouses for managing space efficiently).\n",
    "\n",
    "__Recommendation System__ - Reinforcement learning is widely used in building a recommendation system where the behavior of the user constantly changes. For instance, in the music recommendation system, the behavior or the music preference of the user changes from time to time. So in those cases using an RL agent can be very useful as the agent constantly learn by interacting with the environment. \n",
    "\n",
    "__Neural Architecture search__ - In order for the neural networks to perform a given task with good accuracy, the architecture of the network is very important and it has to properly designed. With reinforcement learning, we can automate the process of complex neural architecture search by training the agent to find the best neural architecture for a given task with the goal of maximizing the accuracy.\n",
    "\n",
    "__Natural Language Processing__ - With the increase in popularity of the deep reinforcement algorithms, RL has been widely used in several NLP tasks such as abstractive text summarization, chatbots and more.\n",
    "\n",
    "__Finance__ - Reinforcement learning is widely used in financial portfolio management which is the process of constant redistribution of a fund into different financial products. RL is also used in predicting and trading in commercial transaction markets. JP Morgan has successfully used RL to provide better trade execution results for large orders."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 01. Fundamentals of Reinforcement Learning/1.13. Reinforcement Learning Glossary.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reinforcement Learning Glossary \n",
    "\n",
    "We have learned several important and fundamental concepts of Reinforcement learning. In this section, we will revisit the several important terms and terminologies that are very useful for understanding upcoming chapters. \n",
    "\n",
    "__Agent__ - Agent is the software program that learns to make intelligent decisions. Example: A software program that plays the chess game intelligently. \n",
    "\n",
    "__Environment__ - The environment is the world of the agent. A chessboard is called the environment when the agent plays the chess game.\n",
    "\n",
    "__State__ -  A state is a position or a moment in the environment where the agent can be in. Example: All the positions in the chessboard are called the states. \n",
    "\n",
    "__Action__ - The agent interacts with the environment by performing an action and move from one state to another. Example: Moves made by the chessman can be considered an action. \n",
    "\n",
    "__Reward__ -  A reward is a numerical value that the agent receives based on its action. Consider reward as a point. For instance, an agent receives +1 point (reward) for good action and -1 point (reward) for a bad action. \n",
    "\n",
    "__Action space__ - A set of all possible actions in the environment is called action space. The action space is called a discrete action space when our action space consists of discrete actions and the action space is called continuous action space when our actions space consists of continuous actions.\n",
    "\n",
    "__Policy__  - The agent makes a decision based on the policy. A policy tells the agent what action to perform in each state. It can be considered as the brain of an agent. A policy is called deterministic policy if it exactly maps a state to a particular action. Unlike deterministic policy, stochastic policy maps the state to a probability distribution over the action space. An optimal policy is the one that gives the maximum reward. \n",
    "\n",
    "__Episode__ - Agent environment interaction starting from initial state to terminal state is called the episode. An episode is often called the trajectory or rollout. \n",
    "\n",
    "__Episodic and continuous task__ - A reinforcement learning task is called episodic task if it has the terminal state and it is called a continuous task if it does not has a terminal state. \n",
    "\n",
    "__Horizon__ - Horizon can be considered as an agent's life span, that is, time step until which the agent interacts with the environment. Horizon is called finite horizon if the agent environment interaction stops at a particular time step and it is called an infinite horizon when the agent environment interaction continues forever. \n",
    "\n",
    "__Return__ - Return is the sum of rewards received by the agent in an episode.\n",
    "\n",
    "__Discount factor__ - Discount factor helps to control whether we want to give importance to the immediate reward or future reward. The value of the discount factor ranges from 0 to 1. A discount factor close to 0 implies that we give more importance to immediate reward while a discount factor close to 1 implies that we give more importance to future reward than the immediate reward.\n",
    "\n",
    "__Value function__ - Value function or the value of the state is the expected return that the agent would get starting from the state $s$ following the policy $\\pi#. \n",
    "\n",
    "__Q function__ - Q function or the value of a state-action pair implies the expected return agent would obtain starting from the state $s$ and an action $a$ following the policy $\\pi$. \n",
    "\n",
    "__Model-based and Model-free learning__ - When the agent tries to learn the optimal policy with the model dynamics then it is called model-based learning and when the agent tries to learn the optimal policy without the model dynamics then it is called model-free learning. \n",
    "\n",
    "__Deterministic and Stochastic environment__ - When an agent performs an action $a$ in the state $s$ and if it reaches the state  exactly every time, then the environment is called a deterministic environment. When an agent performs an action $a$ in the state $s$ and if it reaches different states every time based on some probability distribution then the environment is called stochastic environment. \n",
    "\n",
    "Thus, in this chapter, we have learned several important and fundamental concepts of reinforcement learning. In the next chapter, we will begin our Hands-on reinforcement learning journey by implementing all the fundamental concepts we have learned in this chapter using the popular toolkit called the gym. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 02. A Guide to the Gym Toolkit/2.02.  Creating our First Gym Environment.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creating our first gym environment\n",
    "\n",
    "We learned that the gym provides a variety of environments for training the reinforcement learning agent. To clearly understand how the gym environment is designed, we will start off with the basic gym environment. Going forward, we will understand all other complex gym environments. \n",
    "\n",
    "Let's introduce one of the simplest environments called the frozen lake environment. The frozen lake environment is shown below. As we can observe, in the frozen lake environment, the goal of the agent is to start from the initial state S and reach the goal state G.\n",
    "\n",
    "![title](Images/1.png)\n",
    "\n",
    "In the above environment, the following applies:\n",
    "\n",
    "* S denotes the starting state\n",
    "* F denotes the frozen state\n",
    "* H denotes the hole state\n",
    "* G denotes the goal state\n",
    "\n",
    "So, the agent has to start from the state S and reach the goal state G. But one issue is that if the agent visits the state H, which is just the hole state, then the agent will fall into the hole and die as shown below:\n",
    "\n",
    "\n",
    "![title](Images/2.png)\n",
    "\n",
    "So, we need to make sure that the agent starts from S and reaches G without falling into the hole state H as shown below:\n",
    "\n",
    "\n",
    "![title](Images/3.png)\n",
    "Each grid box in the above environment is called state, thus we have 16 states (S to G) and we have 4 possible actions which are up, down, left and right. We learned that our goal is to reach the state G from S without visiting H. So, we assign reward as 0 to all the states and + 1 for the goal state G. \n",
    "\n",
    "Thus, we learned how the frozen lake environment works. Now, to train our agent in the frozen lake environment, first, we need to create the environment by coding it from scratch in Python. But luckily we don't have to do that! Since the gym provides the various environment, we can directly import the gym toolkit and create a frozen lake environment using the gym.\n",
    "\n",
    "\n",
    "Now, we will learn how to create our frozen lake environment using the gym. Before running any code, make sure that you activated our virtual environment universe. First, let's import the gym library:\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Next, we can create a gym environment using the make function.  The make function requires the environment id as a parameter. In the gym, the id of the frozen lake environment is `FrozenLake-v0`. So, we can create our frozen lake environment as shown below:\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make(\"FrozenLake-v0\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After creating the environment, we can see how our environment looks like using the render function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[41mS\u001b[0mFFF\n",
      "FHFH\n",
      "FFFH\n",
      "HFFG\n"
     ]
    }
   ],
   "source": [
    "env.render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "As we can observe, the frozen lake environment consists of 16 states (S to G) as we learned. The state S is highlighted indicating that it is our current state, that is, agent is in the state S. So whenever we create an environment, an agent will always begin from the initial state, in our case, it is the state S. \n",
    "\n",
    "That's it! Creating the environment using the gym is that simple. In the next section, we will understand more about the gym environment by relating all the concepts we have learned in the previous chapter. \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Exploring the environment\n",
    "\n",
    "In the previous chapter, we learned that the reinforcement learning environment can be modeled as the Markov decision process (MDP) and an MDP consists of the following: \n",
    "\n",
    "* __States__ -  A set of states present in the environment \n",
    "* __Actions__ - A set of actions that the agent can perform in each state. \n",
    "* __Transition probability__ - The transition probability is denoted by $P(s'|s,a) $. It implies the probability of moving from a state $s$ to the state $s'$ while performing an action $a$.\n",
    "* __Reward function__ - Reward function is denoted by $R(s,a,s')$. It implies the reward the agent obtains moving from a state $s$ to the state  $s'$ while performing an action $a$.\n",
    "\n",
    "Let's now understand how to obtain all the above information from the frozen lake environment we just created using the gym.\n",
    "\n",
    "\n",
    "\n",
    "## States\n",
    "A state space consists of all of our states. We can obtain the number of states in our environment by just typing `env.observation_space` as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Discrete(16)\n"
     ]
    }
   ],
   "source": [
    "print(env.observation_space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It implies that we have 16 discrete states in our state space starting from the state S to G. Note that, in the gym, the states will be encoded as a number, so the state S will be encoded as 0, state F will be encoded as 1 and so on as shown below:\n",
    "\n",
    "\n",
    "![title](Images/5.png)"
   ]
  },
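  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The encoding can be reproduced without the figure. The sketch below assumes the states are numbered row by row, left to right, over the 4 x 4 grid printed by `env.render()` above. `grid` and `state_to_cell` are hypothetical names introduced for illustration and are not part of the gym API.\n",
    "\n",
    "```python\n",
    "# The 4 x 4 map as printed by env.render(), one string per row.\n",
    "grid = ['SFFF', 'FHFH', 'FFFH', 'HFFG']\n",
    "n_cols = len(grid[0])\n",
    "\n",
    "def state_to_cell(state):\n",
    "    # Convert a state number into (row, column, letter), assuming row-major numbering.\n",
    "    row, col = divmod(state, n_cols)\n",
    "    return row, col, grid[row][col]\n",
    "\n",
    "for state in (0, 1, 7, 15):\n",
    "    print(state, state_to_cell(state))\n",
    "```\n",
    "\n",
    "Under this assumption, state 0 is S, state 7 is a hole (H) and state 15 is the goal G."
   ]
  },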
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Actions\n",
    "\n",
    "We learned that the action space consists of all the possible actions in the environment. We can obtain the action space by `env.action_space` as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Discrete(4)\n"
     ]
    }
   ],
   "source": [
    "print(env.action_space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It implies that we have 4 discrete actions in our action space which are left, down, right, up. Note that, similar to states, actions also will be encoded into numbers as shown below:\n",
    "\n",
    "\n",
    "![title](Images/6.PNG)"
   ]
  },
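  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To avoid typing bare numbers, the action codes can be wrapped in a small lookup. This is a sketch only: the numbering follows the order listed above (left, down, right, up), and the `Action` enum is a hypothetical helper, not something the gym provides.\n",
    "\n",
    "```python\n",
    "from enum import IntEnum\n",
    "\n",
    "class Action(IntEnum):\n",
    "    # Action codes in the order listed above.\n",
    "    LEFT = 0\n",
    "    DOWN = 1\n",
    "    RIGHT = 2\n",
    "    UP = 3\n",
    "\n",
    "# Convert a name to its code and a code back to its name.\n",
    "print(int(Action.RIGHT))\n",
    "print(Action(1).name)\n",
    "```\n",
    "\n",
    "Calling `int(...)` on a member gives the plain integer code, which can be passed wherever an action number is expected."
   ]
  },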
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transition probability and Reward function\n",
    "\n",
    "Now, let's look at how to obtain the transition probability and the reward function. We learned that in the stochastic environment, we cannot say that by performing some action $a$, agent will always reach the next state $s'$ exactly because there will be some randomness associated with the stochastic environment and by performing an action $a$ in the state $s$, agent reaches the next state  with some probability.\n",
    "\n",
    "Let's suppose we are in state 2 (F). Now if we perform action 1 (down) in state 2, we can reach the state 6 as shown below:\n",
    "\n",
    "\n",
    "![title](Images/7.png)\n",
    "\n",
    "Our frozen lake environment is a stochastic environment. When our environment is stochastic we won't always reach the state 6 by performing action 1(down) in state 2, we also reach other states with some probability. So when we perform an action 1 (down) in the state 2, we reach state 1 with probability 0.33333, we reach state 6 with probability 0.33333 and we reach the state 3 with probability 0.33333 as shown below:\n",
    "\n",
    "\n",
    "![title](Images/8.png)\n",
    "\n",
    "\n",
    "As we can notice, in the stochastic environment we reach the next states with some probability. Now, let's learn how to obtain this transition probability using the gym environment.  \n",
    "\n",
    "We can obtain the transition probability and the reward function by just typing `env.P[state][action]` So, in order to obtain the transition probability of moving from the state S to the other states by performing an action right, we can type, `env.P[S][right]`. But we cannot just type state S and action right directly since they are encoded into numbers. We learned that state S is encoded as 0 and the action right is encoded as 2, so, in order to obtain the transition probability of state S by performing an action right, we type `env.P[0][2]` as shown below:\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False)]\n"
     ]
    }
   ],
   "source": [
    "print(env.P[0][2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What does this imply? Our output is in the form of `[(transition probability, next state, reward, Is terminal state?)]` It implies that if we perform an action 2 (right) in state 0 (S) then:\n",
    "\n",
    "* We reach the state 4 (F) with probability 0.33333 and receive 0 reward. \n",
    "* We reach the state 1 (F) with probability 0.33333 and receive 0 reward.\n",
    "* We reach the same state 0 (S) with probability 0.33333 and receive 0 reward.\n",
    "\n",
    "The transition probability is shown below:\n",
    "\n",
    "\n",
    "\n",
    "![title](Images/9.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus, when we type `env.P[state][action]` we get the result in the form of `[(transition probability, next state, reward, Is terminal state?)]`. The last value is the boolean and it implies that whether the next state is a terminal state, since 4, 1 and 0 are not the terminal states it is given as false. \n",
    "\n",
    "The output of `env.P[0][2]` is shown in the below table for more clarity:\n",
    "\n",
    "\n",
    "![title](Images/10.PNG)"
   ]
  },
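  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tuples returned by `env.P[state][action]` can be unpacked directly. The sketch below does not call the gym: it copies the output of `env.P[0][2]` shown above into a list, with the probability written as `1 / 3`, and the variable names are hypothetical.\n",
    "\n",
    "```python\n",
    "# Output of env.P[0][2] as printed above: (probability, next state, reward, is terminal?)\n",
    "transitions = [(1 / 3, 4, 0.0, False), (1 / 3, 1, 0.0, False), (1 / 3, 0, 0.0, False)]\n",
    "\n",
    "total_probability = 0.0\n",
    "expected_reward = 0.0\n",
    "for probability, next_state, reward, is_terminal in transitions:\n",
    "    print('next state', next_state, 'probability', round(probability, 5), 'terminal', is_terminal)\n",
    "    total_probability += probability\n",
    "    expected_reward += probability * reward\n",
    "\n",
    "# The probabilities of all outcomes of one (state, action) pair add up to 1.\n",
    "print(round(total_probability, 5), expected_reward)\n",
    "```\n",
    "\n",
    "Since every outcome here carries a reward of 0, the expected immediate reward for this state-action pair is also 0."
   ]
  },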
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's understand this with one more example. Let's suppose we are in the state 3 (F) as shown below:\n",
    "\n",
    "\n",
    "![title](Images/11.png)\n",
    "\n",
    "Say, we perform action 1 (down) in the state 3(F). Then the transition probability of the state 3(F) by performing action 1(down) can be obtained as shown below:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0.3333333333333333, 2, 0.0, False), (0.3333333333333333, 7, 0.0, True), (0.3333333333333333, 3, 0.0, False)]\n"
     ]
    }
   ],
   "source": [
    "print(env.P[3][1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we learned, our output is in the form of `[(transition probability, next state, reward, Is terminal state?)]` It implies that if we perform an action 1 (down) in state 3 (F) then:\n",
    "\n",
    "* We reach the state 2 (F) with probability 0.33333 and receive 0 reward. \n",
    "* We reach the state 7 (H) with probability 0.33333 and receive 0 reward.\n",
    "* We reach the same state 3 (F) with probability 0.33333 and receive 0 reward.\n",
    "\n",
    "\n",
    "The transition probability is shown below:\n",
    "\n",
    "\n",
    "\n",
    "![title](Images/12.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The output of `env.P[3][1]` is shown in the below table for more clarity:\n",
    "\n",
    "\n",
    "![title](Images/13.PNG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, in the second row of our output, we have, `(0.33333, 7, 0.0, True)`,and the last value here is marked as True. It implies that state 7 is a terminal state. That is, if we perform action 1(down) in state 3(F) then we reach the state 7(H) with 0.33333 probability and since 7(H) is a hole, the agent dies if it reaches the state 7(H). Thus 7(H) is a terminal state and so it is marked as True. \n",
    "\n",
    "Thus, we learned how to obtain the state space, action space, transition probability and the reward function using the gym environment. In the next section, we will learn how to generate an episode. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 02. A Guide to the Gym Toolkit/2.05. Cart Pole Balancing with Random Policy.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cart Pole Balancing with Random Policy\n",
    "\n",
    "Let's create an agent with the random policy, that is, we create the agent that selects the random action in the environment and tries to balance the pole. The agent receives +1 reward every time the pole stands straight up on the cart. We will generate over 100 episodes and we will see the return (sum of rewards) obtained over each episode. Let's learn this step by step.\n",
    "\n",
    "First, create our cart pole environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "env = gym.make('CartPole-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Set the number of episodes and number of time steps in the episode:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_episodes = 100\n",
    "num_timesteps = 50"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Episode: 0, Return: 23.0\n",
      "Episode: 10, Return: 12.0\n",
      "Episode: 20, Return: 23.0\n",
      "Episode: 30, Return: 15.0\n",
      "Episode: 40, Return: 19.0\n",
      "Episode: 50, Return: 10.0\n",
      "Episode: 60, Return: 16.0\n",
      "Episode: 70, Return: 10.0\n",
      "Episode: 80, Return: 22.0\n",
      "Episode: 90, Return: 38.0\n"
     ]
    }
   ],
   "source": [
    "#for each episode\n",
    "for i in range(num_episodes):\n",
    "    \n",
    "    #set the Return to 0\n",
    "    Return = 0\n",
    "    #initialize the state by resetting the environment\n",
    "    state = env.reset()\n",
    "    \n",
    "    #for each step in the episode\n",
    "    for t in range(num_timesteps):\n",
    "        #render the environment\n",
    "        env.render()\n",
    "        \n",
    "        #randomly select an action by sampling from the environment\n",
    "        random_action = env.action_space.sample()\n",
    "        \n",
    "        #perform the randomly selected action\n",
    "        next_state, reward, done, info = env.step(random_action)\n",
    "\n",
    "        #update the return\n",
    "        Return = Return + reward\n",
    "\n",
    "        #if the next state is a terminal state then end the episode\n",
    "        if done:\n",
    "            break\n",
    "    #for every 10 episodes, print the return (sum of rewards)\n",
    "    if i%10==0:\n",
    "        print('Episode: {}, Return: {}'.format(i, Return))\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Close the environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "env.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 02. A Guide to the Gym Toolkit/README.md
================================================
# 2. A Guide to the Gym Toolkit
* 2.1. Setting Up our Machine
   * 2.1.1. Installing Anaconda
   * 2.1.2. Installing the Gym Toolkit
   * 2.1.3. Common Error Fixes
* 2.2. Creating our First Gym Environment
   * 2.2.1. Exploring the Environment
   * 2.2.2. States
   * 2.2.3. Actions
   * 2.2.4. Transition Probability and Reward Function
* 2.3. Generating an episode
* 2.4. Classic Control Environments
   * 2.4.1. State Space
   * 2.4.2. Action Space
* 2.5. Cart Pole Balancing with Random Policy
* 2.6. Atari Game Environments
   * 2.6.1. General Environment
   * 2.6.2. Deterministic Environment
* 2.7. Agent Playing the Tennis Game
* 2.8. Recording the Game
* 2.9. Other environments
   * 2.9.1. Box2D
   * 2.9.2. MuJoCo
   * 2.9.3. Robotics
   * 2.9.4. Toy text
   * 2.9.5. Algorithms
* 2.10. Environment Synopsis

================================================
FILE: 03. Bellman Equation and Dynamic Programming/.ipynb_checkpoints/3.06. Solving the Frozen Lake Problem with Value Iteration-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Solving the Frozen Lake Problem with Value Iteration\n",
    "\n",
    "In the previous chapter, we learned about the frozen lake environment. The frozen\n",
    "lake environment is shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/4.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's briefly recap the frozen lake environment. In the environment shown above,\n",
    "the letters have the following meanings:\n",
    "\n",
    "* S denotes the starting state\n",
    "* F denotes the frozen states\n",
    "* H denotes the hole states\n",
    "* G denotes the goal state\n",
    "\n",
    "We learned that in the frozen lake environment, our goal is to reach the goal state G from\n",
    "the starting state S without visiting the hole states H. If the agent visits a hole state H\n",
    "on the way from S to G, it falls into the hole and dies, as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, the goal of the agent is to reach the state G starting from the state S without visiting the\n",
    "hole states H as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/6.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How can we achieve this goal? That is, how can we reach the state G from S without\n",
    "visiting H? We learned that the optimal policy tells the agent to perform the correct action in\n",
    "each state. So, if we find the optimal policy, then we can reach the state G from S without visiting the state H. How do we find the optimal policy? We can use the value iteration method\n",
    "we just learned.\n",
    "\n",
    "\n",
    "Remember that in the Gym toolkit, the 16 states (S to G) are encoded from 0 to 15, and the four actions\n",
    "(left, down, right, and up) are encoded from 0 to 3, in that order.\n",
    "So, in this section, we will learn how to find the optimal policy using the value iteration\n",
    "method so that the agent can reach the state G from S without visiting H."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's create the frozen lake environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('FrozenLake-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the frozen lake environment using the render function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[41mS\u001b[0mFFF\n",
      "FHFH\n",
      "FFFH\n",
      "HFFG\n"
     ]
    }
   ],
   "source": [
    "env.render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can notice, our agent is in the state S and it has to reach the state G without visiting\n",
    "the states H. So, let's learn how to compute the optimal policy using the value iteration\n",
    "method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's learn how to compute the optimal value function and then we will see how to\n",
    "extract the optimal policy from the computed optimal value function. \n",
    "\n",
    "\n",
    "## Computing optimal value function\n",
    "\n",
    "We will define a function called `value_iteration` that computes the optimal value\n",
    "function iteratively by taking the maximum over the Q values. The comments inside the\n",
    "function explain each step.\n",
    "\n",
    "Define the `value_iteration` function, which takes the environment as a parameter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def value_iteration(env):\n",
    "\n",
    "    #set the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #set the threshold number for checking the convergence of the value function\n",
    "    threshold = 1e-20\n",
    "    \n",
    "    #we also set the discount factor\n",
    "    gamma = 1.0\n",
    "    \n",
    "    #now, we will initialize the value table, with the value of all states to zero\n",
    "    value_table = np.zeros(env.observation_space.n)\n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        \n",
    "        #keep a copy of the value table from the previous iteration; the Q values below are\n",
    "        #computed from this copy while value_table is overwritten with the new state values\n",
    "        updated_value_table = np.copy(value_table) \n",
    "        \n",
    "        #now, we compute the value function (state value) by taking the maximum Q value.\n",
    "        \n",
    "        #for each state, we compute the Q values of all the actions in the state and then\n",
    "        #set the value of the state to the maximum Q value, as shown below:\n",
    "        for s in range(env.observation_space.n):\n",
    "            \n",
    "            Q_values = [sum([prob*(r + gamma * updated_value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "            \n",
    "            value_table[s] = max(Q_values) \n",
    "        \n",
    "        #after computing the value of all the states, we check whether the total absolute\n",
    "        #difference between the value table of the current iteration and that of the previous\n",
    "        #iteration is less than or equal to the threshold. If it is, we break the loop and\n",
    "        #return the value table as our optimal value function, as shown below:\n",
    "    \n",
    "        if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):\n",
    "            break\n",
    "    \n",
    "    return value_table"
   ]
  },
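  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the update performed inside the loop of `value_iteration` can be sketched as an equation. The notation is an assumption of this note rather than something the code defines: $P(s' \\mid s, a)$ and $R(s, a, s')$ stand for the transition probability and reward read from `env.P[s][a]`, $\\gamma$ is `gamma`, and $V$ on the right-hand side is the value table from the previous iteration:\n",
    "\n",
    "$$Q(s, a) = \\sum_{s'} P(s' \\mid s, a)\\,\\big[R(s, a, s') + \\gamma V(s')\\big], \\qquad V(s) \\leftarrow \\max_{a} Q(s, a)$$\n",
    "\n",
    "The loop stops early once the total absolute change in $V$ over all states is less than or equal to `threshold`."
   ]
  },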
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now that we have computed the optimal value function by taking the maximum over the Q\n",
    "values, let's see how to extract the optimal policy from it. \n",
    "\n",
    "\n",
    "## Extracting optimal policy from the optimal value function\n",
    "\n",
    "We define a function called `extract_policy`, which takes the `value_table` as a\n",
    "parameter: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_policy(value_table):\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "     \n",
    "    #first, we initialize the policy with zeros, that is, we set the action for every state to\n",
    "    #zero\n",
    "    policy = np.zeros(env.observation_space.n) \n",
    "    \n",
    "    #now, we compute the Q function using the optimal value function obtained from the\n",
    "    #previous step and extract the policy by selecting, in each state, the action with the\n",
    "    #maximum Q value. Since the Q function is computed from the optimal value function,\n",
    "    #the extracted policy is the optimal policy.\n",
    "    \n",
    "    #for each state\n",
    "    for s in range(env.observation_space.n):\n",
    "        \n",
    "        #compute the Q value of all the actions in the state\n",
    "        Q_values = [sum([prob*(r + gamma * value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                \n",
    "        #extract policy by selecting the action which has maximum Q value\n",
    "        policy[s] = np.argmax(np.array(Q_values))        \n",
    "    \n",
    "    return policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "That's it! Now, let's find the optimal policy in our frozen lake\n",
    "environment. \n",
    "\n",
    "## Putting it all together\n",
    "We learned that in the frozen lake environment, our goal is to find the optimal policy, which\n",
    "selects the correct action in each state so that we can reach the state G from the state\n",
    "S without visiting the hole states.\n",
    "\n",
    "First, we compute the optimal value function using our `value_iteration` function by\n",
    "passing our frozen lake environment as the parameter: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_value_function = value_iteration(env=env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we extract the optimal policy from the optimal value function using our\n",
    "`extract_policy` function, as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_policy = extract_policy(optimal_value_function)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print the obtained optimal policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(optimal_policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, the optimal policy specifies the action to perform in each of the 16 states\n",
    "(0 is left, 1 is down, 2 is right, and 3 is up). \n",
    "\n",
    "Now that we have learned what value iteration is and how to use it\n",
    "to compute the optimal policy in our frozen lake environment, in the next section\n",
    "we will learn about another interesting method called policy iteration. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 03. Bellman Equation and Dynamic Programming/.ipynb_checkpoints/3.08. Solving the Frozen Lake Problem with Policy Iteration-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Solving the Frozen Lake Problem with Policy Iteration\n",
    "\n",
    "We learned that in the frozen lake environment, our goal is to reach the goal state G from\n",
    "the starting state S without visiting the hole states H. Now, let's learn how to compute the optimal policy using the policy iteration method in the frozen lake environment.\n",
    "\n",
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's create the frozen lake environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('FrozenLake-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We learned that in policy iteration, we alternate between computing the value function of the\n",
    "current policy and extracting a new policy from that value function. Once the policy stops\n",
    "changing, its value function is the optimal value function and the policy is the optimal policy.\n",
    "\n",
    "So, first, let's learn how to compute the value function using the policy. \n",
    "\n",
    "\n",
    "## Computing value function using policy\n",
    "\n",
    "This step is similar to how we computed the value function in the value iteration\n",
    "method, with one difference: here, we compute the value function using the action selected by the policy,\n",
    "whereas in the value iteration method, we compute it by taking the maximum\n",
    "over the Q values. Now, let's define a function that computes the value function\n",
    "of the given policy.\n",
    "\n",
    "\n",
    "Let's define a function called `compute_value_function`, which takes the policy as a\n",
    "parameter:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_value_function(policy):\n",
    "    \n",
    "    #now, let's define the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #define the threshold value\n",
    "    threshold = 1e-20\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "    \n",
    "    #now, we will initialize the value table, with the value of all states to zero\n",
    "    value_table = np.zeros(env.observation_space.n)\n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        \n",
    "        #keep a copy of the value table from the previous iteration; the state values below are\n",
    "        #computed from this copy while value_table is overwritten with the new state values\n",
    "        updated_value_table = np.copy(value_table)\n",
    "        \n",
    "        #for each state, we select the action according to the given policy and then update the\n",
    "        #value of the state using the selected action, as shown below\n",
    "        \n",
    "        #for each state\n",
    "        for s in range(env.observation_space.n):\n",
    "            \n",
    "            #select the action in the state according to the policy\n",
    "            a = policy[s]\n",
    "            \n",
    "            #compute the value of the state using the selected action\n",
    "            value_table[s] = sum([prob * (r + gamma * updated_value_table[s_]) \n",
    "                                        for prob, s_, r, _ in env.P[s][a]])\n",
    "            \n",
    "        #after computing the value of all the states, we check whether the total absolute\n",
    "        #difference between the value table of the current iteration and that of the previous\n",
    "        #iteration is less than or equal to the threshold. If it is, we break the loop and\n",
    "        #return the value table as an accurate value function of the given policy\n",
    "\n",
    "        if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):\n",
    "            break\n",
    "            \n",
    "    return value_table"
   ]
  },
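  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the update performed inside the loop of `compute_value_function` can be sketched as an equation. The notation is an assumption of this note rather than something the code defines: $\\pi(s)$ is the action `policy[s]`, $P$ and $R$ stand for the transition probability and reward read from `env.P[s][a]`, $\\gamma$ is `gamma`, and $V^{\\pi}$ on the right-hand side is the value table from the previous iteration:\n",
    "\n",
    "$$V^{\\pi}(s) \\leftarrow \\sum_{s'} P(s' \\mid s, \\pi(s))\\,\\big[R(s, \\pi(s), s') + \\gamma V^{\\pi}(s')\\big]$$\n",
    "\n",
    "Compared with value iteration, there is no maximum over actions: the action in each state is fixed by the policy."
   ]
  },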
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now that we have computed the value function of the policy, let's see how to extract the\n",
    "policy from the value function. \n",
    "\n",
    "## Extracting policy from the value function\n",
    "\n",
    "This step is exactly the same as how we extracted the policy from the value function in the\n",
    "value iteration method. Thus, we\n",
    "define a function called `extract_policy` to extract a policy given the value function, as\n",
    "shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_policy(value_table):\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "     \n",
    "    #first, we initialize the policy with zeros, that is, we set the action for every state to\n",
    "    #zero\n",
    "    policy = np.zeros(env.observation_space.n) \n",
    "    \n",
    "    #now, we compute the Q function using the value function of the current policy, obtained\n",
    "    #from the previous step, and extract a new policy by selecting, in each state, the action\n",
    "    #with the maximum Q value. Once policy iteration converges, this value function is the\n",
    "    #optimal value function and the extracted policy is the optimal policy.\n",
    "    \n",
    "    #for each state\n",
    "    for s in range(env.observation_space.n):\n",
    "        \n",
    "        #compute the Q value of all the actions in the state\n",
    "        Q_values = [sum([prob*(r + gamma * value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                \n",
    "        #extract policy by selecting the action which has maximum Q value\n",
    "        policy[s] = np.argmax(np.array(Q_values))        \n",
    "    \n",
    "    return policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Putting it all together\n",
    "\n",
    "First, let's define a function called `policy_iteration`, which takes the environment as a\n",
    "parameter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy_iteration(env):\n",
    "    \n",
    "    #set the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #we learned that in the policy iteration method, we begin by initializing an arbitrary policy.\n",
    "    #here, we initialize a policy that selects the action 0 in all the states\n",
    "    policy = np.zeros(env.observation_space.n)  \n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        #compute the value function using the policy\n",
    "        value_function = compute_value_function(policy)\n",
    "        \n",
    "        #extract the new policy from the computed value function\n",
    "        new_policy = extract_policy(value_function)\n",
    "           \n",
    "        #if the policy and new_policy are same then break the loop\n",
    "        if (np.all(policy == new_policy)):\n",
    "            break\n",
    "        \n",
    "        #else, update the current policy to new_policy\n",
    "        policy = new_policy\n",
    "        \n",
    "    return policy\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, let's learn how to perform policy iteration and find the optimal policy in the frozen\n",
    "lake environment. \n",
    "\n",
    "So, we just feed the frozen lake environment to our `policy_iteration`\n",
    "function as shown below and get the optimal policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_policy = policy_iteration(env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print the optimal policy: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(optimal_policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, our optimal policy tells us to perform the correct action in each\n",
    "state. Thus, we learned how to perform the policy iteration method to compute the optimal\n",
    "policy. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 03. Bellman Equation and Dynamic Programming/3.06. Solving the Frozen Lake Problem with Value Iteration.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Solving the Frozen Lake Problem with Value Iteration\n",
    "\n",
    "In the previous chapter, we learned about the frozen lake environment. The frozen\n",
    "lake environment is shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/4.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's briefly recap the frozen lake environment. In the environment shown above,\n",
    "the letters have the following meanings:\n",
    "\n",
    "* S denotes the starting state\n",
    "* F denotes the frozen states\n",
    "* H denotes the hole states\n",
    "* G denotes the goal state\n",
    "\n",
    "We learned that in the frozen lake environment, our goal is to reach the goal state G from\n",
    "the starting state S without visiting the hole states H. If the agent visits a hole state H\n",
    "on the way from S to G, it falls into the hole and dies, as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, the goal of the agent is to reach the state G starting from the state S without visiting the\n",
    "hole states H as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![title](Images/6.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How can we achieve this goal? That is, how can we reach the state G from S without\n",
    "visiting H? We learned that the optimal policy tells the agent to perform the correct action in\n",
    "each state. So, if we find the optimal policy, then we can reach the state G from S without visiting the state H. How do we find the optimal policy? We can use the value iteration method\n",
    "we just learned.\n",
    "\n",
    "\n",
    "Remember that in the Gym toolkit, the 16 states (S to G) are encoded from 0 to 15, and the four actions\n",
    "(left, down, right, and up) are encoded from 0 to 3, in that order.\n",
    "So, in this section, we will learn how to find the optimal policy using the value iteration\n",
    "method so that the agent can reach the state G from S without visiting H."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's create the frozen lake environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('FrozenLake-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the frozen lake environment using the render function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[41mS\u001b[0mFFF\n",
      "FHFH\n",
      "FFFH\n",
      "HFFG\n"
     ]
    }
   ],
   "source": [
    "env.render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can notice, our agent is in the state S and it has to reach the state G without visiting\n",
    "the states H. So, let's learn how to compute the optimal policy using the value iteration\n",
    "method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's learn how to compute the optimal value function and then we will see how to\n",
    "extract the optimal policy from the computed optimal value function. \n",
    "\n",
    "\n",
    "## Computing optimal value function\n",
    "\n",
    "We will define a function called `value_iteration` that computes the optimal value\n",
    "function iteratively by taking the maximum over the Q values. The comments inside the\n",
    "function explain each step.\n",
    "\n",
    "Define the `value_iteration` function, which takes the environment as a parameter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def value_iteration(env):\n",
    "\n",
    "    #set the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #set the threshold number for checking the convergence of the value function\n",
    "    threshold = 1e-20\n",
    "    \n",
    "    #we also set the discount factor\n",
    "    gamma = 1.0\n",
    "    \n",
    "    #now, we will initialize the value table, with the value of all states to zero\n",
    "    value_table = np.zeros(env.observation_space.n)\n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        \n",
    "        #keep a copy of the value table from the previous iteration; the Q values below are\n",
    "        #computed from this copy while value_table is overwritten with the new state values\n",
    "        updated_value_table = np.copy(value_table) \n",
    "        \n",
    "        #now, we compute the value function (state value) by taking the maximum Q value.\n",
    "        \n",
    "        #for each state, we compute the Q values of all the actions in the state and then\n",
    "        #set the value of the state to the maximum Q value, as shown below:\n",
    "        for s in range(env.observation_space.n):\n",
    "            \n",
    "            Q_values = [sum([prob*(r + gamma * updated_value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "            \n",
    "            value_table[s] = max(Q_values) \n",
    "        \n",
    "        #after computing the value of all the states, we check whether the total absolute\n",
    "        #difference between the value table of the current iteration and that of the previous\n",
    "        #iteration is less than or equal to the threshold. If it is, we break the loop and\n",
    "        #return the value table as our optimal value function, as shown below:\n",
    "    \n",
    "        if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):\n",
    "            break\n",
    "    \n",
    "    return value_table"
   ]
  },
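  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the update performed inside the loop of `value_iteration` can be sketched as an equation. The notation is an assumption of this note rather than something the code defines: $P(s' \\mid s, a)$ and $R(s, a, s')$ stand for the transition probability and reward read from `env.P[s][a]`, $\\gamma$ is `gamma`, and $V$ on the right-hand side is the value table from the previous iteration:\n",
    "\n",
    "$$Q(s, a) = \\sum_{s'} P(s' \\mid s, a)\\,\\big[R(s, a, s') + \\gamma V(s')\\big], \\qquad V(s) \\leftarrow \\max_{a} Q(s, a)$$\n",
    "\n",
    "The loop stops early once the total absolute change in $V$ over all states is less than or equal to `threshold`."
   ]
  },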
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now that we have computed the optimal value function by taking the maximum over the Q\n",
    "values, let's see how to extract the optimal policy from it. \n",
    "\n",
    "\n",
    "## Extracting optimal policy from the optimal value function\n",
    "\n",
    "We define a function called `extract_policy`, which takes the `value_table` as a\n",
    "parameter: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_policy(value_table):\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "     \n",
    "    #first, we initialize the policy with zeros, that is, we set the action for every state to\n",
    "    #zero\n",
    "    policy = np.zeros(env.observation_space.n) \n",
    "    \n",
    "    #now, we compute the Q function using the optimal value function obtained from the\n",
    "    #previous step and extract the policy by selecting, in each state, the action with the\n",
    "    #maximum Q value. Since the Q function is computed from the optimal value function,\n",
    "    #the extracted policy is the optimal policy.\n",
    "    \n",
    "    #for each state\n",
    "    for s in range(env.observation_space.n):\n",
    "        \n",
    "        #compute the Q value of all the actions in the state\n",
    "        Q_values = [sum([prob*(r + gamma * value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                \n",
    "        #extract policy by selecting the action which has maximum Q value\n",
    "        policy[s] = np.argmax(np.array(Q_values))        \n",
    "    \n",
    "    return policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "That's it! Now, we will see how to extract the optimal policy in our frozen lake\n",
    "environment. \n",
    "\n",
    "## Putting it all together\n",
    "We learn that in the frozen lake environment our goal is to find the optimal policy which\n",
    "selects the correct action in each state so that we can reach the state G from the state\n",
    "A without visiting the hole states.\n",
    "\n",
    "First, we compute the optimal value function using our `value_iteration` function by\n",
    "passing our frozen lake environment as the parameter: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_value_function = value_iteration(env=env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we extract the optimal policy from the optimal value function using our\n",
    "extract_policy function as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_policy = extract_policy(optimal_value_function)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print the obtained optimal policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(optimal_policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, our optimal policy tells us to\n",
    "perform the correct action in each state. \n",
    "\n",
    "Now, that we have learned what is value iteration and how to perform the value iteration\n",
    "method to compute the optimal policy in our frozen lake environment, in the next section\n",
    "we will learn about another interesting method called the policy iteration. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 03. Bellman Equation and Dynamic Programming/3.08. Solving the Frozen Lake Problem with Policy Iteration.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Solving the Frozen Lake Problem with Policy Iteration\n",
    "\n",
    "We learned that in the frozen lake environment, our goal is to reach the goal state G from\n",
    "the starting state S without visiting the hole states H. Now, let's learn how to compute the optimal policy using the policy iteration method in the frozen lake environment.\n",
    "\n",
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's create the frozen lake environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('FrozenLake-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We learned that in the policy iteration, we compute the value function using the policy\n",
    "iteratively. Once we found the optimal value function then the policy which is used to\n",
    "compute the optimal value function will be the optimal policy.\n",
    "\n",
    "So, first, let's learn how to compute the value function using the policy. \n",
    "\n",
    "\n",
    "## Computing value function using policy\n",
    "\n",
    "This step is exactly the same as how we computed the value function in the value iteration\n",
    "method but with a small difference. Here we compute the value function using the policy\n",
    "but in the value iteration method, we compute the value function by taking the maximum\n",
    "over Q values. Now, let's learn how to define a function that computes the value function\n",
    "using the given policy.\n",
    "\n",
    "\n",
    "Let's define a function called `compute_value_function` which takes the policy as a\n",
    "parameter:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_value_function(policy):\n",
    "    \n",
    "    #now, let's define the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #define the threshold value\n",
    "    threshold = 1e-20\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "    \n",
    "    #now, we will initialize the value table, with the value of all states to zero\n",
    "    value_table = np.zeros(env.observation_space.n)\n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        \n",
    "        #update the value table, that is, we learned that on every iteration, we use the updated value\n",
    "        #table (state values) from the previous iteration\n",
    "        updated_value_table = np.copy(value_table)\n",
    "        \n",
    "        \n",
    "\n",
    "        #thus, for each state, we select the action according to the given policy and then we update the\n",
    "        #value of the state using the selected action as shown below\n",
    "        \n",
    "        #for each state\n",
    "        for s in range(env.observation_space.n):\n",
    "            \n",
    "            #select the action in the state according to the policy\n",
    "            a = policy[s]\n",
    "            \n",
    "            #compute the value of the state using the selected action\n",
    "            value_table[s] = sum([prob * (r + gamma * updated_value_table[s_]) \n",
    "                                        for prob, s_, r, _ in env.P[s][a]])\n",
    "            \n",
    "        #after computing the value table, that is, value of all the states, we check whether the\n",
    "        #difference between value table obtained in the current iteration and previous iteration is\n",
    "        #less than or equal to a threshold value if it is less then we break the loop and return the\n",
    "        #value table as an accurate value function of the given policy\n",
    "\n",
    "        if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):\n",
    "            break\n",
    "            \n",
    "    return value_table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now that we have computed the value function of the policy, let's see how to extract the\n",
    "policy from the value function. \n",
    "\n",
    "## Extracting policy from the value function\n",
    "\n",
    "This step is exactly the same as how we extracted policy from the value function in the\n",
    "value iteration method. Thus, similar to what we learned in the value iteration method, we\n",
    "define a function called `extract_policy` to extract a policy given the value function as\n",
    "shown below:\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_policy(value_table):\n",
    "    \n",
    "    #set the discount factor\n",
    "    gamma = 1.0\n",
    "     \n",
    "    #first, we initialize the policy with zeros, that is, first, we set the actions for all the states to\n",
    "    #be zero\n",
    "    policy = np.zeros(env.observation_space.n) \n",
    "    \n",
    "    #now, we compute the Q function using the optimal value function obtained from the\n",
    "    #previous step. After computing the Q function, we can extract policy by selecting action which has\n",
    "    #maximum Q value. Since we are computing the Q function using the optimal value\n",
    "    #function, the policy extracted from the Q function will be the optimal policy. \n",
    "    \n",
    "    #As shown below, for each state, we compute the Q values for all the actions in the state and\n",
    "    #then we extract policy by selecting the action which has maximum Q value.\n",
    "    \n",
    "    #for each state\n",
    "    for s in range(env.observation_space.n):\n",
    "        \n",
    "        #compute the Q value of all the actions in the state\n",
    "        Q_values = [sum([prob*(r + gamma * value_table[s_])\n",
    "                             for prob, s_, r, _ in env.P[s][a]]) \n",
    "                                   for a in range(env.action_space.n)] \n",
    "                \n",
    "        #extract policy by selecting the action which has maximum Q value\n",
    "        policy[s] = np.argmax(np.array(Q_values))        \n",
    "    \n",
    "    return policy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Putting it all together\n",
    "\n",
    "First, let's define a function called `policy_iteration` which takes the environment as a\n",
    "parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy_iteration(env):\n",
    "    \n",
    "    #set the number of iterations\n",
    "    num_iterations = 1000\n",
    "    \n",
    "    #we learned that in the policy iteration method, we begin by initializing a random policy.\n",
    "    #so, we will initialize the random policy which selects the action 0 in all the states\n",
    "    policy = np.zeros(env.observation_space.n)  \n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        #compute the value function using the policy\n",
    "        value_function = compute_value_function(policy)\n",
    "        \n",
    "        #extract the new policy from the computed value function\n",
    "        new_policy = extract_policy(value_function)\n",
    "           \n",
    "        #if the policy and new_policy are same then break the loop\n",
    "        if (np.all(policy == new_policy)):\n",
    "            break\n",
    "        \n",
    "        #else, update the current policy to new_policy\n",
    "        policy = new_policy\n",
    "        \n",
    "    return policy\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, let's learn how to perform policy iteration and find the optimal policy in the frozen\n",
    "lake environment. \n",
    "\n",
    "So, we just feed the frozen lake environment to our `policy_iteration`\n",
    "function as shown below and get the optimal policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_policy = policy_iteration(env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print the optimal policy: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(optimal_policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can observe, our optimal policy tells us to perform the correct action in each\n",
    "state. Thus, we learned how to perform the policy iteration method to compute the optimal\n",
    "policy. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 03. Bellman Equation and Dynamic Programming/README.md
================================================
# 3. Bellman Equation and Dynamic Programming
* 3.1. The Bellman Equation
   * 3.1.1. Bellman Equation of the Value Function
   * 3.1.2. Bellman Equation of the Q Function
* 3.2. Bellman Optimality Equation
* 3.3. Relation Between Value and Q Function
* 3.4. Dynamic Programming
* 3.5. Value Iteration
   * 3.5.1. Algorithm - Value Iteration
* 3.6. Solving the Frozen Lake Problem with Value Iteration
* 3.7. Policy Iteration
   * 3.7.1. Algorithm - Policy Iteration
* 3.8. Solving the Frozen Lake Problem with Policy Iteration
* 3.9. Is DP Applicable to all Environments?
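
The value iteration and policy iteration entries above can be tried out without installing Gym. The following is only a minimal sketch under stated assumptions: it uses a hypothetical three-state chain in place of the Frozen Lake environment, a hand-written transition table `P` that mimics the `env.P[s][a]` layout of `(probability, next_state, reward, done)` tuples used in the notebooks, and an illustrative discount factor of 0.9. The helper names `q_values` and `evaluate_policy` are not taken from the notebooks.

```python
import numpy as np

# Hypothetical 3-state chain (NOT Frozen Lake): state 2 is a terminal goal.
# Actions: 0 = left, 1 = right.
# P[s][a] is a list of (probability, next_state, reward, done) tuples.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
N_STATES, N_ACTIONS = 3, 2
GAMMA = 0.9  # illustrative discount factor (an assumption of this sketch)


def q_values(s, v):
    """Q value of every action in state s, given state values v."""
    return [sum(p * (r + GAMMA * v[s_]) for p, s_, r, _ in P[s][a])
            for a in range(N_ACTIONS)]


def value_iteration(threshold=1e-10, max_iterations=1000):
    """Repeatedly set each state value to the maximum Q value until convergence."""
    v = np.zeros(N_STATES)
    for _ in range(max_iterations):
        new_v = np.array([max(q_values(s, v)) for s in range(N_STATES)])
        converged = np.sum(np.fabs(new_v - v)) <= threshold
        v = new_v
        if converged:
            break
    return v


def extract_policy(v):
    """Greedy policy: pick the action with the maximum Q value in each state."""
    return np.array([int(np.argmax(q_values(s, v))) for s in range(N_STATES)])


def evaluate_policy(policy, threshold=1e-10, max_iterations=1000):
    """Value function of a fixed policy, computed by iterating to convergence."""
    v = np.zeros(N_STATES)
    for _ in range(max_iterations):
        new_v = np.array([q_values(s, v)[policy[s]] for s in range(N_STATES)])
        converged = np.sum(np.fabs(new_v - v)) <= threshold
        v = new_v
        if converged:
            break
    return v


def policy_iteration(max_iterations=1000):
    """Alternate policy evaluation and greedy extraction until the policy is stable."""
    policy = np.zeros(N_STATES, dtype=int)  # arbitrary start: action 0 everywhere
    for _ in range(max_iterations):
        new_policy = extract_policy(evaluate_policy(policy))
        if np.array_equal(policy, new_policy):
            break
        policy = new_policy
    return policy


optimal_values = value_iteration()
vi_policy = extract_policy(optimal_values)
pi_policy = policy_iteration()
print(optimal_values, vi_policy, pi_policy)
```

On this toy chain both methods should agree: moving right is the greedy choice in states 0 and 1, and the terminal state keeps a value of zero. The policy is stored as an integer array here so that it can index Python lists directly, which is a design choice of this sketch.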

================================================
FILE: 04. Monte Carlo Methods/.ipynb_checkpoints/4.01. Understanding the Monte Carlo Method-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Understanding the Monte Carlo method\n",
    "\n",
    "Before understanding how the Monte Carlo method is useful in reinforcement learning, first, let's understand what is Monte Carlo method and how does it work. The Monte Carlo method is a statistical technique used to find an approximate solution through sampling. \n",
    "\n",
    "For instance, the Monte Carlo method approximates the expectation of a random variable by sampling and when the sample size is greater the approximation will be better. Let's suppose we have a random variable X and say we need to compute the expected value of X, that is E[X], then we can compute it by taking the sum of values of X multiplied by their respective probabilities as shown below:\n",
    "\n",
    "$$ E(X) = \\sum_{i=1}^N x_i p(x_i) $$\n",
    "\n",
    "But instead of computing the expectation like this, can we approximate them with the Monte Carlo method? Yes! We can estimate the expected value of X by just sampling the values of X for some N times and compute the average value of X as the expected value of X as shown below:\n",
    "\n",
    "$$ \\mathbb{E}_{x \\sim p(x)}[X]  \\approx \\frac{1}{N} \\sum_i x_i $$\n",
    "\n",
    "\n",
    "When N is larger our approximation will be better. Thus, with the Monte Carlo method, we can approximate the solution through sampling and our approximation will be better when the sample size is large.\n",
    "\n",
    "In the upcoming sections, we will learn how exactly the Monte Carlo method is used in reinforcement learning. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: 04. Monte Carlo Methods/.ipynb_checkpoints/4.02.  Prediction and control tasks-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prediction and control tasks\n",
    "\n",
    "In reinforcement learning, we perform two important tasks, and they are:\n",
    "* The prediction task\n",
    "* The control task\n",
    "\n",
    "## Prediction task\n",
    "In the prediction task, a policy π is given as an input and we try to predict the value\n",
    "function or Q function using the given policy. But what is the use of doing this? Our\n",
    "goal is to evaluate the given policy.That is, we need to determine whether the given policy is good or bad.  How can we determine that? If the agent obtains\n",
    "a good return using the given policy then we can say that our policy is good. Thus,\n",
    "to evaluate the given policy, we need to understand what the return the agent would\n",
    "obta

Condensed preview — 141 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (1,290K chars).

[
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.01. Basic Idea of Reinforcement Learning -checkpoint.ipynb",
    "chars": 4876,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction\\n\",\n    \"\\n\",\n    \"\\"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.02. Key Elements of Reinforcement Learning -checkpoint.ipynb",
    "chars": 4040,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Basic Idea of Reinforcement Learn"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.03. Reinforcement Learning Algorithm-checkpoint.ipynb",
    "chars": 2003,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Reinforcement Learning algorithm\\"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.04. RL agent in the Grid World -checkpoint.ipynb",
    "chars": 6616,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# RL agent in the Grid World \\n\",\n "
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.05. How RL differs from other ML paradigms?-checkpoint.ipynb",
    "chars": 4684,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# How RL differs from other ML para"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.06. Markov Decision Processes-checkpoint.ipynb",
    "chars": 9257,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Markov Decision Processes \\n\",\n  "
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.07. Action space, Policy, Episode and Horizon-checkpoint.ipynb",
    "chars": 14041,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Action space, Policy, Episode, Ho"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.01. Key Elements of Reinforcement Learning .ipynb",
    "chars": 4876,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction\\n\",\n    \"\\n\",\n    \"\\"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.02. Basic Idea of Reinforcement Learning.ipynb",
    "chars": 4040,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Basic Idea of Reinforcement Learn"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.03. Reinforcement Learning Algorithm.ipynb",
    "chars": 2003,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Reinforcement Learning algorithm\\"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.04. RL agent in the Grid World .ipynb",
    "chars": 6616,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# RL agent in the Grid World \\n\",\n "
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.05. How RL differs from other ML paradigms?.ipynb",
    "chars": 4684,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# How RL differs from other ML para"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.06. Markov Decision Processes.ipynb",
    "chars": 9257,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Markov Decision Processes \\n\",\n  "
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.07. Action space, Policy, Episode and Horizon.ipynb",
    "chars": 14041,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Action space, Policy, Episode, Ho"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.08.  Return, Discount Factor and Math Essentials.ipynb",
    "chars": 11737,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Return, Discount Factor and Math "
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.09 Value function and Q function.ipynb",
    "chars": 11943,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Value function\\n\",\n    \"\\n\",\n    "
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.10. Model-Based and Model-Free Learning .ipynb",
    "chars": 1750,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Model-Based and Model-free learni"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.11. Different Types of Environments.ipynb",
    "chars": 3961,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Different types of environments\\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.12. Applications of Reinforcement Learning.ipynb",
    "chars": 3465,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Applications of Reinforcement Le"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.13. Reinforcement Learning Glossary.ipynb",
    "chars": 5228,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Reinforcement Learning Glossary \\"
  },
  {
    "path": "02. A Guide to the Gym Toolkit/2.02.  Creating our First Gym Environment.ipynb",
    "chars": 13368,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Creating our first gym environmen"
  },
  {
    "path": "02. A Guide to the Gym Toolkit/2.05. Cart Pole Balancing with Random Policy.ipynb",
    "chars": 3446,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Cart Pole Balancing with Random P"
  },
  {
    "path": "02. A Guide to the Gym Toolkit/README.md",
    "chars": 819,
    "preview": "# 2. A Guide to the Gym Toolkit\n* 2.1. Setting Up our Machine\n   * 2.1.1. Installing Anaconda\n   * 2.1.2. Installing the"
  },
  {
    "path": "03. Bellman Equation and Dynamic Programming/.ipynb_checkpoints/3.06. Solving the Frozen Lake Problem with Value Iteration-checkpoint.ipynb",
    "chars": 11949,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"# Solving"
  },
  {
    "path": "03. Bellman Equation and Dynamic Programming/.ipynb_checkpoints/3.08. Solving the Frozen Lake Problem with Policy Iteration-checkpoint.ipynb",
    "chars": 9850,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Solving the Frozen Lake Problem w"
  },
  {
    "path": "03. Bellman Equation and Dynamic Programming/3.06. Solving the Frozen Lake Problem with Value Iteration.ipynb",
    "chars": 11949,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"# Solving"
  },
  {
    "path": "03. Bellman Equation and Dynamic Programming/3.08. Solving the Frozen Lake Problem with Policy Iteration.ipynb",
    "chars": 9850,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Solving the Frozen Lake Problem w"
  },
  {
    "path": "03. Bellman Equation and Dynamic Programming/README.md",
    "chars": 572,
    "preview": "# 3. Bellman Equation and Dynamic Programming\n* 3.1. The Bellman Equation\n   * 3.1.1. Bellman Equation of the Value Func"
  },
  {
    "path": "04. Monte Carlo Methods/.ipynb_checkpoints/4.01. Understanding the Monte Carlo Method-checkpoint.ipynb",
    "chars": 2064,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Understanding the Monte Carlo met"
  },
  {
    "path": "04. Monte Carlo Methods/.ipynb_checkpoints/4.02.  Prediction and control tasks-checkpoint.ipynb",
    "chars": 3936,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Prediction and control tasks\\n\",\n"
  },
  {
    "path": "04. Monte Carlo Methods/.ipynb_checkpoints/4.05. Every-visit MC Prediction with Blackjack Game-checkpoint.ipynb",
    "chars": 20702,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Every-visit MC prediction with bl"
  },
  {
    "path": "04. Monte Carlo Methods/.ipynb_checkpoints/4.06. First-visit MC Prediction with Blackjack Game-checkpoint.ipynb",
    "chars": 20667,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# First-visit MC prediction with bl"
  },
  {
    "path": "04. Monte Carlo Methods/.ipynb_checkpoints/4.13. Implementing On-Policy MC Control-checkpoint.ipynb",
    "chars": 12377,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing On-policy MC control"
  },
  {
    "path": "04. Monte Carlo Methods/4.13. Implementing On-Policy MC Control.ipynb",
    "chars": 12377,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing On-policy MC control"
  },
  {
    "path": "04. Monte Carlo Methods/README.md",
    "chars": 925,
    "preview": "### 4. Monte Carlo Methods\n* 4.1. Understanding the Monte Carlo Method\n* 4.2. Prediction and Control Tasks\n   * 4.2.1. P"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/.ipynb_checkpoints/5.03. Predicting the Value of States in a Frozen Lake Environment-checkpoint.ipynb",
    "chars": 9655,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Predicting the value of states in"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/.ipynb_checkpoints/5.06. Computing Optimal Policy using SARSA-checkpoint.ipynb",
    "chars": 4984,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Computing optimal policy using SA"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/.ipynb_checkpoints/5.08. Computing the Optimal Policy using Q Learning-checkpoint.ipynb",
    "chars": 4885,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Computing the optimal policy usin"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/5.03. Predicting the Value of States in a Frozen Lake Environment.ipynb",
    "chars": 9655,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Predicting the value of states in"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/5.06. Computing Optimal Policy using SARSA.ipynb",
    "chars": 4984,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Computing optimal policy using SA"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/5.08. Computing the Optimal Policy using Q Learning.ipynb",
    "chars": 4885,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Computing the optimal policy usin"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/README.md",
    "chars": 477,
    "preview": "### 5. Understanding Temporal Difference Learning\n* 5.1. TD Learning\n* 5.2. TD Prediction\n   * 5.2.1. TD Prediction Algo"
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.01 .The MAB Problem-checkpoint.ipynb",
    "chars": 5818,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# The MAB problem\\n\",\n    \"\\n\",\n   "
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.04. Implementing epsilon-greedy -checkpoint.ipynb",
    "chars": 6031,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing epsilon-greedy \\n\",\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.06. Implementing Softmax Exploration-checkpoint.ipynb",
    "chars": 6593,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing Softmax Exploration\\"
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.08. Implementing UCB-checkpoint.ipynb",
    "chars": 6540,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing UCB\\n\",\n    \"\\n\",\n  "
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.1-checkpoint.ipynb",
    "chars": 5803,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# The MAB problem\\n\",\n    \"\\n\",\n   "
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.10. Implementing Thompson Sampling-checkpoint.ipynb",
    "chars": 6966,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing Thompson sampling\\n\""
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.12. Finding the Best Advertisement Banner using Bandits-checkpoint.ipynb",
    "chars": 22701,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Finding the best advertisement ba"
  },
  {
    "path": "06. Case Study: The MAB Problem/6.01 .The MAB Problem.ipynb",
    "chars": 5818,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# The MAB problem\\n\",\n    \"\\n\",\n   "
  },
  {
    "path": "06. Case Study: The MAB Problem/6.03. Epsilon-Greedy.ipynb",
    "chars": 3921,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Epsilon-greedy\\n\",\n    \"\\n\",\n    "
  },
  {
    "path": "06. Case Study: The MAB Problem/6.04. Implementing epsilon-greedy .ipynb",
    "chars": 6031,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing epsilon-greedy \\n\",\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/6.06. Implementing Softmax Exploration.ipynb",
    "chars": 6593,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing Softmax Exploration\\"
  },
  {
    "path": "06. Case Study: The MAB Problem/6.08. Implementing UCB.ipynb",
    "chars": 6540,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing UCB\\n\",\n    \"\\n\",\n  "
  },
  {
    "path": "06. Case Study: The MAB Problem/6.10. Implementing Thompson Sampling.ipynb",
    "chars": 6966,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing Thompson sampling\\n\""
  },
  {
    "path": "06. Case Study: The MAB Problem/6.12. Finding the Best Advertisement Banner using Bandits.ipynb",
    "chars": 22701,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Finding the best advertisement ba"
  },
  {
    "path": "06. Case Study: The MAB Problem/README.md",
    "chars": 447,
    "preview": "# 6. Case Study: The MAB Problem\n* 6.1. The MAB Problem\n* 6.2. Creating Bandit in the Gym\n* 6.3. Epsilon-Greedy\n* 6.4. I"
  },
  {
    "path": "07. Deep learning foundations/.ipynb_checkpoints/7.05 Building Neural Network from scratch-checkpoint.ipynb",
    "chars": 22669,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Building Neural Network from Scra"
  },
  {
    "path": "07. Deep learning foundations/7.05 Building Neural Network from scratch.ipynb",
    "chars": 22669,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Building Neural Network from Scra"
  },
  {
    "path": "07. Deep learning foundations/README.md",
    "chars": 842,
    "preview": "# [Chapter 7. Deep Learning Foundations](#)\n\n* 7.1. Biological and artifical neurons\n* 7.2. ANN and its layers \n* 7.3. E"
  },
  {
    "path": "08. A primer on TensorFlow/.ipynb_checkpoints/8.05 Handwritten digits classification using TensorFlow-checkpoint.ipynb",
    "chars": 23951,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Handwritten digits classification"
  },
  {
    "path": "08. A primer on TensorFlow/.ipynb_checkpoints/8.10 MNIST digits classification in TensorFlow 2.0-checkpoint.ipynb",
    "chars": 6836,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# MNIST digit classification in Ten"
  },
  {
    "path": "08. A primer on TensorFlow/8.05 Handwritten digits classification using TensorFlow.ipynb",
    "chars": 23951,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Handwritten digits classification"
  },
  {
    "path": "08. A primer on TensorFlow/8.08 Math operations in TensorFlow.ipynb",
    "chars": 21364,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Math operations in TensorFlow\\n\","
  },
  {
    "path": "08. A primer on TensorFlow/8.10 MNIST digits classification in TensorFlow 2.0.ipynb",
    "chars": 6836,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# MNIST digit classification in Ten"
  },
  {
    "path": "08. A primer on TensorFlow/README.md",
    "chars": 802,
    "preview": "\n\n# [Chapter 8. Getting to Know TensorFlow](#)\n\n* 8.1. What is TensorFlow?\n* 8.2. Understanding Computational Graphs and"
  },
  {
    "path": "09.  Deep Q Network and its Variants/.ipynb_checkpoints/7.03. Playing Atari Games using DQN-Copy1-checkpoint.ipynb",
    "chars": 14609,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\""
  },
  {
    "path": "09.  Deep Q Network and its Variants/.ipynb_checkpoints/7.03. Playing Atari Games using DQN-checkpoint.ipynb",
    "chars": 15570,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using DQN \\n\""
  },
  {
    "path": "09.  Deep Q Network and its Variants/.ipynb_checkpoints/9.03. Playing Atari Games using DQN-checkpoint.ipynb",
    "chars": 15913,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using DQN \\n\""
  },
  {
    "path": "09.  Deep Q Network and its Variants/9.03. Playing Atari Games using DQN.ipynb",
    "chars": 15913,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using DQN \\n\""
  },
  {
    "path": "09.  Deep Q Network and its Variants/READEME.md",
    "chars": 675,
    "preview": "# 9. Deep Q Network and its Variants\n\n* 9.1. What is Deep Q Network?\n* 9.2. Understanding DQN\n   * 9.2.1. Replay Buffer\n"
  },
  {
    "path": "10. Policy Gradient Method/.ipynb_checkpoints/10.01. Why Policy based Methods-checkpoint.ipynb",
    "chars": 6984,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Why policy-based methods?\\n\",\n  "
  },
  {
    "path": "10. Policy Gradient Method/.ipynb_checkpoints/10.02. Policy Gradient Intuition-checkpoint.ipynb",
    "chars": 4603,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Policy gradient intuition\\n\",\n   "
  },
  {
    "path": "10. Policy Gradient Method/.ipynb_checkpoints/10.07. Cart Pole Balancing with Policy Gradient-checkpoint.ipynb",
    "chars": 12439,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Cart pole balancing with policy g"
  },
  {
    "path": "10. Policy Gradient Method/.ipynb_checkpoints/8.07. Cart Pole Balancing with Policy Gradient-checkpoint.ipynb",
    "chars": 12968,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Cart pole balancing with policy g"
  },
  {
    "path": "10. Policy Gradient Method/10.07. Cart Pole Balancing with Policy Gradient.ipynb",
    "chars": 12439,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Cart pole balancing with policy g"
  },
  {
    "path": "10. Policy Gradient Method/README.md",
    "chars": 480,
    "preview": "# 10. Policy Gradient Method\n* 10.1. Why Policy Based Methods?\n* 10.2. Policy Gradient Intuition\n* 10.3. Understanding t"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/.ipynb_checkpoints/11.01. Overview of actor critic method-checkpoint.ipynb",
    "chars": 3482,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Overview of actor critic method\\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/.ipynb_checkpoints/11.05. Mountain Car Climbing using A3C-checkpoint.ipynb",
    "chars": 25577,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Mountain car climbing using A3C\\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/.ipynb_checkpoints/9.05. Mountain Car Climbing using A3C-checkpoint.ipynb",
    "chars": 27575,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Mountain car climbing using A3C\\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/11.05. Mountain Car Climbing using A3C.ipynb",
    "chars": 25577,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Mountain car climbing using A3C\\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/README.md",
    "chars": 364,
    "preview": "# 11. Actor Critic Methods - A2C and A3C\n* 11.1. Overview of Actor Critic Method\n* 11.2. Understanding the Actor Critic "
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/10.02. Swinging Up the Pendulum using DDPG -checkpoint.ipynb",
    "chars": 19730,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DD"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/12.01. DDPG-checkpoint.ipynb",
    "chars": 6416,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Deep deterministic policy gradien"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/12.02. Swinging Up the Pendulum using DDPG -checkpoint.ipynb",
    "chars": 19118,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DD"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/12.03. Twin delayed DDPG-checkpoint.ipynb",
    "chars": 3910,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Twin delayed DDPG\\n\",\n    \"\\n\",\n "
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/Swinging up the pendulum using DDPG -checkpoint.ipynb",
    "chars": 18451,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DD"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/12.05. Swinging Up the Pendulum using DDPG .ipynb",
    "chars": 19118,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DD"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/README.md",
    "chars": 878,
    "preview": "# 12. Learning DDPG, TD3 and SAC\n* 12.1. Deep Deterministic Policy Gradient\n   * 12.1.1. An Overview of DDPG\n* 12.2. Com"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/ Implementing PPO-clipped method-checkpoint.ipynb",
    "chars": 15287,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Impleme"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/11.09. Implementing PPO-Clipped Method-checkpoint.ipynb",
    "chars": 17519,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Impleme"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/13.01. Trust Region Policy Optimization-checkpoint.ipynb",
    "chars": 5034,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Trust Region Policy Optimization\\"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/13.09. Implementing PPO-Clipped Method-checkpoint.ipynb",
    "chars": 15945,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Impleme"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/13.09. Implementing PPO-Clipped Method.ipynb",
    "chars": 15945,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Impleme"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/README.md",
    "chars": 1159,
    "preview": "# 13. TRPO, PPO and ACKTR Methods\n* 13.1 Trust Region Policy Optimization\n* 13.2. Math Essentials\n   * 13.2.1. Taylor se"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/12.03. Playing Atari games using Categorical DQN-checkpoint.ipynb",
    "chars": 20580,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categor"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/14.03. Playing Atari games using Categorical DQN-checkpoint.ipynb",
    "chars": 19658,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categor"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/Playing Atari games using Categorical DQN-checkpoint.ipynb",
    "chars": 19345,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categor"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/c51 done-Copy1-checkpoint.ipynb",
    "chars": 16792,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": "
  },
  {
    "path": "14. Distributional Reinforcement Learning/14.03. Playing Atari games using Categorical DQN.ipynb",
    "chars": 19658,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categor"
  },
  {
    "path": "14. Distributional Reinforcement Learning/README.md",
    "chars": 755,
    "preview": "# 14. Distributional Reinforcement Learning\n* 14.1. Why Distributional Reinforcement Learning?\n* 14.2. Categorical DQN\n "
  },
  {
    "path": "15. Imitation Learning and Inverse RL/.ipynb_checkpoints/13.01. Supervised Imitation Learning -checkpoint.ipynb",
    "chars": 3309,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Supervised Imitation Learning \\n\""
  },
  {
    "path": "15. Imitation Learning and Inverse RL/.ipynb_checkpoints/13.02. DAgger-checkpoint.ipynb",
    "chars": 3266,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# DAgger\\n\",\n    \"\\n\",\n    \"\\n\",\n  "
  },
  {
    "path": "15. Imitation Learning and Inverse RL/15.02. DAgger.ipynb",
    "chars": 3266,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# DAgger\\n\",\n    \"\\n\",\n    \"\\n\",\n  "
  },
  {
    "path": "15. Imitation Learning and Inverse RL/README.md",
    "chars": 582,
    "preview": "# 15. Imitation Learning and Inverse RL\n* 15.1. Supervised Imitation Learning\n* 15.2. DAgger\n   * 15.2. Understanding DA"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.01. Creating our First Agent with Baseline-checkpoint.ipynb",
    "chars": 5155,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating our first agent with ba"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.04. Playing Atari games with DQN and its variants-checkpoint.ipynb",
    "chars": 3416,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN\\n\",\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.05. Implementing DQN variants-checkpoint.ipynb",
    "chars": 4045,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n   "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.06. Lunar Lander using A2C-checkpoint.ipynb",
    "chars": 4057,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.07. Creating a custom network-checkpoint.ipynb",
    "chars": 4567,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n  "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.08. Swinging up a pendulum using DDPG-checkpoint.ipynb",
    "chars": 5977,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDP"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.01. Creating our First Agent with Stable Baseline-checkpoint.ipynb",
    "chars": 5457,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Creating our first agent with Sta"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.04. Playing Atari games with DQN and its variants-checkpoint.ipynb",
    "chars": 3420,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN\\n\",\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.05. Implementing DQN variants-checkpoint.ipynb",
    "chars": 4049,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n   "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.06. Lunar Lander using A2C-checkpoint.ipynb",
    "chars": 4056,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.07. Creating a custom network-checkpoint.ipynb",
    "chars": 4570,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n  "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.08. Swinging up a pendulum using DDPG-checkpoint.ipynb",
    "chars": 5977,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDP"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.09. Training an agent to walk using TRPO-checkpoint.ipynb",
    "chars": 5776,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training an agent to walk using T"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.10. Training cheetah bot to run using PPO-checkpoint.ipynb",
    "chars": 5308,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training cheetah bot to run using"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Creating a custom network-checkpoint.ipynb",
    "chars": 4300,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n  "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Implementing DQN variants-checkpoint.ipynb",
    "chars": 3846,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n   "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Lunar Lander using A2C-checkpoint.ipynb",
    "chars": 3857,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Playing Atari games with DQN and its variants-checkpoint.ipynb",
    "chars": 3563,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN and "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Swinging up a pendulum using DDPG-checkpoint.ipynb",
    "chars": 5143,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDP"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Training an agent to walk using TRPO-checkpoint.ipynb",
    "chars": 5780,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training an agent to walk using T"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Training cheetah bot to run using PPO-checkpoint.ipynb",
    "chars": 5297,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training cheetah bot to run using"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Untitled-checkpoint.ipynb",
    "chars": 72,
    "preview": "{\n \"cells\": [],\n \"metadata\": {},\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.04. Playing Atari games with DQN and its variants.ipynb",
    "chars": 3420,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN\\n\",\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.05. Implementing DQN variants.ipynb",
    "chars": 4049,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n   "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.06. Lunar Lander using A2C.ipynb",
    "chars": 4056,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.07. Creating a custom network.ipynb",
    "chars": 4570,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n  "
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.08. Swinging up a pendulum using DDPG.ipynb",
    "chars": 5977,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDP"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.09. Training an agent to walk using TRPO.ipynb",
    "chars": 5776,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training an agent to walk using T"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.10. Training cheetah bot to run using PPO.ipynb",
    "chars": 5308,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training cheetah bot to run using"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/README.md",
    "chars": 961,
    "preview": "\n# 16. Deep Reinforcement Learning with Stable Baselines\n\n\n* 16.1. Creating our First Agent with Baseline\n   * 16.1.1. E"
  },
  {
    "path": "17. Reinforcement Learning Frontiers/.ipynb_checkpoints/15.01. Meta Reinforcement Learning-checkpoint.ipynb",
    "chars": 2647,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Meta Reinforcement Learning \\n\",\n"
  },
  {
    "path": "17. Reinforcement Learning Frontiers/README.md",
    "chars": 461,
    "preview": "# 17. Reinforcement Learning Frontiers\n* 17.1. Meta Reinforcement Learning\n* 17.2. Model Agnostic Meta Learning\n* 17.3. "
  },
  {
    "path": "README.md",
    "chars": 15808,
    "preview": "# [Deep Reinforcement Learning With Python](https://www.amazon.com/dp/1839210680/ref=cm_sw_r_tw_dp_x_avRDFb99EVTQ)\n\n### "
  }
]

// ... and 3 more files (download for full content)

About this extraction

This page contains the full source code of the sudharsan13296/Deep-Reinforcement-Learning-With-Python GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 141 files (1.1 MB), approximately 358.4k tokens.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
