[
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.01. Basic Idea of Reinforcement Learning -checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Reinforcement Learning (RL) is one of the areas of Machine Learning (ML). Unlike\\n\",\n    \"other ML paradigms, such as supervised and unsupervised learning, RL works in a\\n\",\n    \"trial and error fashion by interacting with its environment.\\n\",\n    \"\\n\",\n    \"RL is one of the most active areas of research in artificial intelligence, and it is\\n\",\n    \"believed that RL will take us a step closer towards achieving artificial general\\n\",\n    \"intelligence. RL has evolved rapidly in the past few years with a wide variety of\\n\",\n    \"applications ranging from building a recommendation system to self-driving cars.\\n\",\n    \"The major reason for this evolution is the advent of deep reinforcement learning,\\n\",\n    \"which is a combination of deep learning and RL. With the emergence of new RL\\n\",\n    \"algorithms and libraries, RL is clearly one of the most promising areas of ML.\\n\",\n    \"\\n\",\n    \"In this chapter, we will build a strong foundation in RL by exploring several\\n\",\n    \"important and fundamental concepts involved in RL. We will learn about the following topics:\\n\",\n    \"\\n\",\n    \"* Key elements of RL\\n\",\n    \"* The basic idea of RL\\n\",\n    \"* The RL algorithm\\n\",\n    \"* How RL differs from other ML paradigms\\n\",\n    \"* Markov Decision Processes\\n\",\n    \"* Fundamental concepts of RL\\n\",\n    \"* Applications of RL\\n\",\n    \"* RL glossary\\n\",\n    \"\\n\",\n    \"We will begin the chapter by understanding the key elements of RL. 
This will help us understand the\\n\",\n    \"basic idea of RL.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Key Elements of Reinforcement Learning \\n\",\n    \"\\n\",\n    \"Let's begin by understanding some key elements of RL.\\n\",\n    \"\\n\",\n    \"## Agent \\n\",\n    \"\\n\",\n    \"An agent is a software program that learns to make intelligent decisions. We can\\n\",\n    \"say that an agent is a learner in the RL setting. For instance, a chess player can be\\n\",\n    \"considered an agent since the player learns to make the best moves (decisions) to win\\n\",\n    \"the game. Similarly, Mario in a Super Mario Bros video game can be considered an\\n\",\n    \"agent since Mario explores the game and learns to make the best moves in the game.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Environment \\n\",\n    \"The environment is the world of the agent. The agent stays within the environment.\\n\",\n    \"For instance, coming back to our chess game, a chessboard is called the environment\\n\",\n    \"since the chess player (agent) learns to play the game of chess within the chessboard\\n\",\n    \"(environment). Similarly, in Super Mario Bros, the world of Mario is called the\\n\",\n    \"environment.\\n\",\n    \"\\n\",\n    \"## State and action\\n\",\n    \"A state is a position or a moment in the environment that the agent can be in. We\\n\",\n    \"learned that the agent stays within the environment, and there can be many positions\\n\",\n    \"in the environment that the agent can stay in, and those positions are called states.\\n\",\n    \"For instance, in our chess game example, each position on the chessboard is called\\n\",\n    \"the state. The state is usually denoted by s.\\n\",\n    \"\\n\",\n    \"The agent interacts with the environment and moves from one state to another\\n\",\n    \"by performing an action. 
In the chess game environment, the action is the move\\n\",\n    \"performed by the player (agent). The action is usually denoted by a.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Reward\\n\",\n    \"\\n\",\n    \"We learned that the agent interacts with an environment by performing an action\\n\",\n    \"and moves from one state to another. Based on the action, the agent receives a\\n\",\n    \"reward. A reward is nothing but a numerical value, say, +1 for a good action and -1\\n\",\n    \"for a bad action. How do we decide if an action is good or bad?\\n\",\n    \"In our chess game example, if the agent makes a move in which it takes one of the\\n\",\n    \"opponent's chess pieces, then it is considered a good action and the agent receives\\n\",\n    \"a positive reward. Similarly, if the agent makes a move that leads to the opponent\\n\",\n    \"taking the agent's chess piece, then it is considered a bad action and the agent\\n\",\n    \"receives a negative reward. The reward is denoted by r.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"In the next section, let us explore the basic idea of reinforcement learning. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.02. Key Elements of Reinforcement Learning -checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Basic Idea of Reinforcement Learning \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Let's begin with an analogy. Let's suppose we are teaching a dog (agent) to catch a\\n\",\n    \"ball. Instead of teaching the dog explicitly to catch a ball, we just throw a ball and\\n\",\n    \"every time the dog catches the ball, we give the dog a cookie (reward). If the dog\\n\",\n    \"fails to catch the ball, then we do not give it a cookie. So, the dog will figure out\\n\",\n    \"what action caused it to receive a cookie and repeat that action. Thus, the dog will\\n\",\n    \"understand that catching the ball caused it to receive a cookie and will attempt to\\n\",\n    \"repeat catching the ball. Thus, in this way, the dog will learn to catch a ball while\\n\",\n    \"aiming to maximize the cookies it can receive.\\n\",\n    \"\\n\",\n    \"Similarly, in an RL setting, we will not teach the agent what to do or how to do it;\\n\",\n    \"instead, we will give a reward to the agent for every action it does. We will give\\n\",\n    \"a positive reward to the agent when it performs a good action and we will give a\\n\",\n    \"negative reward to the agent when it performs a bad action. The agent begins by\\n\",\n    \"performing a random action and if the action is good, we then give the agent a\\n\",\n    \"positive reward so that the agent understands it has performed a good action and it\\n\",\n    \"will repeat that action. 
If the action performed by the agent is bad, then we will give\\n\",\n    \"the agent a negative reward so that the agent will understand it has performed a bad\\n\",\n    \"action and it will not repeat that action.\\n\",\n    \"\\n\",\n    \"Thus, RL can be viewed as a trial and error learning process where the agent tries out\\n\",\n    \"different actions and learns the good action, which gives a positive reward.\\n\",\n    \"\\n\",\n    \"In the dog analogy, the dog represents the agent, and giving a cookie to the dog\\n\",\n    \"upon it catching the ball is a positive reward and not giving a cookie is a negative\\n\",\n    \"reward. So, the dog (agent) explores different actions, which are catching the ball\\n\",\n    \"and not catching the ball, and understands that catching the ball is a good action as it\\n\",\n    \"brings the dog a positive reward (getting a cookie).\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Let's further explore the idea of RL with one more simple example. Let's suppose we\\n\",\n    \"want to teach a robot (agent) to walk without hitting a mountain, as the following figure shows: \\n\",\n    \"\\n\",\n    \"![title](Images/1.png)\\n\",\n    \"\\n\",\n    \"We will not teach the robot explicitly to not go in the direction of the mountain.\\n\",\n    \"Instead, if the robot hits the mountain and gets stuck, we give the robot a negative\\n\",\n    \"reward, say -1. So, the robot will understand that hitting the mountain is the wrong\\n\",\n    \"action, and it will not repeat that action:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/2.png)\\n\",\n    \"\\n\",\n    \"Similarly, when the robot walks in the right direction without hitting the mountain,\\n\",\n    \"we give the robot a positive reward, say +1. 
So, the robot will understand that not\\n\",\n    \"hitting the mountain is a good action, and it will repeat that action:\\n\",\n    \"\\n\",\n    \"![title](Images/3.png)\\n\",\n    \"\\n\",\n    \"Thus, in the RL setting, the agent explores different actions and learns the best action\\n\",\n    \"based on the reward it gets.\\n\",\n    \"Now that we have a basic idea of how RL works, in the upcoming sections, we will\\n\",\n    \"go into more detail and also learn the important concepts involved in RL.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.03. Reinforcement Learning Algorithm-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Reinforcement Learning algorithm\\n\",\n    \"\\n\",\n    \"The steps involved in a typical RL algorithm are as follows:\\n\",\n    \"\\n\",\n    \"1. First, the agent interacts with the environment by performing an action.\\n\",\n    \"2. By performing an action, the agent moves from one state to another.\\n\",\n    \"3. Then the agent will receive a reward based on the action it performed.\\n\",\n    \"4. Based on the reward, the agent will understand whether the action is good or bad.\\n\",\n    \"5. If the action was good, that is, if the agent received a positive reward, then the agent will prefer performing that action, else the agent will try performing other actions in search of a positive reward.\\n\",\n    \"\\n\",\n    \"RL is basically a trial and error learning process. Now, let's revisit our chess game\\n\",\n    \"example. The agent (software program) is the chess player. So, the agent interacts\\n\",\n    \"with the environment (chessboard) by performing an action (moves). If the agent\\n\",\n    \"gets a positive reward for an action, then it will prefer performing that action; else it\\n\",\n    \"will find a different action that gives a positive reward.\\n\",\n    \"\\n\",\n    \"Ultimately, the goal of the agent is to maximize the reward it gets. If the agent\\n\",\n    \"receives a good reward, then it means it has performed a good action. If the agent\\n\",\n    \"performs a good action, then it implies that it can win the game. 
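The five steps above can be sketched as a minimal trial-and-error loop. The toy environment below (two actions, where action 1 happens to be the good one) is a hypothetical illustration, not an algorithm from this chapter; it only shows an agent drifting toward the action that earns positive rewards:

```python
import random

# Hypothetical toy environment (not from the chapter): action 1 yields a
# positive reward (+1), action 0 a negative reward (-1).
def step(action):
    return 1 if action == 1 else -1

random.seed(0)
preference = {0: 0, 1: 0}  # running sum of rewards received per action

for episode in range(100):
    # Step 1: the agent acts (randomly at first, then mostly greedily).
    if random.random() < 0.1 or all(v == 0 for v in preference.values()):
        action = random.choice([0, 1])                # explore
    else:
        action = max(preference, key=preference.get)  # prefer rewarded action
    # Steps 2-3: performing the action yields a reward from the environment.
    reward = step(action)
    # Steps 4-5: a positive reward reinforces the action; a negative one
    # pushes the agent to try something else next time.
    preference[action] += reward

print(preference)  # action 1 ends up with the higher preference
```

After a few episodes the greedy branch settles on action 1, mirroring how the chess agent sticks with moves that earned positive rewards.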
Thus, the agent\\n\",\n    \"learns to win the game by maximizing the reward.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.04. RL agent in the Grid World -checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# RL agent in the Grid World \\n\",\n    \"\\n\",\n    \"Let's strengthen our understanding of RL by looking at another simple example.\\n\",\n    \"Consider the following grid world environment:\\n\",\n    \"\\n\",\n    \"![title](Images/4.png)\\n\",\n    \"\\n\",\n    \"The positions A to I in the environment are called the states of the environment.\\n\",\n    \"The goal of the agent is to reach state I by starting from state A without visiting\\n\",\n    \"the shaded states (B, C, G, and H). Thus, in order to achieve the goal, whenever\\n\",\n    \"our agent visits a shaded state, we will give a negative reward (say -1) and when it\\n\",\n    \"visits an unshaded state, we will give a positive reward (say +1). The actions in the\\n\",\n    \"environment are moving up, down, right and left. The agent can perform any of these\\n\",\n    \"four actions to reach state I from state A.\\n\",\n    \"\\n\",\n    \"The first time the agent interacts with the environment (the first iteration), the agent\\n\",\n    \"is unlikely to perform the correct action in each state, and thus it receives a negative\\n\",\n    \"reward. That is, in the first iteration, the agent performs a random action in each\\n\",\n    \"state, and this may lead the agent to receive a negative reward. But over a series of\\n\",\n    \"iterations, the agent learns to perform the correct action in each state through the\\n\",\n    \"reward it obtains, helping it achieve the goal. Let us explore this in detail.\\n\",\n    \"\\n\",\n    \"## Iteration 1:\\n\",\n    \"\\n\",\n    \"As we learned, in the first iteration, the agent performs a random action in each state.\\n\",\n    \"For instance, look at the following figure. In the first iteration, the agent moves right\\n\",\n    \"from state A and reaches the new state B. 
But since B is the shaded state, the agent\\n\",\n    \"will receive a negative reward and so the agent will understand that moving right is\\n\",\n    \"not a good action in state A. When it visits state A next time, it will try out a different\\n\",\n    \"action instead of moving right:\\n\",\n    \"\\n\",\n    \"![title](Images/5.PNG)\\n\",\n    \"\\n\",\n    \"As the above figure shows, from state B, the agent moves down and reaches the new state\\n\",\n    \"E. Since E is an unshaded state, the agent will receive a positive reward, so the agent\\n\",\n    \"will understand that moving down from state B is a good action.\\n\",\n    \"\\n\",\n    \"From state E, the agent moves right and reaches state F. Since F is an unshaded state,\\n\",\n    \"the agent receives a positive reward, and it will understand that moving right from\\n\",\n    \"state E is a good action. From state F, the agent moves down and reaches the goal\\n\",\n    \"state I and receives a positive reward, so the agent will understand that moving\\n\",\n    \"down from state F is a good action.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Iteration 2:\\n\",\n    \"\\n\",\n    \"In the second iteration, from state A, instead of moving right, the agent tries out a\\n\",\n    \"different action as the agent learned in the previous iteration that moving right is not\\n\",\n    \"a good action in state A.\\n\",\n    \"\\n\",\n    \"Thus, as the following figure shows, in this iteration the agent moves down from state A and\\n\",\n    \"reaches state D. Since D is an unshaded state, the agent receives a positive reward\\n\",\n    \"and now the agent will understand that moving down is a good action in state A:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/6.PNG)\\n\",\n    \"\\n\",\n    \"As shown in the preceding figure, from state D, the agent moves down and reaches\\n\",\n    \"state G. 
But since G is a shaded state, the agent will receive a negative reward and\\n\",\n    \"so the agent will understand that moving down is not a good action in state D, and\\n\",\n    \"when it visits state D next time, it will try out a different action instead of moving\\n\",\n    \"down.\\n\",\n    \"\\n\",\n    \"From G, the agent moves right and reaches state H. Since H is a shaded state, it will\\n\",\n    \"receive a negative reward and understand that moving right is not a good action in\\n\",\n    \"state G.\\n\",\n    \"\\n\",\n    \"From H it moves right and reaches the goal state I and receives a positive reward, so\\n\",\n    \"the agent will understand that moving right from state H is a good action.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Iteration 3:\\n\",\n    \"\\n\",\n    \"In the third iteration, the agent moves down from state A since, in the second\\n\",\n    \"iteration, our agent learned that moving down is a good action in state A. So, the\\n\",\n    \"agent moves down from state A and reaches the next state, D, as the following figure shows:\\n\",\n    \"\\n\",\n    \"![title](Images/7.PNG)\\n\",\n    \"\\n\",\n    \"Now, from state D, the agent tries a different action instead of moving down since in\\n\",\n    \"the second iteration our agent learned that moving down is not a good action in state\\n\",\n    \"D. 
So, in this iteration, the agent moves right from state D and reaches state E.\\n\",\n    \"\\n\",\n    \"From state E, the agent moves right as the agent already learned in the first iteration\\n\",\n    \"that moving right from state E is a good action and reaches state F.\\n\",\n    \"\\n\",\n    \"Now, from state F, the agent moves down since the agent learned in the first iteration\\n\",\n    \"that moving down is a good action in state F, and reaches the goal state I.\\n\",\n    \"\\n\",\n    \"The following figure shows the result of the third iteration:\\n\",\n    \"![title](Images/7.PNG)\\n\",\n    \"\\n\",\n    \"As we can see, our agent has successfully learned to reach the goal state I from state\\n\",\n    \"A without visiting the shaded states based on the rewards.\\n\",\n    \"\\n\",\n    \"In this way, the agent will try out different actions in each state and understand\\n\",\n    \"whether an action is good or bad based on the reward it obtains. The goal of the\\n\",\n    \"agent is to maximize rewards. So, the agent will always try to perform good actions\\n\",\n    \"that give a positive reward, and when the agent performs good actions in each state,\\n\",\n    \"then it ultimately leads the agent to achieve the goal.\\n\",\n    \"\\n\",\n    \"Note that these iterations are called episodes in RL terminology. We will learn more\\n\",\n    \"about episodes later in the chapter.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.05. How RL differs from other ML paradigms?-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# How RL differs from other ML paradigms\\n\",\n    \"\\n\",\n    \"We can categorize ML into three types:\\n\",\n    \"* Supervised learning\\n\",\n    \"* Unsupervised learning\\n\",\n    \"* Reinforcement learning\\n\",\n    \"\\n\",\n    \"In supervised learning, the machine learns from training data. The training data\\n\",\n    \"consists of labeled pairs of inputs and outputs. So, we train the model (agent)\\n\",\n    \"using the training data in such a way that the model can generalize its learning to\\n\",\n    \"new unseen data. It is called supervised learning because the training data acts as a\\n\",\n    \"supervisor, since it has labeled pairs of inputs and outputs, and it guides the model\\n\",\n    \"in learning the given task.\\n\",\n    \"\\n\",\n    \"Now, let's understand the difference between supervised and reinforcement learning\\n\",\n    \"with an example. Consider the dog analogy we discussed earlier in the chapter. In\\n\",\n    \"supervised learning, to teach the dog to catch a ball, we will teach it explicitly by\\n\",\n    \"specifying turn left, go right, move forward seven steps, catch the ball, and so on\\n\",\n    \"in the form of training data. But in RL, we just throw a ball, and every time the dog\\n\",\n    \"catches the ball, we give it a cookie (reward). So, the dog will learn to catch the ball\\n\",\n    \"while trying to maximize the cookies (reward) it can get.\\n\",\n    \"\\n\",\n    \"Let's consider one more example. Say we want to train the model to play chess using\\n\",\n    \"supervised learning. In this case, we will have training data that includes all the\\n\",\n    \"moves a player can make in each state, along with labels indicating whether it is a\\n\",\n    \"good move or not. 
Then, we train the model to learn from this training data, whereas\\n\",\n    \"in the case of RL, our agent will not be given any sort of training data; instead, we\\n\",\n    \"just give a reward to the agent for each action it performs. Then, the agent will learn\\n\",\n    \"by interacting with the environment and, based on the reward it gets, it will choose\\n\",\n    \"its actions.\\n\",\n    \"\\n\",\n    \"Similar to supervised learning, in unsupervised learning, we train the model (agent)\\n\",\n    \"based on the training data. But in the case of unsupervised learning, the training data\\n\",\n    \"does not contain any labels; that is, it consists of only inputs and not outputs. The\\n\",\n    \"goal of unsupervised learning is to determine hidden patterns in the input. There is\\n\",\n    \"a common misconception that RL is a kind of unsupervised learning, but it is not. In\\n\",\n    \"unsupervised learning, the model learns the hidden structure, whereas, in RL, the\\n\",\n    \"model learns by maximizing the reward.\\n\",\n    \"\\n\",\n    \"For instance, consider a movie recommendation system. Say we want to recommend\\n\",\n    \"a new movie to the user. With unsupervised learning, the model (agent) will find\\n\",\n    \"movies similar to the movies the user (or users with a profile similar to the user) has\\n\",\n    \"viewed before and recommend new movies to the user.\\n\",\n    \"\\n\",\n    \"With RL, the agent constantly receives feedback from the user. 
This feedback\\n\",\n    \"represents rewards (a reward could be ratings the user has given for a movie they\\n\",\n    \"have watched, time spent watching a movie, time spent watching trailers, and so on).\\n\",\n    \"Based on the rewards, an RL agent will understand the movie preference of the user\\n\",\n    \"and then suggest new movies accordingly.\\n\",\n    \"\\n\",\n    \"Since the RL agent is learning with the aid of rewards, it can understand if the user's\\n\",\n    \"movie preference changes and suggest new movies according to the user's changed\\n\",\n    \"movie preference dynamically.\\n\",\n    \"\\n\",\n    \"Thus, we can say that in both supervised and unsupervised learning the model\\n\",\n    \"(agent) learns based on the given training dataset, whereas in RL the agent learns\\n\",\n    \"by directly interacting with the environment. Thus, RL is essentially an interaction\\n\",\n    \"between the agent and its environment.\\n\",\n    \"\\n\",\n    \"Before moving on to the fundamental concepts of RL, we will introduce a popular\\n\",\n    \"process to aid decision-making in an RL environment.\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.06. Markov Decision Processes-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Markov Decision Processes \\n\",\n    \"\\n\",\n    \"The Markov Decision Process (MDP) provides a mathematical framework for\\n\",\n    \"solving the RL problem. Almost all RL problems can be modeled as an MDP. MDPs\\n\",\n    \"are widely used for solving various optimization problems. In this section, we will\\n\",\n    \"understand what an MDP is and how it is used in RL.\\n\",\n    \"\\n\",\n    \"To understand an MDP, first, we need to learn about the Markov property and\\n\",\n    \"Markov chain.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Markov Property and Markov Chain \\n\",\n    \"\\n\",\n    \"The Markov property states that the future depends only on the present and not\\n\",\n    \"on the past. The Markov chain, also known as the Markov process, consists of a\\n\",\n    \"sequence of states that strictly obey the Markov property; that is, the Markov chain\\n\",\n    \"is the probabilistic model that solely depends on the current state to predict the next\\n\",\n    \"state and not the previous states, that is, the future is conditionally independent of\\n\",\n    \"the past.\\n\",\n    \"\\n\",\n    \"For example, if we want to predict the weather and we know that the current state is\\n\",\n    \"cloudy, we can predict that the next state could be rainy. We concluded that the next\\n\",\n    \"state is likely to be rainy only by considering the current state (cloudy) and not the\\n\",\n    \"previous states, which might have been sunny, windy, and so on.\\n\",\n    \"However, the Markov property does not hold for all processes. 
For instance, when\\n\",\n    \"drawing cards from a deck without replacement, the probability of the next card\\n\",\n    \"depends on all the cards drawn so far, not just on the current card.\\n\",\n    \"\\n\",\n    \"Moving from one state to another is called a transition, and its probability is called\\n\",\n    \"a transition probability. We denote the transition probability by $P(s'|s) $. It indicates\\n\",\n    \"the probability of moving from the state $s$ to the next state $s'$.\\n\",\n    \"\\n\",\n    \"Say we have three states (cloudy, rainy, and windy) in our Markov chain. Then we can represent the\\n\",\n    \"probability of transitioning from one state to another using a table called a Markov\\n\",\n    \"table, as shown in the following table:\\n\",\n    \"\\n\",\n    \"![title](Images/8.PNG)\\n\",\n    \"\\n\",\n    \"From the above table, we can observe that:\\n\",\n    \"\\n\",\n    \"* From the state cloudy, we transition to the state rainy with 70% probability and to the state windy with 30% probability.\\n\",\n    \"\\n\",\n    \"* From the state rainy, we transition to the same state rainy with 80% probability and to the state cloudy with 20% probability.\\n\",\n    \"\\n\",\n    \"* From the state windy, we transition to the state rainy with 100% probability.\\n\",\n    \"\\n\",\n    \"We can also represent this transition information of the Markov chain in the form of\\n\",\n    \"a state diagram, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/9.png)\\n\",\n    \"We can also formulate the transition probabilities into a matrix called the transition\\n\",\n    \"matrix, as shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/10.PNG)\\n\",\n    \"\\n\",\n    \"Thus, to conclude, we can say that the Markov chain or Markov process consists of a\\n\",\n    \"set of states along with their transition probabilities.\\n\",\n    \"\\n\",\n    \"## Markov Reward Process\\n\",\n    \"\\n\",\n    \"The Markov Reward Process (MRP) is an extension of the 
Markov chain with the\\n\",\n    \"reward function. That is, we learned that the Markov chain consists of states and a\\n\",\n    \"transition probability. The MRP consists of states, a transition probability, and also a\\n\",\n    \"reward function.\\n\",\n    \"\\n\",\n    \"A reward function tells us the reward we obtain in each state. For instance, based on\\n\",\n    \"our previous weather example, the reward function tells us the reward we obtain\\n\",\n    \"in the state cloudy, the reward we obtain in the state windy, and so on. The reward\\n\",\n    \"function is usually denoted by $R(s)$.\\n\",\n    \"\\n\",\n    \"Thus, the MRP consists of states $s$, a transition probability $P(s'|s)$, and a reward\\n\",\n    \"function $R(s)$. \\n\",\n    \"\\n\",\n    \"## Markov Decision Process\\n\",\n    \"\\n\",\n    \"The Markov Decision Process (MDP) is an extension of the MRP with actions. That\\n\",\n    \"is, we learned that the MRP consists of states, a transition probability, and a reward\\n\",\n    \"function. The MDP consists of states, a transition probability, a reward function,\\n\",\n    \"and also actions. We learned that the Markov property states that the next state is\\n\",\n    \"dependent only on the current state and not on the previous states. Is the\\n\",\n    \"Markov property applicable to the RL setting? Yes! In the RL environment, the agent\\n\",\n    \"makes decisions only based on the current state and not based on the past states. So,\\n\",\n    \"we can model an RL environment as an MDP.\\n\",\n    \"\\n\",\n    \"Let's understand this with an example. Given any environment, we can formulate\\n\",\n    \"the environment using an MDP. For instance, let's consider the same grid world\\n\",\n    \"environment we learned earlier. 
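As an aside before returning to the grid world, the weather Markov chain described above can be simulated directly from its transition probabilities. This is an illustrative sketch (the chapter itself presents only the table and state diagram); the probabilities are the ones from the Markov table:

```python
import random

# Transition probabilities taken from the Markov table in the text:
# each row maps a current state to a distribution over next states.
P = {
    "cloudy": {"rainy": 0.7, "windy": 0.3},
    "rainy":  {"rainy": 0.8, "cloudy": 0.2},
    "windy":  {"rainy": 1.0},
}

def next_state(state):
    # The next state depends only on the current state (Markov property).
    states, probs = zip(*P[state].items())
    return random.choices(states, weights=probs)[0]

random.seed(0)
chain = ["cloudy"]
for _ in range(10):
    chain.append(next_state(chain[-1]))
print(chain)  # one sampled trajectory through the weather states
```

Note that `next_state` never looks at earlier entries of `chain`; that is exactly the conditional independence the Markov property asserts.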
The following figure shows the grid world environment,\\n\",\n    \"and the goal of the agent is to reach state I from state A without visiting the shaded\\n\",\n    \"states:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/11.png)\\n\",\n    \"\\n\",\n    \"An agent makes a decision (action) in the environment only based on the current\\n\",\n    \"state the agent is in and not based on the past state. So, we can formulate our\\n\",\n    \"environment as an MDP. We learned that the MDP consists of states, actions,\\n\",\n    \"transition probabilities, and a reward function. Now, let's learn how this relates to\\n\",\n    \"our RL environment:\\n\",\n    \"\\n\",\n    \"__States__ – A set of states present in the environment. Thus, in the grid world\\n\",\n    \"environment, we have states A to I.\\n\",\n    \"\\n\",\n    \"__Actions__ – A set of actions that our agent can perform in each state. An agent\\n\",\n    \"performs an action and moves from one state to another. Thus, in the grid world\\n\",\n    \"environment, the set of actions is up, down, left, and right.\\n\",\n    \"\\n\",\n    \"__Transition probability__ – The transition probability is denoted by $P(s'|s,a)$. It\\n\",\n    \"implies the probability of moving from a state $s$ to the next state $s'$ while performing\\n\",\n    \"an action $a$. If you observe, in the MRP, the transition probability is just $P(s'|s)$; that\\n\",\n    \"is, the probability of going from state $s$ to state $s'$, and it doesn't include actions. But in the MDP we include actions, thus the transition probability is denoted by $P(s'|s,a)$. \\n\",\n    \"\\n\",\n    \"For example, in our grid world environment, say, the transition probability of moving from state A to state B while performing an action right is 100%, then it can be expressed as: $P(B|A, \\\\text{right}) = 1.0$. 
We can also view this in the state diagram as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/12.png)\\n\",\n    \"\\n\",\n    \"Suppose, our agent is in state C and the transition probability of moving from state C to the state F while performing an action down is 90% then it can be expressed as: $P( F |C , \\\\text{down}) = 0.9 $. We can also view this in the state diagram as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/13.png)\\n\",\n    \"\\n\",\n    \"__Reward function__ -  The reward function is denoted by $R(s,a,s') $. It implies the reward our agent obtains while transitioning from a state $s$ to the state $s'$ while performing an action $a$. \\n\",\n    \"\\n\",\n    \"Say, the reward we obtain while transitioning from the state A to the state B while performing an action right is -1, then it can be expressed as $R(A, \\\\text{right}, B) = -1 $. We can also view this in the state diagram as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/14.png)\\n\",\n    \"\\n\",\n    \"Suppose, our agent is in state C and say, the reward we obtain while transitioning from the state C to the state F while performing an action down is  +1, then it can be expressed as $R(C, \\\\text{down}, F) = +1 $. We can also view this in the state diagram as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/15.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, an RL environment can be represented as an MDP with states, actions,\\n\",\n    \"transition probability, and the reward function. But wait! What is the use of\\n\",\n    \"representing the RL environment using the MDP? We can solve the RL problem easily\\n\",\n    \"once we model our environment as the MDP. For instance, once we model our grid\\n\",\n    \"world environment using the MDP, then we can easily find how to reach the goal\\n\",\n    \"state I from state A without visiting the shaded states. 
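\n",\n    "\n",\n    "Putting these components together, an MDP is commonly summarized as the tuple:\n",\n    "\n",\n    "$$(S, A, P(s'|s,a), R(s,a,s'))$$\n",\n    "\n",\n    "where $S$ is the set of states, $A$ is the set of actions, $P$ is the transition probability, and $R$ is the reward function. 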
We will learn more about this\\n\",\n    \"in the upcoming chapters. Next, we will go through more essential concepts of RL.\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/.ipynb_checkpoints/1.07. Action space, Policy, Episode and Horizon-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Action space, Policy, Episode, Horizon\\n\",\n    \"\\n\",\n    \"In this section, we will learn about the several important fundamental concepts that are involved in reinforcement learning. \\n\",\n    \"\\n\",\n    \"## Action space\\n\",\n    \"Consider the grid world environment shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/16.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"In the preceding grid world environment, the goal of the agent is to reach state I\\n\",\n    \"starting from state A without visiting the shaded states. In each of the states, the\\n\",\n    \"agent can perform any of the four actions—up, down, left, and right—to achieve the\\n\",\n    \"goal. The set of all possible actions in the environment is called the action space.\\n\",\n    \"Thus, for this grid world environment, the action space will be [up, down, left, right].\\n\",\n    \"We can categorize action spaces into two types:\\n\",\n    \"\\n\",\n    \"* Discrete action space \\n\",\n    \"* Continuous action space\\n\",\n    \"\\n\",\n    \"__Discrete action space__ -When our action space consists of actions that are discrete,\\n\",\n    \"then it is called a discrete action space. For instance, in the grid world environment,\\n\",\n    \"our action space consists of four discrete actions, which are up, down, left, right, and\\n\",\n    \"so it is called a discrete action space.\\n\",\n    \"\\n\",\n    \"__Continuous action space__ - When our action space consists of actions that are\\n\",\n    \"continuous, then it is called a continuous action space. For instance, let's suppose\\n\",\n    \"we are training an agent to drive a car, then our action space will consist of several\\n\",\n    \"actions that have continuous values, such as the speed at which we need to drive the\\n\",\n    \"car, the number of degrees we need to rotate the wheel, and so on. 
\n",\n    "\n",\n    "## Policy\n",\n    "\n",\n    "A policy defines the agent's behavior in an environment. The policy tells the agent\n",\n    "what action to perform in each state. For instance, in the grid world environment, we\n",\n    "have states A to I and four possible actions. The policy may tell the agent to move\n",\n    "down in state A, move right in state D, and so on.\n",\n    "\n",\n    "When the agent interacts with the environment for the first time, we initialize a random\n",\n    "policy, that is, a policy that tells the agent to perform a random action in each state. Thus,\n",\n    "in the initial iteration, the agent performs a random action in each state and tries to\n",\n    "learn whether the action is good or bad based on the reward it obtains. Over a series\n",\n    "of iterations, the agent will learn to perform the good actions in each state, which give a\n",\n    "positive reward. Thus, we can say that over a series of iterations, the agent will learn\n",\n    "a good policy that gives a positive reward.\n",\n    "\n",\n    "The optimal policy is shown in the following figure. 
As we can observe, the agent selects the\n",\n    "action in each state based on the optimal policy and reaches the terminal state I from\n",\n    "the starting state A without visiting the shaded states:\n",\n    "\n",\n    "![title](Images/17.png)\n",\n    "\n",\n    "Thus, the optimal policy tells the agent to perform the correct action in each state so\n",\n    "that the agent can receive a good reward.\n",\n    "\n",\n    "A policy can be classified into two types:\n",\n    "\n",\n    "* Deterministic Policy\n",\n    "* Stochastic Policy\n",\n    "\n",\n    "### Deterministic Policy\n",\n    "The policy we just learned above is called a deterministic policy. That is, a deterministic policy tells the agent to perform one particular action in a state. Thus, the deterministic policy maps the state to one particular action and is often denoted by $\\\mu$. Given a state $s$ at a time $t$, a deterministic policy tells the agent to perform one particular action $a$. It can be expressed as:\n",\n    "\n",\n    "$$a_t = \\\mu(s_t) $$\n",\n    "\n",\n    "For instance, consider our grid world example: given the state A, the deterministic policy $\\\mu$ tells the agent to perform the action down, and it can be expressed as:\n",\n    "\n",\n    "$$\\\mu (A) = \\\text{Down} $$\n",\n    "\n",\n    "Thus, according to the deterministic policy, whenever the agent visits state A, it performs the action down. \n",\n    "\n",\n    "### Stochastic Policy\n",\n    "\n",\n    "Unlike a deterministic policy, a stochastic policy does not map the state directly to one particular action; instead, it maps the state to a probability distribution over the action space. 
\\n\",\n    \"\\n\",\n    \"That is, we learned that given a state, the deterministic policy will tell the agent to perform one particular action in the given state, so, whenever the agent visits the state it always performs the same particular action. But with stochastic policy, given a state, the stochastic policy will return a probability distribution over an action space so instead of performing the same action every time the agent visits the state, the agent performs different actions each time based on a probability distribution returned by the stochastic policy. \\n\",\n    \"\\n\",\n    \"Let's understand this with an example, we know that our grid world environment's action space consists of 4 actions which are [up, down, left, right]. Given a state A, the stochastic policy returns the probability distribution over the action space as [0.10,0.70,0.10,0.10]. Now, whenever the agent visits the state A, instead of selecting the same particular action every time, the agent selects the action up 10% of the time, action down 70% of the time, action left 10% of time and action right 10% of the time. \\n\",\n    \"\\n\",\n    \"The difference between the deterministic policy and stochastic policy is shown below, as we can observe the deterministic policy maps the state to one particular action whereas the stochastic policy maps the state to the probability distribution over an action space:\\n\",\n    \"\\n\",\n    \"![title](Images/18.png)\\n\",\n    \"\\n\",\n    \"Thus, stochastic policy maps the state to a probability distribution over action space and it is often denoted by $\\\\pi$.  Say, we have a state $s$ and action $a$ at a time $t$, then we can express the stochastic policy as:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"$$a_t \\\\sim \\\\pi(s_t) $$\\n\",\n    \"\\n\",\n    \"Or it can also be expressed as $\\\\pi(a_t |s_t) $. 
\\n\",\n    \"\\n\",\n    \"We can categorize the stochastic policy into two:\\n\",\n    \"\\n\",\n    \"* Categorical policy\\n\",\n    \"* Gaussian policy\\n\",\n    \"\\n\",\n    \"### Categorical policy \\n\",\n    \"A stochastic policy is called a categorical policy when the action space is discrete. That is, the stochastic policy uses categorical probability distribution over action space to select actions when the action space is discrete. For instance, in the grid world environment, we have just seen above, we select actions based on categorical probability distribution (discrete distribution) as the action space of the environment is discrete. As shown below, given a state A, we select an action based on the categorical probability distribution over the action space:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/19.png)\\n\",\n    \"### Gaussian policy \\n\",\n    \"A stochastic policy is called a gaussian policy when our action space is continuous. That is, the stochastic policy uses Gaussian probability distribution over action space to select actions when the action space is continuous. Let's understand this with a small example. Suppose we training an agent to drive a car and say we have one continuous action in our action space. Let the action be the speed of the car and the value of the speed of the car ranges from 0 to 150 kmph. Then, the stochastic policy uses the Gaussian distribution over the action space to select action as shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/20.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"We will learn more about the gaussian policy in the upcoming chapters.\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Episode \\n\",\n    \"\\n\",\n    \"The agent interacts with the environment by performing some action starting from the initial state and reach the final state. 
This agent-environment interaction, starting from the initial state until the final state, is called an episode. For instance, in a car racing video game, the agent plays the game by starting from the initial state (the starting point of the race) and reaching the final state (the endpoint of the race). This is considered an episode. An episode is also often called a trajectory (the path taken by the agent), and it is denoted by $\\\tau$. \n",\n    "\n",\n    "An agent can play the game for any number of episodes, and each episode is independent of the others. What is the use of playing the game for multiple episodes? In order to learn the optimal policy, that is, the policy that tells the agent to perform the correct action in each state, the agent plays the game for many episodes. \n",\n    "\n",\n    "For example, let's say we are playing a car racing game. The first time, we may not win the game, so we play the game several times to understand more about the game and discover some good strategies for winning it. Similarly, in the first episode, the agent may not win the game, and it plays the game for several episodes to understand more about the game environment and good strategies to win the game. \n",\n    "\n",\n    "\n",\n    "\n",\n    "Say we begin the game from an initial state at a time step t=0 and reach the final state at a time step T; then the episode information consists of the agent-environment interaction, such as states, actions, and rewards, starting from the initial state until the final state, that is, $(s_0, a_0,r_0,s_1,a_1,r_1,\\\dots,s_T) $\n",\n    "\n",\n    "An episode (trajectory) is shown below:\n",\n    "\n",\n    "![title](Images/21.png)\n",\n    "\n",\n    "\n",\n    "Let's strengthen our understanding of episodes and the optimal policy with the grid world environment. 
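\n",\n    "\n",\n    "For instance, one possible trajectory in the grid world, assuming the agent follows the unshaded path A, D, E, F, I and receives +1 in each unshaded state it visits, can be written as:\n",\n    "\n",\n    "$$\\\tau = (A, \\\text{down}, +1, D, \\\text{right}, +1, E, \\\text{right}, +1, F, \\\text{down}, +1, I)$$\n",\n    "\n",\n    "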
We learned that in the grid world environment, the goal of our agent is to reach the final state I starting from the initial state A without visiting the shaded states. The agent receives a +1 reward when it visits an unshaded state and a -1 reward when it visits a shaded state.\n",\n    "\n",\n    "When we say generate an episode, it means going from the initial state to the final state. The agent generates the first episode using a random policy and explores the environment; over several episodes, it will learn the optimal policy. \n",\n    "\n",\n    "### Episode 1:\n",\n    "\n",\n    "As shown below, in the first episode, the agent uses a random policy and selects a random action in each state, starting from the initial state until the final state, and observes the reward:\n",\n    "\n",\n    "\n",\n    "![title](Images/22.png)\n",\n    "\n",\n    "\n",\n    "### Episode 2:\n",\n    "\n",\n    "In the second episode, the agent tries a different policy to avoid the negative rewards it received in the previous episode. 
For instance, as we can observe in the previous episode, the agent selected the action right in state A and received a negative reward, so in this episode, instead of selecting the action right in state A, it tries a different action, say down, as shown below:\n",\n    "\n",\n    "\n",\n    "![title](Images/23.png)\n",\n    "\n",\n    "### Episode n:\n",\n    "\n",\n    "Thus, over a series of episodes, the agent learns the optimal policy, that is, the policy that takes the agent to the final state I from state A without visiting the shaded states, as shown below:\n",\n    "\n",\n    "\n",\n    "![title](Images/24.png)\n",\n    "\n",\n    "# Episodic and Continuous tasks \n",\n    "A reinforcement learning task can be categorized into two types:\n",\n    "* Episodic task\n",\n    "* Continuous task\n",\n    "\n",\n    "__Episodic task__ - As the name suggests, an episodic task is one that has a terminal state. That is, episodic tasks are basically tasks made up of episodes, and thus they have a terminal state. Example: a car racing game. \n",\n    "\n",\n    "__Continuous task__ - Unlike episodic tasks, continuous tasks do not contain any episodes, and so they don't have a terminal state. For example, a personal assistance robot does not have a terminal state. \n",\n    "\n",\n    "\n",\n    "# Horizon\n",\n    "The horizon is the time step until which the agent interacts with the environment. We can classify the horizon into two types:\n",\n    "\n",\n    "* Finite horizon\n",\n    "* Infinite horizon\n",\n    "\n",\n    "__Finite horizon__ - If the agent-environment interaction stops at a particular time step, then it is called a finite horizon. For instance, in episodic tasks, the agent interacts with the environment starting from the initial state at time step t=0 and reaches the final state at a time step T. 
Since the agent-environment interaction stops at the time step T, it is considered a finite horizon. \n",\n    "\n",\n    "__Infinite horizon__ - If the agent-environment interaction never stops, then it is called an infinite horizon. For instance, we learned that a continuous task does not have any terminal states, so the agent-environment interaction will never stop in a continuous task, and so it is considered an infinite horizon. \n",\n    "\n"\n   ]\n  }\n ],\n "metadata": {\n  "kernelspec": {\n   "display_name": "Python 3",\n   "language": "python",\n   "name": "python3"\n  },\n  "language_info": {\n   "codemirror_mode": {\n    "name": "ipython",\n    "version": 3\n   },\n   "file_extension": ".py",\n   "mimetype": "text/x-python",\n   "name": "python",\n   "nbconvert_exporter": "python",\n   "pygments_lexer": "ipython3",\n   "version": "3.6.9"\n  }\n },\n "nbformat": 4,\n "nbformat_minor": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.01. Key Elements of Reinforcement Learning .ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Reinforcement Learning (RL) is one of the areas of Machine Learning (ML). Unlike\\n\",\n    \"other ML paradigms, such as supervised and unsupervised learning, RL works in a\\n\",\n    \"trial and error fashion by interacting with its environment.\\n\",\n    \"\\n\",\n    \"RL is one of the most active areas of research in artificial intelligence, and it is\\n\",\n    \"believed that RL will take us a step closer towards achieving artificial general\\n\",\n    \"intelligence. RL has evolved rapidly in the past few years with a wide variety of\\n\",\n    \"applications ranging from building a recommendation system to self-driving cars.\\n\",\n    \"The major reason for this evolution is the advent of deep reinforcement learning,\\n\",\n    \"which is a combination of deep learning and RL. With the emergence of new RL\\n\",\n    \"algorithms and libraries, RL is clearly one of the most promising areas of ML.\\n\",\n    \"\\n\",\n    \"In this chapter, we will build a strong foundation in RL by exploring several\\n\",\n    \"important and fundamental concepts involved in RL. In this chapter, we will learn about the following topics:\\n\",\n    \"\\n\",\n    \"* Key elements of RL\\n\",\n    \"* The basic idea of RL\\n\",\n    \"* The RL algorithm\\n\",\n    \"* How RL differs from other ML paradigms\\n\",\n    \"* The Markov Decision Processes\\n\",\n    \"* Fundamental concepts of RL\\n\",\n    \"* Applications of RL\\n\",\n    \"* RL glossary\\n\",\n    \"\\n\",\n    \"We will begin the chapter by understanding Key elements of RL. 
This will help us understand the\\n\",\n    \"basic idea of RL.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Key Elements of Reinforcement Learning \\n\",\n    \"\\n\",\n    \"Let's begin by understanding some key elements of RL.\\n\",\n    \"\\n\",\n    \"## Agent \\n\",\n    \"\\n\",\n    \"An agent is a software program that learns to make intelligent decisions. We can\\n\",\n    \"say that an agent is a learner in the RL setting. For instance, a chess player can be\\n\",\n    \"considered an agent since the player learns to make the best moves (decisions) to win\\n\",\n    \"the game. Similarly, Mario in a Super Mario Bros video game can be considered an\\n\",\n    \"agent since Mario explores the game and learns to make the best moves in the game.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Environment \\n\",\n    \"The environment is the world of the agent. The agent stays within the environment.\\n\",\n    \"For instance, coming back to our chess game, a chessboard is called the environment\\n\",\n    \"since the chess player (agent) learns to play the game of chess within the chessboard\\n\",\n    \"(environment). Similarly, in Super Mario Bros, the world of Mario is called the\\n\",\n    \"environment.\\n\",\n    \"\\n\",\n    \"## State and action\\n\",\n    \"A state is a position or a moment in the environment that the agent can be in. We\\n\",\n    \"learned that the agent stays within the environment, and there can be many positions\\n\",\n    \"in the environment that the agent can stay in, and those positions are called states.\\n\",\n    \"For instance, in our chess game example, each position on the chessboard is called\\n\",\n    \"the state. The state is usually denoted by s.\\n\",\n    \"\\n\",\n    \"The agent interacts with the environment and moves from one state to another\\n\",\n    \"by performing an action. 
In the chess game environment, the action is the move\n",\n    "performed by the player (agent). The action is usually denoted by a.\n",\n    "\n",\n    "\n",\n    "## Reward\n",\n    "\n",\n    "We learned that the agent interacts with an environment by performing an action\n",\n    "and moves from one state to another. Based on the action, the agent receives a\n",\n    "reward. A reward is nothing but a numerical value, say, +1 for a good action and -1\n",\n    "for a bad action. How do we decide if an action is good or bad?\n",\n    "In our chess game example, if the agent makes a move in which it takes one of the\n",\n    "opponent's chess pieces, then it is considered a good action and the agent receives\n",\n    "a positive reward. Similarly, if the agent makes a move that leads to the opponent\n",\n    "taking the agent's chess piece, then it is considered a bad action and the agent\n",\n    "receives a negative reward. The reward is denoted by r.\n",\n    "\n",\n    "\n",\n    "In the next section, let us explore the basic idea of reinforcement learning. "\n   ]\n  }\n ],\n "metadata": {\n  "kernelspec": {\n   "display_name": "Python 3",\n   "language": "python",\n   "name": "python3"\n  },\n  "language_info": {\n   "codemirror_mode": {\n    "name": "ipython",\n    "version": 3\n   },\n   "file_extension": ".py",\n   "mimetype": "text/x-python",\n   "name": "python",\n   "nbconvert_exporter": "python",\n   "pygments_lexer": "ipython3",\n   "version": "3.6.9"\n  }\n },\n "nbformat": 4,\n "nbformat_minor": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.02. Basic Idea of Reinforcement Learning.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Basic Idea of Reinforcement Learning \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Let's begin with an analogy. Let's suppose we are teaching a dog (agent) to catch a\\n\",\n    \"ball. Instead of teaching the dog explicitly to catch a ball, we just throw a ball and\\n\",\n    \"every time the dog catches the ball, we give the dog a cookie (reward). If the dog\\n\",\n    \"fails to catch the ball, then we do not give it a cookie. So, the dog will figure out\\n\",\n    \"what action caused it to receive a cookie and repeat that action. Thus, the dog will\\n\",\n    \"understand that catching the ball caused it to receive a cookie and will attempt to\\n\",\n    \"repeat catching the ball. Thus, in this way, the dog will learn to catch a ball while\\n\",\n    \"aiming to maximize the cookies it can receive.\\n\",\n    \"\\n\",\n    \"Similarly, in an RL setting, we will not teach the agent what to do or how to do it;\\n\",\n    \"instead, we will give a reward to the agent for every action it does. We will give\\n\",\n    \"a positive reward to the agent when it performs a good action and we will give a\\n\",\n    \"negative reward to the agent when it performs a bad action. The agent begins by\\n\",\n    \"performing a random action and if the action is good, we then give the agent a\\n\",\n    \"positive reward so that the agent understands it has performed a good action and it\\n\",\n    \"will repeat that action. 
If the action performed by the agent is bad, then we will give\\n\",\n    \"the agent a negative reward so that the agent will understand it has performed a bad\\n\",\n    \"action and it will not repeat that action.\\n\",\n    \"\\n\",\n    \"Thus, RL can be viewed as a trial and error learning process where the agent tries out\\n\",\n    \"different actions and learns the good action, which gives a positive reward.\\n\",\n    \"\\n\",\n    \"In the dog analogy, the dog represents the agent, and giving a cookie to the dog\\n\",\n    \"upon it catching the ball is a positive reward and not giving a cookie is a negative\\n\",\n    \"reward. So, the dog (agent) explores different actions, which are catching the ball\\n\",\n    \"and not catching the ball, and understands that catching the ball is a good action as it\\n\",\n    \"brings the dog a positive reward (getting a cookie).\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Let's further explore the idea of RL with one more simple example. Let's suppose we\\n\",\n    \"want to teach a robot (agent) to walk without hitting a mountain, as the following figure shows: \\n\",\n    \"\\n\",\n    \"![title](Images/1.png)\\n\",\n    \"\\n\",\n    \"We will not teach the robot explicitly to not go in the direction of the mountain.\\n\",\n    \"Instead, if the robot hits the mountain and gets stuck, we give the robot a negative\\n\",\n    \"reward, say -1. So, the robot will understand that hitting the mountain is the wrong\\n\",\n    \"action, and it will not repeat that action:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/2.png)\\n\",\n    \"\\n\",\n    \"Similarly, when the robot walks in the right direction without hitting the mountain,\\n\",\n    \"we give the robot a positive reward, say +1. 
So, the robot will understand that not\\n\",\n    \"hitting the mountain is a good action, and it will repeat that action:\\n\",\n    \"\\n\",\n    \"![title](Images/3.png)\\n\",\n    \"\\n\",\n    \"Thus, in the RL setting, the agent explores different actions and learns the best action\\n\",\n    \"based on the reward it gets.\\n\",\n    \"Now that we have a basic idea of how RL works, in the upcoming sections, we will\\n\",\n    \"go into more detail and also learn the important concepts involved in RL.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.03. Reinforcement Learning Algorithm.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Reinforcement Learning algorithm\\n\",\n    \"\\n\",\n    \"The steps involved in a typical RL algorithm are as follows:\\n\",\n    \"\\n\",\n    \"1. First, the agent interacts with the environment by performing an action.\\n\",\n    \"2. By performing an action, the agent moves from one state to another.\\n\",\n    \"3. Then the agent will receive a reward based on the action it performed.\\n\",\n    \"4. Based on the reward, the agent will understand whether the action is good or bad.\\n\",\n    \"5. If the action was good, that is, if the agent received a positive reward, then the agent will prefer performing that action, else the agent will try performing other actions in search of a positive reward.\\n\",\n    \"\\n\",\n    \"RL is basically a trial and error learning process. Now, let's revisit our chess game\\n\",\n    \"example. The agent (software program) is the chess player. So, the agent interacts\\n\",\n    \"with the environment (chessboard) by performing an action (moves). If the agent\\n\",\n    \"gets a positive reward for an action, then it will prefer performing that action; else it\\n\",\n    \"will find a different action that gives a positive reward.\\n\",\n    \"\\n\",\n    \"Ultimately, the goal of the agent is to maximize the reward it gets. If the agent\\n\",\n    \"receives a good reward, then it means it has performed a good action. If the agent\\n\",\n    \"performs a good action, then it implies that it can win the game. 
Thus, the agent\\n\",\n    \"learns to win the game by maximizing the reward.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.04. RL agent in the Grid World .ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# RL agent in the Grid World \\n\",\n    \"\\n\",\n    \"Let's strengthen our understanding of RL by looking at another simple example.\\n\",\n    \"Consider the following grid world environment:\\n\",\n    \"\\n\",\n    \"![title](Images/4.png)\\n\",\n    \"\\n\",\n    \"The positions A to I in the environment are called the states of the environment.\\n\",\n    \"The goal of the agent is to reach state I by starting from state A without visiting\\n\",\n    \"the shaded states (B, C, G, and H). Thus, in order to achieve the goal, whenever\\n\",\n    \"our agent visits a shaded state, we will give a negative reward (say -1) and when it\\n\",\n    \"visits an unshaded state, we will give a positive reward (say +1). The actions in the\\n\",\n    \"environment are moving up, down, right and left. The agent can perform any of these\\n\",\n    \"four actions to reach state I from state A.\\n\",\n    \"\\n\",\n    \"The first time the agent interacts with the environment (the first iteration), the agent\\n\",\n    \"is unlikely to perform the correct action in each state, and thus it receives a negative\\n\",\n    \"reward. That is, in the first iteration, the agent performs a random action in each\\n\",\n    \"state, and this may lead the agent to receive a negative reward. But over a series of\\n\",\n    \"iterations, the agent learns to perform the correct action in each state through the\\n\",\n    \"reward it obtains, helping it achieve the goal. Let us explore this in detail.\\n\",\n    \"\\n\",\n    \"## Iteration 1:\\n\",\n    \"\\n\",\n    \"As we learned, in the first iteration, the agent performs a random action in each state.\\n\",\n    \"For instance, look at the following figure. In the first iteration, the agent moves right\\n\",\n    \"from state A and reaches the new state B. 
But since B is a shaded state, the agent\n",\n    "will receive a negative reward, and so the agent will understand that moving right is\n",\n    "not a good action in state A. When it visits state A next time, it will try out a different\n",\n    "action instead of moving right:\n",\n    "\n",\n    "![title](Images/5.PNG)\n",\n    "\n",\n    "As the above figure shows, from state B, the agent moves down and reaches the new state\n",\n    "E. Since E is an unshaded state, the agent will receive a positive reward, so the agent\n",\n    "will understand that moving down from state B is a good action.\n",\n    "\n",\n    "From state E, the agent moves right and reaches state F. Since F is an unshaded state,\n",\n    "the agent receives a positive reward, and it will understand that moving right from\n",\n    "state E is a good action. From state F, the agent moves down and reaches the goal\n",\n    "state I and receives a positive reward, so the agent will understand that moving\n",\n    "down from state F is a good action.\n",\n    "\n",\n    "\n",\n    "## Iteration 2:\n",\n    "\n",\n    "In the second iteration, from state A, instead of moving right, the agent tries out a\n",\n    "different action, as the agent learned in the previous iteration that moving right is not\n",\n    "a good action in state A.\n",\n    "\n",\n    "Thus, as the following figure shows, in this iteration the agent moves down from state A and\n",\n    "reaches state D. Since D is an unshaded state, the agent receives a positive reward,\n",\n    "and now the agent will understand that moving down is a good action in state A:\n",\n    "\n",\n    "\n",\n    "![title](Images/6.PNG)\n",\n    "\n",\n    "As shown in the preceding figure, from state D, the agent moves down and reaches\n",\n    "state G. 
But since G is a shaded state, the agent will receive a negative reward and\\n\",\n    \"so the agent will understand that moving down is not a good action in state D, and\\n\",\n    \"when it visits state D next time, it will try out a different action instead of moving\\n\",\n    \"down.\\n\",\n    \"\\n\",\n    \"From G, the agent moves right and reaches state H. Since H is a shaded state, it will\\n\",\n    \"receive a negative reward and understand that moving right is not a good action in\\n\",\n    \"state G.\\n\",\n    \"\\n\",\n    \"From H it moves right and reaches the goal state I and receives a positive reward, so\\n\",\n    \"the agent will understand that moving right from state H is a good action.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Iteration 3:\\n\",\n    \"\\n\",\n    \"In the third iteration, the agent moves down from state A since, in the second\\n\",\n    \"iteration, our agent learned that moving down is a good action in state A. So, the\\n\",\n    \"agent moves down from state A and reaches the next state, D, as the following figure shows:\\n\",\n    \"\\n\",\n    \"![title](Images/7.PNG)\\n\",\n    \"\\n\",\n    \"Now, from state D, the agent tries a different action instead of moving down since in\\n\",\n    \"the second iteration our agent learned that moving down is not a good action in state\\n\",\n    \"D. 
So, in this iteration, the agent moves right from state D and reaches state E.\\n\",\n    \"\\n\",\n    \"From state E, the agent moves right as the agent already learned in the first iteration\\n\",\n    \"that moving right from state E is a good action and reaches state F.\\n\",\n    \"\\n\",\n    \"Now, from state F, the agent moves down since the agent learned in the first iteration\\n\",\n    \"that moving down is a good action in state F, and reaches the goal state I.\\n\",\n    \"\\n\",\n    \"The following figure shows the result of the third iteration:\\n\",\n    \"![title](Images/7.PNG)\\n\",\n    \"\\n\",\n    \"As we can see, our agent has successfully learned to reach the goal state I from state\\n\",\n    \"A without visiting the shaded states based on the rewards.\\n\",\n    \"\\n\",\n    \"In this way, the agent will try out different actions in each state and understand\\n\",\n    \"whether an action is good or bad based on the reward it obtains. The goal of the\\n\",\n    \"agent is to maximize rewards. So, the agent will always try to perform good actions\\n\",\n    \"that give a positive reward, and when the agent performs good actions in each state,\\n\",\n    \"then it ultimately leads the agent to achieve the goal.\\n\",\n    \"\\n\",\n    \"Note that these iterations are called episodes in RL terminology. We will learn more\\n\",\n    \"about episodes later in the chapter.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
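The trial-and-error behavior walked through in these iterations can be sketched in a few lines of Python. This is an illustrative sketch, not code from the book: the 3×3 grid, the shaded states (B, C, G, and H), and the -1/+1 rewards follow the example above, while the `step` and `run_episodes` helpers and the simple rule of remembering actions that earned a negative reward are hypothetical.

```python
import random

# Illustrative sketch of the grid world example: states A to I laid out in a
# 3x3 grid; shaded states (B, C, G, H) give a -1 reward, unshaded states +1.
GRID = ["ABC", "DEF", "GHI"]
SHADED = set("BCGH")
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; return (next_state, reward), or None if it leaves the grid."""
    row, col = divmod("ABCDEFGHI".index(state), 3)
    dr, dc = MOVES[action]
    nr, nc = row + dr, col + dc
    if not (0 <= nr < 3 and 0 <= nc < 3):
        return None
    nxt = GRID[nr][nc]
    return nxt, (-1 if nxt in SHADED else +1)

def run_episodes(n_episodes, seed=0):
    """Trial and error over several iterations, as in the text: the agent
    remembers actions that earned -1 in a state and avoids them later."""
    rng = random.Random(seed)
    bad = {s: set() for s in "ABCDEFGHI"}   # actions to avoid, per state
    paths = []
    for _ in range(n_episodes):
        state, path = "A", ["A"]
        while state != "I":
            options = [a for a in MOVES if a not in bad[state] and step(state, a)]
            action = rng.choice(options)
            nxt, reward = step(state, action)
            if reward < 0:
                bad[state].add(action)  # e.g. "moving right is not good in state A"
            state = nxt
            path.append(state)
        paths.append(path)
    return paths

paths = run_episodes(5)
```

Over the episodes, `bad` accumulates exactly the lessons described above (right is bad in A, down is bad in D, and so on), so later paths avoid the shaded states.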
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.05. How RL differs from other ML paradigms?.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# How RL differs from other ML paradigms?\\n\",\n    \"\\n\",\n    \"We can categorize ML into three types:\\n\",\n    \"* Supervised learning\\n\",\n    \"* Unsupervised learning\\n\",\n    \"* Reinforcement learning\\n\",\n    \"\\n\",\n    \"In supervised learning, the machine learns from training data. The training data\\n\",\n    \"consists of a labeled pair of inputs and outputs. So, we train the model (agent)\\n\",\n    \"using the training data in such a way that the model can generalize its learning to\\n\",\n    \"new unseen data. It is called supervised learning because the training data acts as a\\n\",\n    \"supervisor, since it has a labeled pair of inputs and outputs, and it guides the model\\n\",\n    \"in learning the given task.\\n\",\n    \"\\n\",\n    \"Now, let's understand the difference between supervised and reinforcement learning\\n\",\n    \"with an example. Consider the dog analogy we discussed earlier in the chapter. In\\n\",\n    \"supervised learning, to teach the dog to catch a ball, we will teach it explicitly by\\n\",\n    \"specifying turn left, go right, move forward seven steps, catch the ball, and so on\\n\",\n    \"in the form of training data. But in RL, we just throw a ball, and every time the dog\\n\",\n    \"catches the ball, we give it a cookie (reward). So, the dog will learn to catch the ball\\n\",\n    \"while trying to maximize the cookies (reward) it can get.\\n\",\n    \"\\n\",\n    \"Let's consider one more example. Say we want to train the model to play chess using\\n\",\n    \"supervised learning. In this case, we will have training data that includes all the\\n\",\n    \"moves a player can make in each state, along with labels indicating whether it is a\\n\",\n    \"good move or not. 
Then, we train the model to learn from this training data, whereas\\n\",\n    \"in the case of RL, our agent will not be given any sort of training data; instead, we\\n\",\n    \"just give a reward to the agent for each action it performs. Then, the agent will learn\\n\",\n    \"by interacting with the environment and, based on the reward it gets, it will choose\\n\",\n    \"its actions.\\n\",\n    \"\\n\",\n    \"Similar to supervised learning, in unsupervised learning, we train the model (agent)\\n\",\n    \"based on the training data. But in the case of unsupervised learning, the training data\\n\",\n    \"does not contain any labels; that is, it consists of only inputs and not outputs. The\\n\",\n    \"goal of unsupervised learning is to determine hidden patterns in the input. There is\\n\",\n    \"a common misconception that RL is a kind of unsupervised learning, but it is not. In\\n\",\n    \"unsupervised learning, the model learns the hidden structure, whereas, in RL, the\\n\",\n    \"model learns by maximizing the reward.\\n\",\n    \"\\n\",\n    \"For instance, consider a movie recommendation system. Say we want to recommend\\n\",\n    \"a new movie to the user. With unsupervised learning, the model (agent) will find\\n\",\n    \"movies similar to the movies the user (or users with a profile similar to the user) has\\n\",\n    \"viewed before and recommend new movies to the user.\\n\",\n    \"\\n\",\n    \"With RL, the agent constantly receives feedback from the user. 
This feedback\\n\",\n    \"represents rewards (a reward could be ratings the user has given for a movie they\\n\",\n    \"have watched, time spent watching a movie, time spent watching trailers, and so on).\\n\",\n    \"Based on the rewards, an RL agent will understand the movie preference of the user\\n\",\n    \"and then suggest new movies accordingly.\\n\",\n    \"\\n\",\n    \"Since the RL agent is learning with the aid of rewards, it can understand if the user's\\n\",\n    \"movie preference changes and suggest new movies according to the user's changed\\n\",\n    \"movie preference dynamically.\\n\",\n    \"\\n\",\n    \"Thus, we can say that in both supervised and unsupervised learning the model\\n\",\n    \"(agent) learns based on the given training dataset, whereas in RL the agent learns\\n\",\n    \"by directly interacting with the environment. Thus, RL is essentially an interaction\\n\",\n    \"between the agent and its environment.\\n\",\n    \"\\n\",\n    \"Before moving on to the fundamental concepts of RL, we will introduce a popular\\n\",\n    \"process to aid decision-making in an RL environment.\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.06. Markov Decision Processes.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Markov Decision Processes \\n\",\n    \"\\n\",\n    \"The Markov Decision Process (MDP) provides a mathematical framework for\\n\",\n    \"solving the RL problem. Almost all RL problems can be modeled as an MDP. MDPs\\n\",\n    \"are widely used for solving various optimization problems. In this section, we will\\n\",\n    \"understand what an MDP is and how it is used in RL.\\n\",\n    \"\\n\",\n    \"To understand an MDP, first, we need to learn about the Markov property and\\n\",\n    \"Markov chain.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Markov Property and Markov Chain \\n\",\n    \"\\n\",\n    \"The Markov property states that the future depends only on the present and not\\n\",\n    \"on the past. The Markov chain, also known as the Markov process, consists of a\\n\",\n    \"sequence of states that strictly obey the Markov property; that is, the Markov chain\\n\",\n    \"is the probabilistic model that solely depends on the current state to predict the next\\n\",\n    \"state and not the previous states, that is, the future is conditionally independent of\\n\",\n    \"the past.\\n\",\n    \"\\n\",\n    \"For example, if we want to predict the weather and we know that the current state is\\n\",\n    \"cloudy, we can predict that the next state could be rainy. We concluded that the next\\n\",\n    \"state is likely to be rainy only by considering the current state (cloudy) and not the\\n\",\n    \"previous states, which might have been sunny, windy, and so on.\\n\",\n    \"However, the Markov property does not hold for all processes. 
For instance,\\n\",\n    \"throwing a dice (the next state) has no dependency on the previous number that\\n\",\n    \"showed up on the dice (the current state).\\n\",\n    \"\\n\",\n    \"Moving from one state to another is called a transition, and its probability is called\\n\",\n    \"a transition probability. We denote the transition probability by $P(s'|s) $. It indicates\\n\",\n    \"the probability of moving from the state $s$ to the next state $s'$.\\n\",\n    \"\\n\",\n    \"Say we have three states (cloudy, rainy, and windy) in our Markov chain. Then we can represent the\\n\",\n    \"probability of transitioning from one state to another using a table called a Markov\\n\",\n    \"table, as shown in the following table:\\n\",\n    \"\\n\",\n    \"![title](Images/8.PNG)\\n\",\n    \"\\n\",\n    \"From the above table, we can observe that:\\n\",\n    \"\\n\",\n    \"* From the state cloudy, we transition to the state rainy with 70% probability and to the state windy with 30% probability.\\n\",\n    \"\\n\",\n    \"* From the state rainy, we transition to the same state rainy with 80% probability and to the state cloudy with 20% probability.\\n\",\n    \"\\n\",\n    \"* From the state windy, we transition to the state rainy with 100% probability.\\n\",\n    \"\\n\",\n    \"We can also represent this transition information of the Markov chain in the form of\\n\",\n    \"a state diagram, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/9.png)\\n\",\n    \"We can also formulate the transition probabilities into a matrix called the transition\\n\",\n    \"matrix, as shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/10.PNG)\\n\",\n    \"\\n\",\n    \"Thus, to conclude, we can say that the Markov chain or Markov process consists of a\\n\",\n    \"set of states along with their transition probabilities.\\n\",\n    \"\\n\",\n    \"## Markov Reward Process\\n\",\n    \"\\n\",\n    \"The Markov Reward Process (MRP) is an extension of the 
Markov chain with the\n",\n    "reward function. That is, we learned that the Markov chain consists of states and a\n",\n    "transition probability. The MRP consists of states, a transition probability, and also a\n",\n    "reward function.\n",\n    "\n",\n    "A reward function tells us the reward we obtain in each state. For instance, based on\n",\n    "our previous weather example, the reward function tells us the reward we obtain\n",\n    "in the state cloudy, the reward we obtain in the state windy, and so on. The reward\n",\n    "function is usually denoted by $R(s)$.\n",\n    "\n",\n    "Thus, the MRP consists of states $s$, a transition probability $P(s'|s)$, and a reward\n",\n    "function $R(s)$. \n",\n    "\n",\n    "## Markov Decision Process\n",\n    "\n",\n    "The Markov Decision Process (MDP) is an extension of the MRP with actions. That\n",\n    "is, we learned that the MRP consists of states, a transition probability, and a reward\n",\n    "function. The MDP consists of states, a transition probability, a reward function,\n",\n    "and also actions. We learned that the Markov property states that the next state is\n",\n    "dependent only on the current state and is not based on the previous state. Is the\n",\n    "Markov property applicable to the RL setting? Yes! In the RL environment, the agent\n",\n    "makes decisions only based on the current state and not based on the past states. So,\n",\n    "we can model an RL environment as an MDP.\n",\n    "\n",\n    "Let's understand this with an example. Given any environment, we can formulate\n",\n    "the environment using an MDP. For instance, let's consider the same grid world\n",\n    "environment we learned earlier. 
The following figure shows the grid world environment,\n",\n    "and the goal of the agent is to reach state I from state A without visiting the shaded\n",\n    "states:\n",\n    "\n",\n    "\n",\n    "![title](Images/11.png)\n",\n    "\n",\n    "An agent makes a decision (action) in the environment only based on the current\n",\n    "state the agent is in and not based on the past state. So, we can formulate our\n",\n    "environment as an MDP. We learned that the MDP consists of states, actions,\n",\n    "transition probabilities, and a reward function. Now, let's learn how this relates to\n",\n    "our RL environment:\n",\n    "\n",\n    "__States__ – A set of states present in the environment. Thus, in the grid world\n",\n    "environment, we have states A to I.\n",\n    "\n",\n    "__Actions__ – A set of actions that our agent can perform in each state. An agent\n",\n    "performs an action and moves from one state to another. Thus, in the grid world\n",\n    "environment, the set of actions is up, down, left, and right.\n",\n    "\n",\n    "__Transition probability__ – The transition probability is denoted by $ P(s'|s,a) $. It\n",\n    "implies the probability of moving from a state $s$ to the next state $s'$ while performing\n",\n    "an action $a$. If you observe, in the MRP, the transition probability is just $ P(s'|s) $; that\n",\n    "is, the probability of going from state $s$ to state $s'$, and it doesn't include actions. But in the MDP we include actions; thus the transition probability is denoted by $ P(s'|s,a) $. \n",\n    "\n",\n    "For example, in our grid world environment, say the transition probability of moving from state A to state B while performing the action right is 100%; then it can be expressed as: $P( B |A , \\text{right}) = 1.0 $. 
We can also view this in the state diagram as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/12.png)\\n\",\n    \"\\n\",\n    \"Suppose, our agent is in state C and the transition probability of moving from state C to the state F while performing an action down is 90% then it can be expressed as: $P( F |C , \\\\text{down}) = 0.9 $. We can also view this in the state diagram as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/13.png)\\n\",\n    \"\\n\",\n    \"__Reward function__ -  The reward function is denoted by $R(s,a,s') $. It implies the reward our agent obtains while transitioning from a state $s$ to the state $s'$ while performing an action $a$. \\n\",\n    \"\\n\",\n    \"Say, the reward we obtain while transitioning from the state A to the state B while performing an action right is -1, then it can be expressed as $R(A, \\\\text{right}, B) = -1 $. We can also view this in the state diagram as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/14.png)\\n\",\n    \"\\n\",\n    \"Suppose, our agent is in state C and say, the reward we obtain while transitioning from the state C to the state F while performing an action down is  +1, then it can be expressed as $R(C, \\\\text{down}, F) = +1 $. We can also view this in the state diagram as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/15.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, an RL environment can be represented as an MDP with states, actions,\\n\",\n    \"transition probability, and the reward function. But wait! What is the use of\\n\",\n    \"representing the RL environment using the MDP? We can solve the RL problem easily\\n\",\n    \"once we model our environment as the MDP. For instance, once we model our grid\\n\",\n    \"world environment using the MDP, then we can easily find how to reach the goal\\n\",\n    \"state I from state A without visiting the shaded states. 
We will learn more about this\\n\",\n    \"in the upcoming chapters. Next, we will go through more essential concepts of RL.\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
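The MDP formulation above can be sketched in Python. This is an illustrative sketch, not code from the book: the dictionaries below encode the grid world's states and actions together with the example values $P(B|A, \text{right}) = 1.0$, $P(F|C, \text{down}) = 0.9$, $R(A, \text{right}, B) = -1$, and $R(C, \text{down}, F) = +1$ from the text; everything else shown is an assumption.

```python
# Illustrative sketch of the grid world as an MDP: a set of states, a set of
# actions, a transition-probability table P[(s, a)] -> {s': probability},
# and a reward table R[(s, a, s')] -> reward.
states = list("ABCDEFGHI")
actions = ["up", "down", "left", "right"]

# The two probabilities below come from the text; where the remaining 10%
# for (C, down) goes is not specified, so the "E" entry is an assumption.
P = {
    ("A", "right"): {"B": 1.0},            # P(B | A, right) = 1.0
    ("C", "down"):  {"F": 0.9, "E": 0.1},  # P(F | C, down) = 0.9
}

# The two rewards below also come from the text.
R = {
    ("A", "right", "B"): -1,  # R(A, right, B) = -1
    ("C", "down", "F"): +1,   # R(C, down, F)  = +1
}

# Sanity check: each transition distribution must sum to 1.
for dist in P.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

In a full formulation, `P` and `R` would have one entry per state-action pair; only the examples from the text are filled in here.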
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.07. Action space, Policy, Episode and Horizon.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Action space, Policy, Episode, Horizon\\n\",\n    \"\\n\",\n    \"In this section, we will learn about the several important fundamental concepts that are involved in reinforcement learning. \\n\",\n    \"\\n\",\n    \"## Action space\\n\",\n    \"Consider the grid world environment shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/16.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"In the preceding grid world environment, the goal of the agent is to reach state I\\n\",\n    \"starting from state A without visiting the shaded states. In each of the states, the\\n\",\n    \"agent can perform any of the four actions—up, down, left, and right—to achieve the\\n\",\n    \"goal. The set of all possible actions in the environment is called the action space.\\n\",\n    \"Thus, for this grid world environment, the action space will be [up, down, left, right].\\n\",\n    \"We can categorize action spaces into two types:\\n\",\n    \"\\n\",\n    \"* Discrete action space \\n\",\n    \"* Continuous action space\\n\",\n    \"\\n\",\n    \"__Discrete action space__ -When our action space consists of actions that are discrete,\\n\",\n    \"then it is called a discrete action space. For instance, in the grid world environment,\\n\",\n    \"our action space consists of four discrete actions, which are up, down, left, right, and\\n\",\n    \"so it is called a discrete action space.\\n\",\n    \"\\n\",\n    \"__Continuous action space__ - When our action space consists of actions that are\\n\",\n    \"continuous, then it is called a continuous action space. For instance, let's suppose\\n\",\n    \"we are training an agent to drive a car, then our action space will consist of several\\n\",\n    \"actions that have continuous values, such as the speed at which we need to drive the\\n\",\n    \"car, the number of degrees we need to rotate the wheel, and so on. 
\\n\",\n    \"\\n\",\n    \"## Policy\\n\",\n    \"\\n\",\n    \"A policy defines the agent's behavior in an environment. The policy tells the agent\\n\",\n    \"what action to perform in each state. For instance, in the grid world environment, we\\n\",\n    \"have states A to I and four possible actions. The policy may tell the agent to move\\n\",\n    \"down in state A, move right in state D, and so on.\\n\",\n    \"\\n\",\n    \"To interact with the environment for the first time, we initialize a random policy; that\\n\",\n    \"is, the random policy tells the agent to perform a random action in each state. Thus,\\n\",\n    \"in an initial iteration, the agent performs a random action in each state and tries to\\n\",\n    \"learn whether the action is good or bad based on the reward it obtains. Over a series\\n\",\n    \"of iterations, an agent will learn to perform good actions in each state, which gives a\\n\",\n    \"positive reward. Thus, we can say that over a series of iterations, the agent will learn\\n\",\n    \"a good policy that gives a positive reward.\\n\",\n    \"\\n\",\n    \"The optimal policy is shown in the following figure. 
As we can observe, the agent selects the\\n\",\n    \"action in each state based on the optimal policy and reaches the terminal state I from\\n\",\n    \"the starting state A without visiting the shaded states:\\n\",\n    \"\\n\",\n    \"![title](Images/17.png)\\n\",\n    \"\\n\",\n    \"Thus, the optimal policy tells the agent to perform the correct action in each state so\\n\",\n    \"that the agent can receive a good reward.\\n\",\n    \"\\n\",\n    \"A policy can be classified into two types:\\n\",\n    \"\\n\",\n    \"* Deterministic Policy\\n\",\n    \"* Stochastic Policy\\n\",\n    \"\\n\",\n    \"### Deterministic Policy\\n\",\n    \"The policy we just discussed is called a deterministic policy. That is, a deterministic policy tells the agent to perform one particular action in a state. Thus, the deterministic policy maps a state to one particular action and is often denoted by $\\\\mu$. Given a state $s$ at a time $t$, a deterministic policy tells the agent to perform one particular action $a$. It can be expressed as:\\n\",\n    \"\\n\",\n    \"$$a_t = \\\\mu(s_t) $$\\n\",\n    \"\\n\",\n    \"For instance, consider our grid world example: given state A, the deterministic policy $\\\\mu$ tells the agent to perform the action down, and it can be expressed as:\\n\",\n    \"\\n\",\n    \"$$\\\\mu (A) = \\\\text{Down} $$\\n\",\n    \"\\n\",\n    \"Thus, according to the deterministic policy, whenever the agent visits state A, it performs the action down. \\n\",\n    \"\\n\",\n    \"### Stochastic Policy\\n\",\n    \"\\n\",\n    \"Unlike a deterministic policy, a stochastic policy does not map a state directly to one particular action; instead, it maps the state to a probability distribution over the action space. 
\\n\",\n    \"\\n\",\n    \"That is, we learned that given a state, a deterministic policy tells the agent to perform one particular action, so whenever the agent visits that state, it always performs the same action. With a stochastic policy, however, given a state, the policy returns a probability distribution over the action space, so instead of performing the same action every time the agent visits the state, the agent performs a different action each time, sampled from the probability distribution returned by the stochastic policy. \\n\",\n    \"\\n\",\n    \"Let's understand this with an example. We know that our grid world environment's action space consists of four actions, which are [up, down, left, right]. Given state A, say the stochastic policy returns the probability distribution over the action space as [0.10, 0.70, 0.10, 0.10]. Now, whenever the agent visits state A, instead of selecting the same action every time, the agent selects the action up 10% of the time, down 70% of the time, left 10% of the time, and right 10% of the time. \\n\",\n    \"\\n\",\n    \"The difference between the deterministic policy and the stochastic policy is shown below. As we can observe, the deterministic policy maps the state to one particular action, whereas the stochastic policy maps the state to a probability distribution over the action space:\\n\",\n    \"\\n\",\n    \"![title](Images/18.png)\\n\",\n    \"\\n\",\n    \"Thus, a stochastic policy maps a state to a probability distribution over the action space and is often denoted by $\\\\pi$. Say we have a state $s$ and an action $a$ at a time $t$; then we can express the stochastic policy as:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"$$a_t \\\\sim \\\\pi(s_t) $$\\n\",\n    \"\\n\",\n    \"Or it can also be expressed as $\\\\pi(a_t |s_t) $. 
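The two kinds of policy can be contrasted in a short sketch. This is illustrative code, not from the book; the distribution [0.10, 0.70, 0.10, 0.10] for state A comes from the example above, and the helper names are hypothetical.

```python
import random

# Illustrative sketch: a deterministic policy as a plain mapping from state
# to action, and a stochastic policy as a mapping from state to a
# probability distribution over the action space [up, down, left, right].
actions = ["up", "down", "left", "right"]

mu = {"A": "down"}                    # deterministic: mu(A) = down
pi = {"A": [0.10, 0.70, 0.10, 0.10]}  # stochastic: distribution from the text

def act_deterministic(state):
    return mu[state]                  # the same action on every visit

def act_stochastic(state, rng=random):
    # Sample an action according to the distribution pi gives for this state.
    return rng.choices(actions, weights=pi[state], k=1)[0]
```

Under `mu`, every visit to state A yields down; under `pi`, repeated visits to A yield down about 70% of the time and each of the other actions about 10% of the time.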
\\n\",\n    \"\\n\",\n    \"We can categorize the stochastic policy into two types:\\n\",\n    \"\\n\",\n    \"* Categorical policy\\n\",\n    \"* Gaussian policy\\n\",\n    \"\\n\",\n    \"### Categorical policy \\n\",\n    \"A stochastic policy is called a categorical policy when the action space is discrete. That is, when the action space is discrete, the stochastic policy uses a categorical probability distribution over the action space to select actions. For instance, in the grid world environment we have just seen above, we select actions based on a categorical probability distribution (discrete distribution), as the action space of the environment is discrete. As shown below, given state A, we select an action based on the categorical probability distribution over the action space:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/19.png)\\n\",\n    \"\\n\",\n    \"### Gaussian policy \\n\",\n    \"A stochastic policy is called a Gaussian policy when the action space is continuous. That is, when the action space is continuous, the stochastic policy uses a Gaussian probability distribution over the action space to select actions. Let's understand this with a small example. Suppose we are training an agent to drive a car, and say we have one continuous action in our action space. Let the action be the speed of the car, whose value ranges from 0 to 150 kmph. Then, the stochastic policy uses the Gaussian distribution over the action space to select an action, as shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/20.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"We will learn more about the Gaussian policy in the upcoming chapters.\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Episode \\n\",\n    \"\\n\",\n    \"The agent interacts with the environment by performing some actions, starting from the initial state and reaching the final state. 
This agent-environment interaction starting from the initial state until the final state is called an episode. For instance, in a car racing video game, the agent plays the game by starting from the initial state (the starting point of the race) and reaching the final state (the endpoint of the race). This is considered an episode. An episode is also often called a trajectory (the path taken by the agent) and is denoted by $\\tau$. \\n\",\n    \"\\n\",\n    \"An agent can play the game for any number of episodes, and each episode is independent of the others. What is the use of playing the game for multiple episodes? In order to learn the optimal policy, that is, the policy that tells the agent to perform the correct action in each state, the agent plays the game for many episodes. \\n\",\n    \"\\n\",\n    \"For example, say we are playing a car racing game for the first time; we may not win the game, so we play the game several times to understand more about the game and discover some good strategies for winning it. Similarly, in the first episode, the agent may not win the game, and it plays the game for several episodes to understand more about the game environment and good strategies to win the game. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Say we begin the game from an initial state at time step $t=0$ and reach the final state at time step $T$; then the episode information consists of the agent-environment interaction, such as states, actions, and rewards, from the initial state until the final state, that is, $(s_0, a_0,r_0,s_1,a_1,r_1,\\\\dots,s_T) $\\n\",\n    \"\\n\",\n    \"An episode (or trajectory) is shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/21.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Let's strengthen our understanding of the episode and optimal policy with the grid world environment. 
We learned that in the grid world environment, the goal of our agent is to reach the final state I starting from the initial state A without visiting the shaded states. An agent receives a +1 reward when it visits the unshaded states and a -1 reward when it visits the shaded states.\\n\",\n    \"\\n\",\n    \"When we say generate an episode, it means going from the initial state to the final state. The agent generates the first episode using a random policy and explores the environment, and over several episodes, it will learn the optimal policy. \\n\",\n    \"\\n\",\n    \"### Episode 1:\\n\",\n    \"\\n\",\n    \"As shown below, in the first episode, the agent uses a random policy and selects a random action in each state, starting from the initial state until the final state, and observes the reward:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/22.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"### Episode 2:\\n\",\n    \"\\n\",\n    \"In the second episode, the agent tries a different policy to avoid the negative rewards it received in the previous episode. 
For instance, as we can observe in the previous episode, the agent selected the action right in state A and received a negative reward, so in this episode, instead of selecting the action right in state A, it tries a different action, say down, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/23.png)\\n\",\n    \"\\n\",\n    \"### Episode n:\\n\",\n    \"\\n\",\n    \"Thus, over a series of episodes, the agent learns the optimal policy, that is, the policy that takes the agent to the final state I from state A without visiting the shaded states, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/24.png)\\n\",\n    \"\\n\",\n    \"# Episodic and Continuous tasks \\n\",\n    \"A reinforcement learning task can be categorized into two types:\\n\",\n    \"* Episodic task\\n\",\n    \"* Continuous task\\n\",\n    \"\\n\",\n    \"__Episodic task__ - As the name suggests, an episodic task is one that has a terminal state. That is, episodic tasks are tasks made up of episodes, and thus they have a terminal state. Example: a car racing game. \\n\",\n    \"\\n\",\n    \"__Continuous task__ - Unlike episodic tasks, continuous tasks do not contain any episodes, and so they don't have a terminal state. For example, a personal assistant robot does not have a terminal state. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"# Horizon\\n\",\n    \"The horizon is the time step until which the agent interacts with the environment. We can classify the horizon into two types:\\n\",\n    \"\\n\",\n    \"* Finite horizon\\n\",\n    \"* Infinite horizon\\n\",\n    \"\\n\",\n    \"__Finite horizon__ - If the agent-environment interaction stops at a particular time step, then it is called a finite horizon. For instance, in episodic tasks, the agent interacts with the environment starting from the initial state at time step $t=0$ and reaches the final state at time step $T$. 
Since the agent-environment interaction stops at the time step T, it is considered a finite horizon. \\n\",\n    \"\\n\",\n    \"__Infinite horizon__ - If the agent-environment interaction never stops then it is called an infinite horizon. For instance, we learned that a continuous task does not have any terminal state, so the agent-environment interaction never stops in a continuous task and it is therefore considered an infinite horizon. \\n\",\n    \"\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.08.  Return, Discount Factor and Math Essentials.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Return, Discount Factor and Math Essentials\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Return and discount factor\\n\",\n    \"A return can be defined as the sum of the rewards obtained by the agent in an episode. The return is often denoted by  $R$ or $G$. Say, the agent starts from the initial state at time step  $t=0$ and reaches the final state at a time step $T$, then the return obtained by the agent is given as:\\n\",\n    \"\\n\",\n    \"$$\\\\begin{aligned}R(\\\\tau) &= r_0 + r_1+r_2+\\\\dots+r_T \\\\\\\\ &\\\\\\\\\\n\",\n    \"R(\\\\tau) &= \\\\sum_{t=0}^{T-1} r_t \\\\end{aligned} $$\\n\",\n    \"\\n\",\n    \"Let's understand this with an example, consider the below trajectory $\\\\tau$:\\n\",\n    \"\\n\",\n    \"![title](Images/25.PNG)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"The return of the trajectory is the sum of the rewards, that is, $R(\\\\tau) = 2+ 2+1 +2 = 7$\\n\",\n    \"\\n\",\n    \"Thus, we can say that the goal of our agent is to maximize the return, that is, maximize the sum of rewards(cumulative rewards) obtained over the episode. How can we maximize the return? We can maximize the return if we perform correct action in each state. Okay, how can we perform correct action in each state? We can perform correct action in each state using the optimal policy. Thus, we can maximize the return using the optimal policy. \\n\",\n    \"\\n\",\n    \"Thus, we can redefine the optimal policy as the policy which gets our agent the maximum return (sum of rewards) by performing correct action in each state. \\n\",\n    \"\\n\",\n    \"Okay, how can we define the return for continuous tasks? 
We learned that in continuous tasks there are no terminal states, so we can define the return as a sum of rewards up to infinity:\\n\",\n    \"$$R(\\\\tau) = r_0 + r_1+r_2+\\\\dots $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"But how can we maximize a return which just sums to infinity? To handle this, we introduce a discount factor $\\\\gamma$ and rewrite our return as:\\n\",\n    \"\\n\",\n    \"$$\\\\begin{aligned}R(\\\\tau) &= \\\\gamma^0 r_0 + \\\\gamma^1 r_1+ \\\\gamma^2 r_2+\\\\dots \\\\\\\\&\\\\\\\\R(\\\\tau) & = \\\\sum_{t=0}^{\\\\infty} \\\\gamma^t r_t \\\\end{aligned} $$\\n\",\n    \"\\n\",\n    \"Okay, but how is this discount factor $\\\\gamma$ going to help us? It keeps the return from blowing up to infinity (for $\\\\gamma < 1$ and bounded rewards, the sum converges) and lets us decide how much importance we give to future rewards relative to immediate rewards. The value of the discount factor ranges from 0 to 1. When we set the discount factor to a small value (close to 0), we give more importance to the immediate reward than to future rewards, and when we set the discount factor to a high value (close to 1), we give more importance to future rewards than to the immediate reward. 
Let us understand this with examples using different values of the discount factor:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Small discount factor\\n\",\n    \"\\n\",\n    \"Let's set the discount factor to a small value, say 0.2, that is, let's set $\\\\gamma = 0.2$; then we can write:\\n\",\n    \"$$ \\\\begin{aligned}R &= (\\\\gamma)^0 r_0 + (\\\\gamma)^1 r_1+ (\\\\gamma)^2 r_2 +  \\\\dots \\\\\\\\&\\\\\\\\&=(0.2)^0 r_0 + (0.2)^1 r_1+ (0.2)^2 r_2 +  \\\\dots\\\\\\\\&\\\\\\\\&= (1) r_0 + (0.2) r_1+ (0.04) r_2+ \\\\dots \\\\end{aligned} $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"From the above equation, we can observe that the reward at each time step is weighted by the discount factor. As the time step increases, the discount factor (weight) decreases, and thus the importance of rewards at future time steps also decreases. That is, from the above equation, we can observe that:\\n\",\n    \"\\n\",\n    \"* At time step 0, the reward $r_0$ is weighted by the discount factor 1\\n\",\n    \"* At time step 1, the discount factor has decreased sharply and the reward $r_1$ is weighted by the discount factor 0.2\\n\",\n    \"* At time step 2, the discount factor has decreased again, to 0.04, and the reward $r_2$ is weighted by the discount factor 0.04\\n\",\n    \"\\n\",\n    \"As we can observe, the discount factor decreases sharply over the subsequent time steps, and more importance is given to the immediate reward $r_0$ than to the rewards obtained at future time steps. Thus, when we set the discount factor to a small value, we give more importance to the immediate reward than to future rewards. 
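\\n\",\n    \"\\n\",\n    \"We can verify this weighting with a quick calculation. Below is a minimal sketch; the reward sequence [1, 1, 1, 1] is a made-up example, not taken from the grid world:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"# Weights gamma**t decay quickly when gamma is small\\n\",\n    \"gamma = 0.2\\n\",\n    \"rewards = [1, 1, 1, 1]                                # hypothetical rewards\\n\",\n    \"weights = [gamma**t for t in range(len(rewards))]    # approx [1, 0.2, 0.04, 0.008]\\n\",\n    \"R = sum(w * r for w, r in zip(weights, rewards))     # discounted return, approx 1.248\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"With $\\\\gamma = 0.2$, the weight on the reward at time step 3 is already just 0.008, so almost all of the return comes from the immediate reward.\\n\",\n    \"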
\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## High discount factor\\n\",\n    \"Let's set the discount factor to a high value, say 0.9, that is, let's set, $\\\\gamma = 0.9$ , then we can write:\\n\",\n    \"\\n\",\n    \"$$\\\\begin{aligned}R &=(\\\\gamma)^0 r_0 + (\\\\gamma)^1 r_1+ (\\\\gamma)^2 r_2 +  \\\\dots \\\\\\\\&\\\\\\\\ &=(0.9)^0 r_0 + (0.9)^1 r_1+ (0.9)^2 r_2 +  \\\\dots\\\\\\\\&\\\\\\\\&= (1) r_0 + (0.9) r_1+ (0.81) r_2+ \\\\dots  \\\\end{aligned} $$\\n\",\n    \"\\n\",\n    \"From the above equation, we can infer that as the time step increases the discount factor (weight) decreases, however, it is not decreasing heavily unlike the previous case since here we started off with $\\\\gamma=0.9$. So in this case, we can say that we give more importance to future rewards. That is, from the above equation, we can observe that:\\n\",\n    \"\\n\",\n    \"* At the time step 0, the reward  $r_0$ is weighted by the discount factor 1\\n\",\n    \"* At the time step 1, the discount factor is decreased but not heavily decreased and the reward $r_1$  is weighted by the discount factor 0.9\\n\",\n    \"* At the time step 2, the discount factor is decreased to 0.81 and the reward $r_2$ is weighted by the discount 0.81\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"As we can observe the discount factor is decreased for the subsequent time steps but unlike the previous case, the discount factor is not decreased heavily. Thus, when we set the discount factor to high value we give more importance to future rewards than the immediate reward. \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## What happens when we set the discount factor to 0?\\n\",\n    \"\\n\",\n    \"When we set the discount factor to 0, that is $\\\\gamma=0$, then it implies that we consider only the immediate reward $r_0$ and not the reward obtained from the future time steps. 
Thus, when we set the discount factor to 0, the agent never learns anything beyond the first step, since it considers only the immediate reward $r_0$, as shown below:\\n\",\n    \"\\n\",\n    \"$$\\\\begin{aligned}R &=(\\\\gamma)^0 r_0 + (\\\\gamma)^1 r_1+ (\\\\gamma)^2 r_2 + \\\\dots \\\\\\\\&\\\\\\\\ &=(0)^0 r_0 + (0)^1 r_1+ (0)^2 r_2 +  \\\\dots\\\\\\\\&\\\\\\\\& = r_0 \\\\end{aligned} $$\\n\",\n    \"\\n\",\n    \"As we can observe, when we set $\\\\gamma=0$, our return will be just the immediate reward $r_0$.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## What happens when we set the discount factor to 1?\\n\",\n    \"\\n\",\n    \"When we set the discount factor to 1, that is $\\\\gamma=1$, it implies that we consider all the future rewards with full weight. Thus, when we set the discount factor to 1, the agent keeps looking for all future rewards, and the return may grow to infinity, as shown below:\\n\",\n    \"\\n\",\n    \"$$\\\\begin{aligned}R &=(\\\\gamma)^0 r_0 + (\\\\gamma)^1 r_1+ (\\\\gamma)^2 r_2 + \\\\dots \\\\\\\\&\\\\\\\\&=(1)^0 r_0 + (1)^1 r_1+ (1)^2 r_2 +  \\\\dots\\\\\\\\&\\\\\\\\& = r_0 + r_1 + r_2 + \\\\dots \\\\end{aligned} $$\\n\",\n    \"\\n\",\n    \"As we can observe, when we set $\\\\gamma=1$, our return will be the sum of rewards up to infinity. \\n\",\n    \"\\n\",\n    \"Thus, we learned that when we set the discount factor to 0 the agent considers only the immediate reward, and when we set the discount factor to 1 the sum of future rewards may grow to infinity. So the discount factor is set to a value strictly between 0 and 1; in practice, a value close to 1, such as 0.9, is common.\\n\",\n    \"\\n\",\n    \"But why should we care about immediate and future rewards? We give importance to immediate and future rewards depending on the task. In some tasks, future rewards are more desirable than the immediate reward, and vice versa. In a chess game, the goal is to defeat the opponent's king. 
If we give more importance to the immediate reward, which is acquired by actions like our pawn capturing an opponent's piece and so on, then the agent will learn to perform this sub-goal instead of learning the actual goal. So, in this case, we give more importance to future rewards than to the immediate reward, whereas in some cases, we prefer immediate rewards over future rewards. Say, would you prefer chocolates if I gave them to you today or 13 days later?\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Math Essentials \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Before looking into the next important concept in reinforcement learning, let's quickly recap expectation, as we will be dealing with expectation throughout the book.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Expectation \\n\",\n    \"\\n\",\n    \"Let's say we have a variable X and it takes the following values: 1, 2, 3, 4, 5, 6. To compute the average value of X, we can just sum all the values of X and divide by the number of values. Thus, the average of X is (1+2+3+4+5+6)/6 = 3.5\\n\",\n    \"\\n\",\n    \"Now, let's suppose X is a random variable. A random variable takes values based on a random experiment, such as throwing a die, tossing a coin and so on, and it takes different values with some probabilities. Let's suppose we are throwing a fair die; then the possible outcomes (X) are 1, 2, 3, 4, 5, 6 and the probability of occurrence of each of these outcomes is 1/6, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/26.PNG)\\n\",\n    \"\\n\",\n    \"How can we compute the average value of the random variable X? Since each value has a probability of occurrence, we can't just take the plain average. 
So what we will do here is compute the weighted average, that is, the sum of the values of X multiplied by their respective probabilities; this is called the expectation. The expectation of a random variable X can be defined as:\\n\",\n    \"\\n\",\n    \"$$E(X) = \\\\sum_{i=1}^N x_i p(x_i) $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, the expectation of the random variable X is E(X) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 3.5\\n\",\n    \"\\n\",\n    \"The expectation is also known as the expected value. Thus, the expected value of the random variable X is 3.5. So, when we say the expectation or expected value of a random variable, it basically means the weighted average.\\n\",\n    \"\\n\",\n    \"Now, we will look into the expectation of a function of a random variable. Let $f(x) = x^2$; then we can write:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/27.PNG)\\n\",\n    \"\\n\",\n    \"The expectation of a function of a random variable can be computed as:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"$$\\\\mathbb{E}_{x \\\\sim p(x)}[f(X)] = \\\\sum_{i=1}^N f(x_i) p(x_i) $$\\n\",\n    \"\\n\",\n    \"Thus, the expected value of f(X) is given as E(f(X)) = 1(1/6) + 4(1/6) + 9(1/6) + 16(1/6) + 25(1/6) + 36(1/6) = 91/6, which is approximately 15.17\\n\",\n    \"\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   
\"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.09 Value function and Q function.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Value function\\n\",\n    \"\\n\",\n    \"The value function also called the state value function denotes the value of the state. The value of a state is the return an agent would obtain starting from that state following the policy $\\\\pi$. The value of a state or value function is usually denoted by $V(s)$ and it can be expressed as: \\n\",\n    \"\\n\",\n    \"$$ V^{\\\\pi}(s) = [R(\\\\tau) | s_0 = s] $$\\n\",\n    \"\\n\",\n    \"where, $ s_0 = s $ implies that the starting state is $s_0$. The value of a state is called the state value. \\n\",\n    \"\\n\",\n    \"Let's understand the value function with an example. Let's suppose we generate the trajectory $\\\\tau$ following some policy $\\\\pi$ in our grid world environment as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/28.png)\\n\",\n    \"\\n\",\n    \"Now, how do we compute the value of all the states in our trajectory? We learned that the value of a state is the return(sum of reward) an agent would obtain starting from that state following the policy $\\\\pi$. The above trajectory is generated using the policy $\\\\pi$, thus we can say that the value of a state is the return(sum of rewards) of the trajectory starting from that state. \\n\",\n    \"\\n\",\n    \"* The value of the state A is the return of the trajectory starting from the state A. Thus, $V(A) = 1+1+-1+1 = 2 $\\n\",\n    \"* The value of the state D is the return of the trajectory starting from the state D. Thus, $V(D) = 1-1+1 = 1 $\\n\",\n    \"* The value of the state E is the return of the trajectory starting from the state E. Thus, $V(E) = -1+1 = 0 $\\n\",\n    \"* The value of the state H is the return of the trajectory starting from the state H. Thus, $V(H) = 1$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"What about the value of the final state I?  
We learned that the value of a state is the return (sum of rewards) starting from that state. We know that we obtain a reward when we transition from one state to another. Since I is the final state, we don't make any transition from the final state, so there is no reward and thus no value for the final state I. \\n\",\n    \"\\n\",\n    \"_In a nutshell, the value of a state is the return of the trajectory starting from that state._\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Wait! There is a small change here: instead of taking the return directly as the value of a state, we will use the expected return. Thus, the value function, or the value of the state $s$, can be defined as the expected return that the agent would obtain starting from the state $s$ and following the policy $\\\\pi$. It can be expressed as:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"$$V^{\\\\pi}(s) = \\\\underset{\\\\tau \\\\sim \\\\pi}{\\\\mathbb{E}}[R(\\\\tau) | s_0 = s] $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Let's understand this with a simple example. Let's suppose we have a stochastic policy $\\\\pi$. We learned that, unlike a deterministic policy, which maps a state directly to an action, a stochastic policy maps the state to a probability distribution over the action space. Thus, the stochastic policy selects actions based on a probability distribution.\\n\",\n    \"\\n\",\n    \"Let's suppose we are in state A and the stochastic policy returns the probability distribution over the action space as [0.0, 0.80, 0.00, 0.20]. 
It implies that with this stochastic policy, in the state A, we perform the action down 80% of the time, that is, $\\\\pi(\\\\text{down} |A) = 0.80 $, and the action right 20% of the time, that is, $\\\\pi(\\\\text{right} |A) = 0.20 $.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, in the state A, our stochastic policy $\\\\pi$ selects the action down 80% of the time and the action right 20% of the time, and say our stochastic policy selects the action right in the states D and E and the action down in the states B and F 100% of the time.\\n\",\n    \"\\n\",\n    \"First, we generate an episode $\\\\tau_1$ using our given stochastic policy $\\\\pi$, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/29.PNG)\\n\",\n    \"\\n\",\n    \"For better understanding, let's focus only on the value of the state A. The value of the state A is the return (sum of rewards) of the trajectory starting from the state A. Thus, $V(A) = R(\\\\tau_1) = 1+1+1+1 =4 $\\n\",\n    \"\\n\",\n    \"Say we generate another episode $\\\\tau_2$ using the same stochastic policy $\\\\pi$, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/30.PNG)\\n\",\n    \"\\n\",\n    \"The value of the state A is the return (sum of rewards) of the trajectory from the state A. Thus, $V(A) = R(\\\\tau_2) = -1+1+1+1 =2 $\\n\",\n    \"\\n\",\n    \"As you may observe, although we use the same policy, the values of the state A in the trajectories $\\\\tau_1$ and $\\\\tau_2$ are different. This is because our policy is stochastic: it performs the action down in the state A 80% of the time and the action right in the state A 20% of the time. So, when we generate a trajectory using the policy $\\\\pi$, the trajectory $\\\\tau_1$ will occur 80% of the time and the trajectory $\\\\tau_2$ will occur 20% of the time. Thus, the return will be 4 for 80% of the time and 2 for 20% of the time. 
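\\n\",\n    \"\\n\",\n    \"This weighted average can be checked with a couple of lines of Python. The returns 4 and 2 and the probabilities 0.8 and 0.2 are the ones from the two trajectories above:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"# Expected return over the two possible trajectories\\n\",\n    \"trajectories = {4: 0.8, 2: 0.2}      # return -> probability of that trajectory\\n\",\n    \"expected_return = sum(R * p for R, p in trajectories.items())   # 0.8*4 + 0.2*2 = 3.6\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"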
\\n\",\n    \"\\n\",\n    \"Thus, instead of taking the value of the state as a return directly we will take the expected return since the return takes different values with some probability. The expected return is basically the weighted average, that is, the sum of the return multiplied by their probability. Thus, we can write:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"$$ V^{\\\\pi}(s) = \\\\underset{\\\\tau \\\\sim \\\\pi}{\\\\mathbb{E}}[R(\\\\tau) | s_0 = s] $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"The value of a state A can be obtained as:\\n\",\n    \"$$\\\\begin{align}V^{\\\\pi}(A) &= \\\\underset{\\\\tau \\\\sim \\\\pi}{\\\\mathbb{E}}[R(\\\\tau) | s_0 = A] \\\\\\\\ &= \\\\sum_i R(\\\\tau_i) \\\\pi(a_i|A) \\\\\\\\ &= R(\\\\tau_1) \\\\pi (\\\\text{down}|A) + R(\\\\tau_2) \\\\pi (\\\\text{right}|A)\\\\\\\\ &= 4(0.8) + 2(0.2) \\\\\\\\ &= 3.6 \\n\",\n    \"\\\\end{align} $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, the value of a state is the expected return of the trajectory starting from that state.\\n\",\n    \"\\n\",\n    \"Note that the value function depends on the policy, that is, the value of the state varies based on the policy we choose. There can be many different value functions according to different policies. The optimal value function,  $ V^*(s) $ is the one which yields maximum value compared to all the other value functions. It can be expressed as:\\n\",\n    \"\\n\",\n    \"$$V^{*}(s) = \\\\max_{\\\\pi} V^{\\\\pi}(s) $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"For example, let's say we have two policies $\\\\pi_1$ and $\\\\pi_2$. Let the value of a state  using the policy $\\\\pi_1$  be $V^{\\\\pi_1} (s) = 13 $ and the value of the state  using the policy $\\\\pi_2$ be $V^{\\\\pi_2} (s) = 11 $.  Then the optimal value of the state  will be  $V^*(s) = 13 $ as it is the maximum. The policy which gives the maximum state value is called optimal policy $\\\\pi^*$. 
Thus, in this case, $\\\\pi_1$ is the optimal policy, as it gives the maximum state value. \\n\",\n    \"\\n\",\n    \"We can view the value function as a table, called a value table. Let us say we have two states $s_0$ and $s_1$; then the value function can be represented as:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/31.PNG)\\n\",\n    \"\\n\",\n    \"From the above value table, we can tell that it is better to be in the state $s_1$ than in the state $s_0$, as $s_1$ has a higher value. \\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Q function\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"A Q function, also called the state-action value function, denotes the value of a state-action pair. The value of a state-action pair is the return the agent would obtain starting from the state $s$, performing the action $a$, and then following the policy $\\\\pi$. 
The value of a state-action pair, or Q function, is usually denoted by $Q(s,a)$ and is called the Q value or state-action value. It is expressed as: \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/32.PNG)\\n\",\n    \"\\n\",\n    \"Note that the only difference between the value function and the Q function is that in the value function we compute the value of a state, whereas in the Q function we compute the value of a state-action pair. Let's understand the Q function with an example. Consider the below trajectory generated using some policy $\\\\pi$:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"We learned that the Q function computes the value of a state-action pair. Say we need to compute the Q value of the state-action pair A-down, that is, the Q value of performing the action down in the state A. Then the Q value will be just the return of our trajectory starting from the state A and performing the action down, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Let's suppose we need to compute the Q value of the state-action pair D-right, that is, the Q value of performing the action right in the state D. Then the Q value will be just the return of our trajectory starting from the state D and performing the action right, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Similarly, we can compute the Q value for all the state-action pairs. Similar to what we learned in the value function, instead of taking the return directly as the Q value of a state-action pair we will use the expected return, because the return is a random variable and it takes different values with some probability. 
So we can redefine our Q function as:\\n\",\n    \"\\n\",\n    \"$$Q^{\\\\pi}(s,a) = \\\\underset{\\\\tau \\\\sim \\\\pi}{\\\\mathbb{E}}[R(\\\\tau) | s_0 = s, a_0 = a] $$\\n\",\n    \"\\n\",\n    \"It implies that the Q value is the expected return the agent would obtain starting from the state $s$, performing the action $a$ and then following the policy $\\\\pi$.\\n\",\n    \"\\n\",\n    \"Similar to the value function, the Q function depends on the policy, that is, the Q value varies based on the policy we choose. There can be many different Q functions according to different policies. The optimal Q function is the one which has the maximum Q value over all other Q functions and it can be expressed as:\\n\",\n    \"\\n\",\n    \"$$Q^{*}(s,a) = \\\\max_{\\\\pi} Q^{\\\\pi}(s,a) $$\\n\",\n    \"\\n\",\n    \"The optimal policy $\\\\pi^*$ is the policy which gives the maximum Q value. \\n\",\n    \"\\n\",\n    \"Like the value function, the Q function can be viewed as a table, called a Q table. Let us say we have two states $s_0$ and $s_1$ and two actions 0 and 1; then the Q function can be represented as follows:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/33.PNG)\\n\",\n    \"\\n\",\n    \"As we can observe, the above Q table represents the Q value of all possible state-action pairs. We can extract the policy from the Q function by just selecting the action which has the maximum Q value in each state:\\n\",\n    \"\\n\",\n    \"$$\\\\pi^{*}(s) = \\\\arg \\\\max_{a} Q^{*}(s,a) $$\\n\",\n    \"\\n\",\n    \"Thus, in each state, our policy will select the action that has the maximum Q value, as shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/34.PNG)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, we can extract the policy by computing the Q function. 
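\\n\",\n    \"\\n\",\n    \"The extraction step can be sketched in a few lines of Python. The Q values below are made-up numbers for illustration, not the ones from the Q table above:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"# Extracting a greedy policy from a Q table: pick the argmax action per state\\n\",\n    \"q_table = {\\n\",\n    \"    's0': [0.1, 0.9],   # hypothetical Q values of actions 0 and 1 in state s0\\n\",\n    \"    's1': [0.8, 0.3],   # hypothetical Q values of actions 0 and 1 in state s1\\n\",\n    \"}\\n\",\n    \"policy = {s: max(range(len(qs)), key=lambda a: qs[a]) for s, qs in q_table.items()}\\n\",\n    \"print(policy)   # {'s0': 1, 's1': 0}\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"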
\\n\",\n    \"\\n\",\n    \"\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.10. Model-Based and Model-Free Learning .ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Model-Based and Model-free learning\\n\",\n    \"\\n\",\n    \"**Model-based learning** - In model-based learning, an agent will have a complete description of the environment. That is, we learned that the transition probability tells the probability of moving from a state  to the next state  by performing an action  and the reward function tells the reward we would obtain while moving from a state  to the next state  by performing an action . When the agent knows the model dynamics of the environment, that is, when the agent knows the transition probability of the environment then it is called model-based learning. Thus, in model-based learning the agent uses the model dynamics for finding the optimal policy. \\n\",\n    \"\\n\",\n    \"**Model-free learning** - When the agent does not know the model dynamics of the environment then it is called model-free learning. That is, In model-free learning, an agent tries to find the optimal policy without the model dynamics. \\n\",\n    \"\\n\",\n    \"Thus, to summarize, in a model-free setting, the agent learns optimal policy without the model dynamics of the environment whereas, in a model-based setting, the agent learns the optimal policy with the model dynamics of the environment.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.11. Different Types of Environments.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Different types of environments\\n\",\n    \"\\n\",\n    \"We learned that the environment is the world of the agent and the agent lives/stays within the environment. We can categorize the environment into different types as follows:\\n\",\n    \"\\n\",\n    \"## Deterministic and Stochastic environment\\n\",\n    \"\\n\",\n    \"__Deterministic environment__ - In a deterministic environment, we can be sure that when an agent performs an action $a$ in the state $s$ then it always reaches the state $s'$ exactly. For example, let's consider our grid world environment. Say the agent is in state A when we perform action down in the state A we always reach the state D and so it is called deterministic environment:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/35.png)\\n\",\n    \"\\n\",\n    \"__Stochastic environment__ - In a stochastic environment, we cannot say that by performing some action $a$ in the state $s$ the agent always reaches the state $s'$ exactly because there will be some randomness associated with the stochastic environment. For example, let's suppose our grid world environment is a stochastic environment. Say our agent is in state A, now if we perform action down in the state A then the agent doesn't always reach the state D  instead it reaches the state D for 70% of the time and the state B for 30 % of the time. That is, if we perform action down in the state A then the agent reaches the state D with 70% probability and the state B with 30% probability as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/36.png)\\n\",\n    \"\\n\",\n    \"## Discrete and continuous environment \\n\",\n    \"\\n\",\n    \"__Discrete Environment__ - When the action space of the environment is discrete then our environment is called a discrete environment. 
For instance, in the grid world environment, we have a discrete action space consisting of the actions [up, down, left, right], and thus our grid world environment is called a discrete environment. \\n\",\n    \"\\n\",\n    \"__Continuous environment__ - When the action space of the environment is continuous, our environment is called a continuous environment. For instance, suppose we are training an agent to drive a car; then our action space will be continuous, with several continuous actions such as the speed at which we drive the car, the number of degrees by which we rotate the wheel and so on. In such a case, where the action space of the environment is continuous, it is called a continuous environment. \\n\",\n    \"\\n\",\n    \"## Episodic and non-episodic environment \\n\",\n    \"\\n\",\n    \"__Episodic environment__ - In an episodic environment, an agent's current action will not affect future actions, and thus the episodic environment is also called a non-sequential environment. \\n\",\n    \"\\n\",\n    \"__Non-episodic environment__ - In a non-episodic environment, an agent's current action will affect future actions, and thus the non-episodic environment is also called a sequential environment. Example: the chessboard is a sequential environment, since the agent's current action affects future actions in a chess game.\\n\",\n    \"\\n\",\n    \"## Single and multi-agent environment\\n\",\n    \"\\n\",\n    \"__Single-agent environment__ - When our environment consists of only a single agent, it is called a single-agent environment. \\n\",\n    \"\\n\",\n    \"__Multi-agent environment__ - When our environment consists of multiple agents, it is called a multi-agent environment. 
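\\n\",\n    \"\\n\",\n    \"The discrete versus continuous distinction above can be sketched with plain Python; the action values here are made-up examples:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"import random\\n\",\n    \"\\n\",\n    \"# Discrete action space: a finite set of actions, as in the grid world\\n\",\n    \"discrete_actions = ['up', 'down', 'left', 'right']\\n\",\n    \"a1 = random.choice(discrete_actions)\\n\",\n    \"\\n\",\n    \"# Continuous action space: an action is a real number within a range,\\n\",\n    \"# e.g. a hypothetical steering angle in degrees\\n\",\n    \"low, high = -40.0, 40.0\\n\",\n    \"a2 = random.uniform(low, high)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"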
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
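The stochastic grid world example above (performing down in state A reaches D 70% of the time and B 30% of the time) can be sketched in plain Python. This is a minimal illustration, not code from the book: the `transitions` table and `step` helper are hypothetical names, and the 70/30 split and state names follow the example.

```python
import random

# Transition table for the stochastic grid world example:
# performing "down" in state A leads to D 70% of the time and B 30% of the time.
transitions = {
    ("A", "down"): [("D", 0.7), ("B", 0.3)],
}

def step(state, action):
    """Sample the next state according to the transition probabilities."""
    next_states, probs = zip(*transitions[(state, action)])
    return random.choices(next_states, weights=probs, k=1)[0]

# Repeating the same action many times shows the 70/30 split empirically.
counts = {"D": 0, "B": 0}
for _ in range(10_000):
    counts[step("A", "down")] += 1
print(counts)  # roughly {'D': 7000, 'B': 3000}
```

In a deterministic environment, the same table would hold a single next state with probability 1.0 for each state-action pair.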
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.12. Applications of Reinforcement Learning.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Applications of Reinforcement Learning\\n\",\n    \"\\n\",\n    \"Reinforcement learning has evolved rapidly over the past couple of years with a wide range of applications ranging from playing games to self-driving cars. One of the major reasons for this evolution is due to deep reinforcement learning (DRL) which is a combination of reinforcement learning and deep learning. We will learn about the various state of the art deep reinforcement learning algorithms in the upcoming chapters so be excited. In this section, we will look into some of the real-life applications of reinforcement learning.\\n\",\n    \"\\n\",\n    \"__Manufacturing__ - In manufacturing, intelligent robots are trained using reinforcement learning to place objects in the right position. The use of intelligent robots reduces labor costs and increases productivity. \\n\",\n    \"\\n\",\n    \"Dynamic Pricing - One of the popular applications of reinforcement learning includes dynamic pricing. Dynamic pricing implies that we change the price of the products based on demand and supply. We can train the RL agent for the dynamic pricing of the products with the goal of maximizing the revenue.\\n\",\n    \"\\n\",\n    \"__Inventory management__ - Reinforcement learning is extensively used in inventory management which is a crucial business activity. Some of these activities include supply chain management, demand forecasting, and handling several warehouse operations (such as placing products in warehouses for managing space efficiently).\\n\",\n    \"\\n\",\n    \"__Recommendation System__ - Reinforcement learning is widely used in building a recommendation system where the behavior of the user constantly changes. For instance, in the music recommendation system, the behavior or the music preference of the user changes from time to time. 
In such cases, an RL agent can be very useful, as the agent constantly learns by interacting with the environment. \\n\",\n    \"\\n\",\n    \"__Neural architecture search__ - For a neural network to perform a given task with good accuracy, the architecture of the network is very important, and it has to be properly designed. With reinforcement learning, we can automate the process of complex neural architecture search by training the agent to find the best neural architecture for a given task with the goal of maximizing accuracy.\\n\",\n    \"\\n\",\n    \"__Natural Language Processing__ - With the increase in popularity of deep reinforcement learning algorithms, RL has been widely used in several NLP tasks, such as abstractive text summarization, chatbots, and more.\\n\",\n    \"\\n\",\n    \"__Finance__ - Reinforcement learning is widely used in financial portfolio management, which is the process of constantly redistributing a fund into different financial products. RL is also used in predicting and trading in commercial transaction markets. JP Morgan has successfully used RL to provide better trade execution results for large orders.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "01. Fundamentals of Reinforcement Learning/1.13. Reinforcement Learning Glossary.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Reinforcement Learning Glossary \\n\",\n    \"\\n\",\n    \"We have learned several important and fundamental concepts of Reinforcement learning. In this section, we will revisit the several important terms and terminologies that are very useful for understanding upcoming chapters. \\n\",\n    \"\\n\",\n    \"__Agent__ - Agent is the software program that learns to make intelligent decisions. Example: A software program that plays the chess game intelligently. \\n\",\n    \"\\n\",\n    \"__Environment__ - The environment is the world of the agent. A chessboard is called the environment when the agent plays the chess game.\\n\",\n    \"\\n\",\n    \"__State__ -  A state is a position or a moment in the environment where the agent can be in. Example: All the positions in the chessboard are called the states. \\n\",\n    \"\\n\",\n    \"__Action__ - The agent interacts with the environment by performing an action and move from one state to another. Example: Moves made by the chessman can be considered an action. \\n\",\n    \"\\n\",\n    \"__Reward__ -  A reward is a numerical value that the agent receives based on its action. Consider reward as a point. For instance, an agent receives +1 point (reward) for good action and -1 point (reward) for a bad action. \\n\",\n    \"\\n\",\n    \"__Action space__ - A set of all possible actions in the environment is called action space. The action space is called a discrete action space when our action space consists of discrete actions and the action space is called continuous action space when our actions space consists of continuous actions.\\n\",\n    \"\\n\",\n    \"__Policy__  - The agent makes a decision based on the policy. A policy tells the agent what action to perform in each state. It can be considered as the brain of an agent. 
A policy is called a deterministic policy if it maps a state to one particular action exactly. Unlike a deterministic policy, a stochastic policy maps a state to a probability distribution over the action space. An optimal policy is the one that gives the maximum reward. \\n\",\n    \"\\n\",\n    \"__Episode__ - The agent-environment interaction from the initial state to the terminal state is called an episode. An episode is often called a trajectory or rollout. \\n\",\n    \"\\n\",\n    \"__Episodic and continuous task__ - A reinforcement learning task is called an episodic task if it has a terminal state, and a continuous task if it does not have a terminal state. \\n\",\n    \"\\n\",\n    \"__Horizon__ - The horizon can be considered the agent's life span, that is, the time step until which the agent interacts with the environment. The horizon is called a finite horizon if the agent-environment interaction stops at a particular time step, and an infinite horizon when the interaction continues forever. \\n\",\n    \"\\n\",\n    \"__Return__ - The return is the sum of rewards received by the agent in an episode.\\n\",\n    \"\\n\",\n    \"__Discount factor__ - The discount factor helps control whether we want to give importance to the immediate reward or to future rewards. The value of the discount factor ranges from 0 to 1. A discount factor close to 0 implies that we give more importance to the immediate reward, while a discount factor close to 1 implies that we give more importance to future rewards than to the immediate reward.\\n\",\n    \"\\n\",\n    \"__Value function__ - The value function, or the value of a state, is the expected return that the agent would get starting from the state $s$ and following the policy $\\\\pi$. \\n\",\n    \"\\n\",\n    \"__Q function__ - The Q function, or the value of a state-action pair, is the expected return the agent would obtain starting from the state $s$, performing an action $a$, and following the policy $\\\\pi$. 
\\n\",\n    \"\\n\",\n    \"__Model-based and Model-free learning__ - When the agent tries to learn the optimal policy with the model dynamics then it is called model-based learning and when the agent tries to learn the optimal policy without the model dynamics then it is called model-free learning. \\n\",\n    \"\\n\",\n    \"__Deterministic and Stochastic environment__ - When an agent performs an action $a$ in the state $s$ and if it reaches the state  exactly every time, then the environment is called a deterministic environment. When an agent performs an action $a$ in the state $s$ and if it reaches different states every time based on some probability distribution then the environment is called stochastic environment. \\n\",\n    \"\\n\",\n    \"Thus, in this chapter, we have learned several important and fundamental concepts of reinforcement learning. In the next chapter, we will begin our Hands-on reinforcement learning journey by implementing all the fundamental concepts we have learned in this chapter using the popular toolkit called the gym. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
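The glossary's return and discount factor entries can be made concrete with a small sketch of the standard discounted return, $G = \sum_t \gamma^t r_t$. This is an illustration, not code from the book; the `discounted_return` helper is a hypothetical name.

```python
def discounted_return(rewards, gamma):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    G = 0.0
    # Iterate backwards so each step folds in the discounted future return.
    for r in reversed(rewards):
        G = r + gamma * G
    return G

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=1.0))  # 3.0: plain (undiscounted) sum
print(discounted_return(rewards, gamma=0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, gamma=0.9))  # ~2.71: future rewards weighted less
```

Note how moving gamma from 0 toward 1 shifts the emphasis from the immediate reward to future rewards, exactly as the glossary entry describes.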
  {
    "path": "02. A Guide to the Gym Toolkit/2.02.  Creating our First Gym Environment.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Creating our first gym environment\\n\",\n    \"\\n\",\n    \"We learned that the gym provides a variety of environments for training the reinforcement learning agent. To clearly understand how the gym environment is designed, we will start off with the basic gym environment. Going forward, we will understand all other complex gym environments. \\n\",\n    \"\\n\",\n    \"Let's introduce one of the simplest environments called the frozen lake environment. The frozen lake environment is shown below. As we can observe, in the frozen lake environment, the goal of the agent is to start from the initial state S and reach the goal state G.\\n\",\n    \"\\n\",\n    \"![title](Images/1.png)\\n\",\n    \"\\n\",\n    \"In the above environment, the following applies:\\n\",\n    \"\\n\",\n    \"* S denotes the starting state\\n\",\n    \"* F denotes the frozen state\\n\",\n    \"* H denotes the hole state\\n\",\n    \"* G denotes the goal state\\n\",\n    \"\\n\",\n    \"So, the agent has to start from the state S and reach the goal state G. But one issue is that if the agent visits the state H, which is just the hole state, then the agent will fall into the hole and die as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/2.png)\\n\",\n    \"\\n\",\n    \"So, we need to make sure that the agent starts from S and reaches G without falling into the hole state H as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/3.png)\\n\",\n    \"Each grid box in the above environment is called state, thus we have 16 states (S to G) and we have 4 possible actions which are up, down, left and right. We learned that our goal is to reach the state G from S without visiting H. So, we assign reward as 0 to all the states and + 1 for the goal state G. \\n\",\n    \"\\n\",\n    \"Thus, we learned how the frozen lake environment works. 
Now, to train our agent in the frozen lake environment, we would first need to create the environment by coding it from scratch in Python. But luckily, we don't have to do that! Since the gym provides various environments, we can directly import the gym toolkit and create a frozen lake environment using the gym.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, we will learn how to create our frozen lake environment using the gym. Before running any code, make sure that you have activated our virtual environment, universe. First, let's import the gym library:\\n\",\n    \"\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"Next, we can create a gym environment using the make function. The make function requires the environment id as a parameter. In the gym, the id of the frozen lake environment is `FrozenLake-v0`. 
So, we can create our frozen lake environment as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"FrozenLake-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After creating the environment, we can see what our environment looks like using the render function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\n\",\n      \"\\u001b[41mS\\u001b[0mFFF\\n\",\n      \"FHFH\\n\",\n      \"FFFH\\n\",\n      \"HFFG\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"As we can observe, the frozen lake environment consists of 16 states (S to G), as we learned. The state S is highlighted, indicating that it is our current state, that is, the agent is in the state S. So, whenever we create an environment, the agent will always begin from the initial state; in our case, it is the state S. \\n\",\n    \"\\n\",\n    \"That's it! Creating the environment using the gym is that simple. In the next section, we will understand more about the gym environment by relating all the concepts we have learned in the previous chapter. 
\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Exploring the environment\\n\",\n    \"\\n\",\n    \"In the previous chapter, we learned that the reinforcement learning environment can be modeled as the Markov decision process (MDP) and an MDP consists of the following: \\n\",\n    \"\\n\",\n    \"* __States__ -  A set of states present in the environment \\n\",\n    \"* __Actions__ - A set of actions that the agent can perform in each state. \\n\",\n    \"* __Transition probability__ - The transition probability is denoted by $P(s'|s,a) $. It implies the probability of moving from a state $s$ to the state $s'$ while performing an action $a$.\\n\",\n    \"* __Reward function__ - Reward function is denoted by $R(s,a,s')$. It implies the reward the agent obtains moving from a state $s$ to the state  $s'$ while performing an action $a$.\\n\",\n    \"\\n\",\n    \"Let's now understand how to obtain all the above information from the frozen lake environment we just created using the gym.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## States\\n\",\n    \"A state space consists of all of our states. We can obtain the number of states in our environment by just typing `env.observation_space` as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Discrete(16)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.observation_space)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"It implies that we have 16 discrete states in our state space starting from the state S to G. 
Note that, in the gym, the states will be encoded as numbers, so the state S will be encoded as 0, state F will be encoded as 1, and so on, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/5.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Actions\\n\",\n    \"\\n\",\n    \"We learned that the action space consists of all the possible actions in the environment. We can obtain the action space by `env.action_space` as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Discrete(4)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.action_space)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"It implies that we have 4 discrete actions in our action space, which are left, down, right, and up. Note that, similar to states, actions will also be encoded into numbers, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/6.PNG)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Transition probability and Reward function\\n\",\n    \"\\n\",\n    \"Now, let's look at how to obtain the transition probability and the reward function. We learned that in a stochastic environment, we cannot say that by performing some action $a$ the agent will always reach the next state $s'$ exactly, because there is some randomness associated with the stochastic environment; by performing an action $a$ in the state $s$, the agent reaches the next state $s'$ with some probability.\\n\",\n    \"\\n\",\n    \"Let's suppose we are in state 2 (F). 
Now, if we perform action 1 (down) in state 2, we can reach state 6, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/7.png)\\n\",\n    \"\\n\",\n    \"Our frozen lake environment is a stochastic environment. When our environment is stochastic, we won't always reach state 6 by performing action 1 (down) in state 2; we also reach other states with some probability. So, when we perform action 1 (down) in state 2, we reach state 1 with probability 0.33333, we reach state 6 with probability 0.33333, and we reach state 3 with probability 0.33333, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/8.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"As we can notice, in a stochastic environment we reach the next states with some probability. Now, let's learn how to obtain this transition probability using the gym environment.  \\n\",\n    \"\\n\",\n    \"We can obtain the transition probability and the reward function by just typing `env.P[state][action]`. So, in order to obtain the transition probability of moving from the state S to the other states by performing the action right, we can type `env.P[S][right]`. But we cannot just type state S and action right directly since they are encoded as numbers. 
We learned that state S is encoded as 0 and the action right is encoded as 2; so, in order to obtain the transition probability of state S by performing the action right, we type `env.P[0][2]`, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[(0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False)]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.P[0][2])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"What does this imply? Our output is in the form of `[(transition probability, next state, reward, Is terminal state?)]`. It implies that if we perform action 2 (right) in state 0 (S), then:\\n\",\n    \"\\n\",\n    \"* We reach the state 4 (F) with probability 0.33333 and receive 0 reward. \\n\",\n    \"* We reach the state 1 (F) with probability 0.33333 and receive 0 reward.\\n\",\n    \"* We reach the same state 0 (S) with probability 0.33333 and receive 0 reward.\\n\",\n    \"\\n\",\n    \"The transition probability is shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/9.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Thus, when we type `env.P[state][action]`, we get the result in the form of `[(transition probability, next state, reward, Is terminal state?)]`. The last value is a boolean indicating whether the next state is a terminal state; since 4, 1, and 0 are not terminal states, it is given as False. 
\\n\",\n    \"\\n\",\n    \"The output of `env.P[0][2]` is shown in the below table for more clarity:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/10.PNG)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's understand this with one more example. Let's suppose we are in the state 3 (F) as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/11.png)\\n\",\n    \"\\n\",\n    \"Say, we perform action 1 (down) in the state 3(F). Then the transition probability of the state 3(F) by performing action 1(down) can be obtained as shown below:\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[(0.3333333333333333, 2, 0.0, False), (0.3333333333333333, 7, 0.0, True), (0.3333333333333333, 3, 0.0, False)]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.P[3][1])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we learned, our output is in the form of `[(transition probability, next state, reward, Is terminal state?)]` It implies that if we perform an action 1 (down) in state 3 (F) then:\\n\",\n    \"\\n\",\n    \"* We reach the state 2 (F) with probability 0.33333 and receive 0 reward. 
\\n\",\n    \"* We reach the state 7 (H) with probability 0.33333 and receive 0 reward.\\n\",\n    \"* We reach the same state 3 (F) with probability 0.33333 and receive 0 reward.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"The transition probability is shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/12.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"The output of `env.P[3][1]` is shown in the below table for more clarity:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/13.PNG)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe, in the second row of our output, we have, `(0.33333, 7, 0.0, True)`,and the last value here is marked as True. It implies that state 7 is a terminal state. That is, if we perform action 1(down) in state 3(F) then we reach the state 7(H) with 0.33333 probability and since 7(H) is a hole, the agent dies if it reaches the state 7(H). Thus 7(H) is a terminal state and so it is marked as True. \\n\",\n    \"\\n\",\n    \"Thus, we learned how to obtain the state space, action space, transition probability and the reward function using the gym environment. In the next section, we will learn how to generate an episode. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
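The `env.P[state][action]` structure described above can also be explored without gym installed, by hand-building a small stand-in table. This is a minimal sketch, not the book's code: the real table comes from `gym.make("FrozenLake-v0").P`, and the `expected_reward` and `terminal_successors` helpers are hypothetical names; the tuple values mirror the `env.P[0][2]` and `env.P[3][1]` outputs shown in the notebook.

```python
# Stand-in for Gym's env.P: each entry is a list of
# (transition probability, next state, reward, is terminal?) tuples.
P = {
    0: {2: [(1/3, 4, 0.0, False), (1/3, 1, 0.0, False), (1/3, 0, 0.0, False)]},
    3: {1: [(1/3, 2, 0.0, False), (1/3, 7, 0.0, True), (1/3, 3, 0.0, False)]},
}

def expected_reward(P, state, action):
    """Expected immediate reward: sum over transitions of prob * reward."""
    return sum(prob * reward for prob, _, reward, _ in P[state][action])

def terminal_successors(P, state, action):
    """Next states flagged as terminal for this state-action pair."""
    return [s for _, s, _, done in P[state][action] if done]

print(expected_reward(P, 0, 2))      # 0.0 - no reward on these transitions
print(terminal_successors(P, 3, 1))  # [7] - the hole state H
```

Iterating over this table in exactly this way is also how dynamic programming methods such as value iteration (covered in Chapter 3) consume the transition probabilities and rewards.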
  {
    "path": "02. A Guide to the Gym Toolkit/2.05. Cart Pole Balancing with Random Policy.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Cart Pole Balancing with Random Policy\\n\",\n    \"\\n\",\n    \"Let's create an agent with the random policy, that is, we create the agent that selects the random action in the environment and tries to balance the pole. The agent receives +1 reward every time the pole stands straight up on the cart. We will generate over 100 episodes and we will see the return (sum of rewards) obtained over each episode. Let's learn this step by step.\\n\",\n    \"\\n\",\n    \"First, create our cart pole environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"env = gym.make('CartPole-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"Set the number of episodes and number of time steps in the episode:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 100\\n\",\n    \"num_timesteps = 50\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Episode: 0, Return: 23.0\\n\",\n      \"Episode: 10, Return: 12.0\\n\",\n      \"Episode: 20, Return: 23.0\\n\",\n      \"Episode: 30, Return: 15.0\\n\",\n      \"Episode: 40, Return: 19.0\\n\",\n      \"Episode: 50, Return: 10.0\\n\",\n      \"Episode: 60, Return: 16.0\\n\",\n      \"Episode: 70, Return: 10.0\\n\",\n      \"Episode: 80, Return: 22.0\\n\",\n      \"Episode: 90, Return: 38.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #set the Return to 0\\n\",\n    \"    
Return = 0\\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #for each step in the episode\\n\",\n    \"    for t in range(num_timesteps):\\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #randomly select an action by sampling from the environment\\n\",\n    \"        random_action = env.action_space.sample()\\n\",\n    \"        \\n\",\n    \"        #perform the randomly selected action\\n\",\n    \"        next_state, reward, done, info = env.step(random_action)\\n\",\n    \"\\n\",\n    \"        #update the return\\n\",\n    \"        Return = Return + reward\\n\",\n    \"\\n\",\n    \"        #if the next state is a terminal state then end the episode\\n\",\n    \"        if done:\\n\",\n    \"            break\\n\",\n    \"    #for every 10 episodes, print the return (sum of rewards)\\n\",\n    \"    if i%10==0:\\n\",\n    \"        print('Episode: {}, Return: {}'.format(i, Return))\\n\",\n    \"        \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Close the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env.close()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "02. A Guide to the Gym Toolkit/README.md",
    "content": "# 2. A Guide to the Gym Toolkit\n* 2.1. Setting Up our Machine\n   * 2.1.1. Installing Anaconda\n   * 2.1.2. Installing the Gym Toolkit\n   * 2.1.3. Common Error Fixes\n* 2.2. Creating our First Gym Environment\n   * 2.2.1. Exploring the Environment\n   * 2.2.2. States\n   * 2.2.3. Actions\n   * 2.2.4. Transition Probability and Reward Function\n* 2.3. Generating an episode\n* 2.4. Classic Control Environments\n   * 2.4.1. State Space\n   * 2.4.2. Action Space\n* 2.5. Cart Pole Balancing with Random Policy\n* 2.6. Atari Game Environments\n   * 2.6.1. General Environment\n   * 2.6.2. Deterministic Environment\n* 2.7. Agent Playing the Tennis Game\n* 2.8. Recording the Game\n* 2.9. Other environments\n   * 2.9.1. Box 2D\n   * 2.9.2. Mujoco\n   * 2.9.3. Robotics\n   * 2.9.4. Toy text\n   * 2.9.5. Algorithms\n* 2.10. Environment Synopsis"
  },
  {
    "path": "03. Bellman Equation and Dynamic Programming/.ipynb_checkpoints/3.06. Solving the Frozen Lake Problem with Value Iteration-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"# Solving the Frozen Lake Problem with Value Iteration\\n\",\n    \"\\n\",\n    \"In the previous chapter, we have learned about the frozen lake environment. The frozen\\n\",\n    \"lake environment is shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"![title](Images/4.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's recap the frozen lake environment a bit. In the frozen lake environment shown above,\\n\",\n    \"the following applies:\\n\",\n    \"    \\n\",\n    \"* S implies the starting state\\n\",\n    \"* F implies the frozen states\\n\",\n    \"* H implies the hold states\\n\",\n    \"* G implies the goal state\\n\",\n    \"\\n\",\n    \"We learned that in the frozen lake environment, our goal is to reach the goal state G from\\n\",\n    \"the starting state S without visiting the hole states H. That is, while trying to reach the goal\\n\",\n    \"state G from the starting state S if the agent visits the hole state H then it will fall into the\\n\",\n    \"hole and die as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"![title](Images/5.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"So, the goal of the agent is to reach the state G starting from the state S without visiting the\\n\",\n    \"hole states H as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"![title](Images/6.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"How can we achieve this goal? 
That is, how can we reach the state G from S without\\n\",\n    \"visiting H? We learned that the optimal policy tells the agent to perform the correct action in\\n\",\n    \"each state. So, if we find the optimal policy, then we can reach the state G from S without visiting the state H. Okay, how do we find the optimal policy? We can use the value iteration method\\n\",\n    \"we just learned to find the optimal policy.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Remember that all our states (S to G) will be encoded from 0 to 15 and all the four actions -\\n\",\n    \"left, down, up, right - will be encoded from 0 to 3 in the gym toolkit.\\n\",\n    \"So, in this section, we will learn how to find the optimal policy using the value iteration\\n\",\n    \"method so that the agent can reach the state G from S without visiting H.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create the frozen lake environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('FrozenLake-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's look at the frozen lake environment using the render function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\n\",\n      \"\\u001b[41mS\\u001b[0mFFF\\n\",\n      \"FHFH\\n\",\n      \"FFFH\\n\",\n      \"HFFG\\n\"\n     ]\n    }\n   ],\n   
\"source\": [\n    \"env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can notice, our agent is in the state S and it has to reach the state G without visiting\\n\",\n    \"the states H. So, let's learn how to compute the optimal policy using the value iteration\\n\",\n    \"method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, let's learn how to compute the optimal value function and then we will see how to\\n\",\n    \"extract the optimal policy from the computed optimal value function. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Computing optimal value function\\n\",\n    \"\\n\",\n    \"We will define a function called `value_iteration` where we compute the optimal value\\n\",\n    \"function iteratively by taking maximum over Q function. For\\n\",\n    \"better understanding, let's closely look at the every line of the function and then we look at\\n\",\n    \"the complete function at the end which gives us more clarity.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Define `value_iteration` function which takes the environment as a parameter: \\n\",\n    \"    \\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def value_iteration(env):\\n\",\n    \"\\n\",\n    \"    #set the number of iterations\\n\",\n    \"    num_iterations = 1000\\n\",\n    \"    \\n\",\n    \"    #set the threshold number for checking the convergence of the value function\\n\",\n    \"    threshold = 1e-20\\n\",\n    \"    \\n\",\n    \"    #we also set the discount factor\\n\",\n    \"    gamma = 1.0\\n\",\n    \"    \\n\",\n    \"    #now, we will initialize the value table, with the value of all states to zero\\n\",\n    \"    value_table = np.zeros(env.observation_space.n)\\n\",\n    \"    \\n\",\n    \"    #for every iteration\\n\",\n    
\"    for i in range(num_iterations):\\n\",\n    \"        \\n\",\n    \"        #update the value table, that is, we learned that on every iteration, we use the updated value\\n\",\n    \"        #table (state values) from the previous iteration\\n\",\n    \"        updated_value_table = np.copy(value_table) \\n\",\n    \"             \\n\",\n    \"        #now, we compute the value function (state value) by taking the maximum of Q value.\\n\",\n    \"        \\n\",\n    \"        #thus, for each state, we compute the Q values of all the actions in the state and then\\n\",\n    \"        #we update the value of the state as the one which has maximum Q value as shown below:\\n\",\n    \"        for s in range(env.observation_space.n):\\n\",\n    \"            \\n\",\n    \"            Q_values = [sum([prob*(r + gamma * updated_value_table[s_])\\n\",\n    \"                             for prob, s_, r, _ in env.P[s][a]]) \\n\",\n    \"                                   for a in range(env.action_space.n)] \\n\",\n    \"                                        \\n\",\n    \"            value_table[s] = max(Q_values) \\n\",\n    \"                        \\n\",\n    \"        #after computing the value table, that is, value of all the states, we check whether the\\n\",\n    \"        #difference between value table obtained in the current iteration and previous iteration is\\n\",\n    \"        #less than or equal to a threshold value if it is less then we break the loop and return the\\n\",\n    \"        #value table as our optimal value function as shown below:\\n\",\n    \"    \\n\",\n    \"        if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):\\n\",\n    \"             break\\n\",\n    \"    \\n\",\n    \"    return value_table\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"Now, that we have computed the optimal value function by taking the maximum over 
Q\\n\",\n    \"values, let's see how to extract the optimal policy from the optimal value function. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Extracting optimal policy from the optimal value function\\n\",\n    \"\\n\",\n    \"In the previous step, we computed the optimal value function. Now, let see how to extract\\n\",\n    \"the optimal policy from the computed optimal value function.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"First, we define a function called `extract_policy` which takes the `value_table` as a\\n\",\n    \"parameter: \\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def extract_policy(value_table):\\n\",\n    \"    \\n\",\n    \"    #set the discount factor\\n\",\n    \"    gamma = 1.0\\n\",\n    \"     \\n\",\n    \"    #first, we initialize the policy with zeros, that is, first, we set the actions for all the states to\\n\",\n    \"    #be zero\\n\",\n    \"    policy = np.zeros(env.observation_space.n) \\n\",\n    \"    \\n\",\n    \"    #now, we compute the Q function using the optimal value function obtained from the\\n\",\n    \"    #previous step. After computing the Q function, we can extract policy by selecting action which has\\n\",\n    \"    #maximum Q value. Since we are computing the Q function using the optimal value\\n\",\n    \"    #function, the policy extracted from the Q function will be the optimal policy. 
\\n\",\n    \"    \\n\",\n    \"    #As shown below, for each state, we compute the Q values for all the actions in the state and\\n\",\n    \"    #then we extract policy by selecting the action which has maximum Q value.\\n\",\n    \"    \\n\",\n    \"    #for each state\\n\",\n    \"    for s in range(env.observation_space.n):\\n\",\n    \"        \\n\",\n    \"        #compute the Q value of all the actions in the state\\n\",\n    \"        Q_values = [sum([prob*(r + gamma * value_table[s_])\\n\",\n    \"                             for prob, s_, r, _ in env.P[s][a]]) \\n\",\n    \"                                   for a in range(env.action_space.n)] \\n\",\n    \"                \\n\",\n    \"        #extract policy by selecting the action which has maximum Q value\\n\",\n    \"        policy[s] = np.argmax(np.array(Q_values))        \\n\",\n    \"    \\n\",\n    \"    return policy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"That's it! Now, we will see how to extract the optimal policy in our frozen lake\\n\",\n    \"environment. 
\\n\",\n    \"\\n\",\n    \"## Putting it all together\\n\",\n    \"We learn that in the frozen lake environment our goal is to find the optimal policy which\\n\",\n    \"selects the correct action in each state so that we can reach the state G from the state\\n\",\n    \"A without visiting the hole states.\\n\",\n    \"\\n\",\n    \"First, we compute the optimal value function using our `value_iteration` function by\\n\",\n    \"passing our frozen lake environment as the parameter: \\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"optimal_value_function = value_iteration(env=env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, we extract the optimal policy from the optimal value function using our\\n\",\n    \"extract_policy function as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"optimal_policy = extract_policy(optimal_value_function)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can print the obtained optimal policy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(optimal_policy)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe, our optimal policy tells us to\\n\",\n    \"perform the correct action in each state. 
\\n\",\n    \"\\n\",\n    \"Now, that we have learned what is value iteration and how to perform the value iteration\\n\",\n    \"method to compute the optimal policy in our frozen lake environment, in the next section\\n\",\n    \"we will learn about another interesting method called the policy iteration. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "03. Bellman Equation and Dynamic Programming/3.06. Solving the Frozen Lake Problem with Value Iteration.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"# Solving the Frozen Lake Problem with Value Iteration\\n\",\n    \"\\n\",\n    \"In the previous chapter, we have learned about the frozen lake environment. The frozen\\n\",\n    \"lake environment is shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"![title](Images/4.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's recap the frozen lake environment a bit. In the frozen lake environment shown above,\\n\",\n    \"the following applies:\\n\",\n    \"    \\n\",\n    \"* S implies the starting state\\n\",\n    \"* F implies the frozen states\\n\",\n    \"* H implies the hold states\\n\",\n    \"* G implies the goal state\\n\",\n    \"\\n\",\n    \"We learned that in the frozen lake environment, our goal is to reach the goal state G from\\n\",\n    \"the starting state S without visiting the hole states H. That is, while trying to reach the goal\\n\",\n    \"state G from the starting state S if the agent visits the hole state H then it will fall into the\\n\",\n    \"hole and die as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"![title](Images/5.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"So, the goal of the agent is to reach the state G starting from the state S without visiting the\\n\",\n    \"hole states H as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"![title](Images/6.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"How can we achieve this goal? 
That is, how can we reach the state G from S without\n",\n    \"visiting H? We learned that the optimal policy tells the agent to perform the correct action in\\n\",\n    \"each state. So, if we find the optimal policy then we can reach the state G from S without visiting the state H. Okay, how do we find the optimal policy? We can use the value iteration method\\n\",\n    \"we just learned to find the optimal policy.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Remember that all our states (S to G) will be encoded from 0 to 15 and all four actions -\\n\",\n    \"left, down, right, up - will be encoded from 0 to 3 in the gym toolkit.\\n\",\n    \"So, in this section, we will learn how to find the optimal policy using the value iteration\\n\",\n    \"method so that the agent can reach the state G from S without visiting H.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create the frozen lake environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('FrozenLake-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's look at the frozen lake environment using the render function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\n\",\n      \"\\u001b[41mS\\u001b[0mFFF\\n\",\n      \"FHFH\\n\",\n      \"FFFH\\n\",\n      \"HFFG\\n\"\n     ]\n    }\n   ],\n   
\"source\": [\n    \"env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can notice, our agent is in the state S and it has to reach the state G without visiting\\n\",\n    \"the states H. So, let's learn how to compute the optimal policy using the value iteration\\n\",\n    \"method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, let's learn how to compute the optimal value function and then we will see how to\\n\",\n    \"extract the optimal policy from the computed optimal value function. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Computing optimal value function\\n\",\n    \"\\n\",\n    \"We will define a function called `value_iteration` where we compute the optimal value\\n\",\n    \"function iteratively by taking maximum over Q function. For\\n\",\n    \"better understanding, let's closely look at the every line of the function and then we look at\\n\",\n    \"the complete function at the end which gives us more clarity.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Define `value_iteration` function which takes the environment as a parameter: \\n\",\n    \"    \\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def value_iteration(env):\\n\",\n    \"\\n\",\n    \"    #set the number of iterations\\n\",\n    \"    num_iterations = 1000\\n\",\n    \"    \\n\",\n    \"    #set the threshold number for checking the convergence of the value function\\n\",\n    \"    threshold = 1e-20\\n\",\n    \"    \\n\",\n    \"    #we also set the discount factor\\n\",\n    \"    gamma = 1.0\\n\",\n    \"    \\n\",\n    \"    #now, we will initialize the value table, with the value of all states to zero\\n\",\n    \"    value_table = np.zeros(env.observation_space.n)\\n\",\n    \"    \\n\",\n    \"    #for every iteration\\n\",\n    
\"    for i in range(num_iterations):\\n\",\n    \"        \\n\",\n    \"        #update the value table, that is, we learned that on every iteration, we use the updated value\\n\",\n    \"        #table (state values) from the previous iteration\\n\",\n    \"        updated_value_table = np.copy(value_table) \\n\",\n    \"             \\n\",\n    \"        #now, we compute the value function (state value) by taking the maximum of Q value.\\n\",\n    \"        \\n\",\n    \"        #thus, for each state, we compute the Q values of all the actions in the state and then\\n\",\n    \"        #we update the value of the state as the one which has maximum Q value as shown below:\\n\",\n    \"        for s in range(env.observation_space.n):\\n\",\n    \"            \\n\",\n    \"            Q_values = [sum([prob*(r + gamma * updated_value_table[s_])\\n\",\n    \"                             for prob, s_, r, _ in env.P[s][a]]) \\n\",\n    \"                                   for a in range(env.action_space.n)] \\n\",\n    \"                                        \\n\",\n    \"            value_table[s] = max(Q_values) \\n\",\n    \"                        \\n\",\n    \"        #after computing the value table, that is, value of all the states, we check whether the\\n\",\n    \"        #difference between value table obtained in the current iteration and previous iteration is\\n\",\n    \"        #less than or equal to a threshold value if it is less then we break the loop and return the\\n\",\n    \"        #value table as our optimal value function as shown below:\\n\",\n    \"    \\n\",\n    \"        if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):\\n\",\n    \"             break\\n\",\n    \"    \\n\",\n    \"    return value_table\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"Now, that we have computed the optimal value function by taking the maximum over 
Q\\n\",\n    \"values, let's see how to extract the optimal policy from the optimal value function. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Extracting optimal policy from the optimal value function\\n\",\n    \"\\n\",\n    \"In the previous step, we computed the optimal value function. Now, let see how to extract\\n\",\n    \"the optimal policy from the computed optimal value function.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"First, we define a function called `extract_policy` which takes the `value_table` as a\\n\",\n    \"parameter: \\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def extract_policy(value_table):\\n\",\n    \"    \\n\",\n    \"    #set the discount factor\\n\",\n    \"    gamma = 1.0\\n\",\n    \"     \\n\",\n    \"    #first, we initialize the policy with zeros, that is, first, we set the actions for all the states to\\n\",\n    \"    #be zero\\n\",\n    \"    policy = np.zeros(env.observation_space.n) \\n\",\n    \"    \\n\",\n    \"    #now, we compute the Q function using the optimal value function obtained from the\\n\",\n    \"    #previous step. After computing the Q function, we can extract policy by selecting action which has\\n\",\n    \"    #maximum Q value. Since we are computing the Q function using the optimal value\\n\",\n    \"    #function, the policy extracted from the Q function will be the optimal policy. 
\\n\",\n    \"    \\n\",\n    \"    #As shown below, for each state, we compute the Q values for all the actions in the state and\\n\",\n    \"    #then we extract policy by selecting the action which has maximum Q value.\\n\",\n    \"    \\n\",\n    \"    #for each state\\n\",\n    \"    for s in range(env.observation_space.n):\\n\",\n    \"        \\n\",\n    \"        #compute the Q value of all the actions in the state\\n\",\n    \"        Q_values = [sum([prob*(r + gamma * value_table[s_])\\n\",\n    \"                             for prob, s_, r, _ in env.P[s][a]]) \\n\",\n    \"                                   for a in range(env.action_space.n)] \\n\",\n    \"                \\n\",\n    \"        #extract policy by selecting the action which has maximum Q value\\n\",\n    \"        policy[s] = np.argmax(np.array(Q_values))        \\n\",\n    \"    \\n\",\n    \"    return policy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"That's it! Now, we will see how to extract the optimal policy in our frozen lake\\n\",\n    \"environment. 
\\n\",\n    \"\\n\",\n    \"## Putting it all together\\n\",\n    \"We learn that in the frozen lake environment our goal is to find the optimal policy which\\n\",\n    \"selects the correct action in each state so that we can reach the state G from the state\\n\",\n    \"A without visiting the hole states.\\n\",\n    \"\\n\",\n    \"First, we compute the optimal value function using our `value_iteration` function by\\n\",\n    \"passing our frozen lake environment as the parameter: \\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"optimal_value_function = value_iteration(env=env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, we extract the optimal policy from the optimal value function using our\\n\",\n    \"extract_policy function as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"optimal_policy = extract_policy(optimal_value_function)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can print the obtained optimal policy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(optimal_policy)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe, our optimal policy tells us to\\n\",\n    \"perform the correct action in each state. 
\\n\",\n    \"\\n\",\n    \"Now, that we have learned what is value iteration and how to perform the value iteration\\n\",\n    \"method to compute the optimal policy in our frozen lake environment, in the next section\\n\",\n    \"we will learn about another interesting method called the policy iteration. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "03. Bellman Equation and Dynamic Programming/3.08. Solving the Frozen Lake Problem with Policy Iteration.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Solving the Frozen Lake Problem with Policy Iteration\\n\",\n    \"\\n\",\n    \"We learned that in the frozen lake environment, our goal is to reach the goal state G from\\n\",\n    \"the starting state S without visiting the hole states H. Now, let's learn how to compute the optimal policy using the policy iteration method in the frozen lake environment.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create the frozen lake environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('FrozenLake-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"We learned that in policy iteration, we compute the value function using the policy\\n\",\n    \"iteratively. Once we find the optimal value function, the policy used to\\n\",\n    \"compute it will be the optimal policy.\\n\",\n    \"\\n\",\n    \"So, first, let's learn how to compute the value function using the policy. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Computing value function using policy\\n\",\n    \"\\n\",\n    \"This step is exactly the same as how we computed the value function in the value iteration\\n\",\n    \"method, but with a small difference: here we compute the value function using the given policy,\\n\",\n    \"whereas in the value iteration method, we compute the value function by taking the maximum\\n\",\n    \"over Q values. 
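In symbols (an added note, using the transition probabilities $P$, rewards $R$, and discount factor $\gamma$ of the environment), the two updates differ only in how the action is chosen. Policy evaluation follows the given policy $\pi$:

$$V^{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s))\left[R(s, \pi(s), s') + \gamma V^{\pi}(s')\right]$$

whereas the value iteration update takes the maximum over all actions:

$$V(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$$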
Now, let's learn how to define a function that computes the value function\\n\",\n    \"using the given policy.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Let's define a function called `compute_value_function` which takes the policy as a\\n\",\n    \"parameter:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def compute_value_function(policy):\\n\",\n    \"    \\n\",\n    \"    #now, let's define the number of iterations\\n\",\n    \"    num_iterations = 1000\\n\",\n    \"    \\n\",\n    \"    #define the threshold value\\n\",\n    \"    threshold = 1e-20\\n\",\n    \"    \\n\",\n    \"    #set the discount factor\\n\",\n    \"    gamma = 1.0\\n\",\n    \"    \\n\",\n    \"    #now, we will initialize the value table, with the value of all states to zero\\n\",\n    \"    value_table = np.zeros(env.observation_space.n)\\n\",\n    \"    \\n\",\n    \"    #for every iteration\\n\",\n    \"    for i in range(num_iterations):\\n\",\n    \"        \\n\",\n    \"        #update the value table, that is, we learned that on every iteration, we use the updated value\\n\",\n    \"        #table (state values) from the previous iteration\\n\",\n    \"        updated_value_table = np.copy(value_table)\\n\",\n    \"        \\n\",\n    \"        \\n\",\n    \"\\n\",\n    \"        #thus, for each state, we select the action according to the given policy and then we update the\\n\",\n    \"        #value of the state using the selected action as shown below\\n\",\n    \"        \\n\",\n    \"        #for each state\\n\",\n    \"        for s in range(env.observation_space.n):\\n\",\n    \"            \\n\",\n    \"            #select the action in the state according to the policy\\n\",\n    \"            a = policy[s]\\n\",\n    \"            \\n\",\n    \"            #compute the value of the state using the selected action\\n\",\n    \"            value_table[s] = sum([prob * (r + 
gamma * updated_value_table[s_]) \\n\",\n    \"                                        for prob, s_, r, _ in env.P[s][a]])\\n\",\n    \"            \\n\",\n    \"        #after computing the value table, that is, the value of all the states, we check whether the\\n\",\n    \"        #difference between the value tables obtained in the current and previous iterations is\\n\",\n    \"        #less than or equal to a threshold value. If it is, we break the loop and return the\\n\",\n    \"        #value table as an accurate value function of the given policy\\n\",\n    \"\\n\",\n    \"        if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):\\n\",\n    \"            break\\n\",\n    \"            \\n\",\n    \"    return value_table\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"Now that we have computed the value function of the policy, let's see how to extract the\\n\",\n    \"policy from the value function. \\n\",\n    \"\\n\",\n    \"## Extracting policy from the value function\\n\",\n    \"\\n\",\n    \"This step is exactly the same as how we extracted the policy from the value function in the\\n\",\n    \"value iteration method. 
Thus, similar to what we learned in the value iteration method, we\\n\",\n    \"define a function called `extract_policy` to extract a policy given the value function as\\n\",\n    \"shown below:\\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def extract_policy(value_table):\\n\",\n    \"    \\n\",\n    \"    #set the discount factor\\n\",\n    \"    gamma = 1.0\\n\",\n    \"     \\n\",\n    \"    #first, we initialize the policy with zeros, that is, first, we set the actions for all the states to\\n\",\n    \"    #be zero\\n\",\n    \"    policy = np.zeros(env.observation_space.n) \\n\",\n    \"    \\n\",\n    \"    #now, we compute the Q function using the value function obtained from the\\n\",\n    \"    #previous step. After computing the Q function, we can extract the policy by selecting the action\\n\",\n    \"    #with the maximum Q value. Once the value function has converged to the optimal value\\n\",\n    \"    #function, the policy extracted from the Q function will be the optimal policy. 
\\n\",\n    \"    \\n\",\n    \"    #As shown below, for each state, we compute the Q values for all the actions in the state and\\n\",\n    \"    #then we extract policy by selecting the action which has maximum Q value.\\n\",\n    \"    \\n\",\n    \"    #for each state\\n\",\n    \"    for s in range(env.observation_space.n):\\n\",\n    \"        \\n\",\n    \"        #compute the Q value of all the actions in the state\\n\",\n    \"        Q_values = [sum([prob*(r + gamma * value_table[s_])\\n\",\n    \"                             for prob, s_, r, _ in env.P[s][a]]) \\n\",\n    \"                                   for a in range(env.action_space.n)] \\n\",\n    \"                \\n\",\n    \"        #extract policy by selecting the action which has maximum Q value\\n\",\n    \"        policy[s] = np.argmax(np.array(Q_values))        \\n\",\n    \"    \\n\",\n    \"    return policy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"## Putting it all together\\n\",\n    \"\\n\",\n    \"First, let's define a function called `policy_iteration` which takes the environment as a\\n\",\n    \"parameter\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def policy_iteration(env):\\n\",\n    \"    \\n\",\n    \"    #set the number of iterations\\n\",\n    \"    num_iterations = 1000\\n\",\n    \"    \\n\",\n    \"    #we learned that in the policy iteration method, we begin by initializing a random policy.\\n\",\n    \"    #so, we will initialize the random policy which selects the action 0 in all the states\\n\",\n    \"    policy = np.zeros(env.observation_space.n)  \\n\",\n    \"    \\n\",\n    \"    #for every iteration\\n\",\n    \"    for i in range(num_iterations):\\n\",\n    \"        #compute the value function using the policy\\n\",\n    \"        value_function = 
compute_value_function(policy)\\n\",\n    \"        \\n\",\n    \"        #extract the new policy from the computed value function\\n\",\n    \"        new_policy = extract_policy(value_function)\\n\",\n    \"           \\n\",\n    \"        #if the policy and new_policy are same then break the loop\\n\",\n    \"        if (np.all(policy == new_policy)):\\n\",\n    \"            break\\n\",\n    \"        \\n\",\n    \"        #else, update the current policy to new_policy\\n\",\n    \"        policy = new_policy\\n\",\n    \"        \\n\",\n    \"    return policy\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"Now, let's learn how to perform policy iteration and find the optimal policy in the frozen\\n\",\n    \"lake environment. \\n\",\n    \"\\n\",\n    \"So, we just feed the frozen lake environment to our `policy_iteration`\\n\",\n    \"function as shown below and get the optimal policy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"optimal_policy = policy_iteration(env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can print the optimal policy: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(optimal_policy)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe, our optimal policy tells us to perform the correct action in each\\n\",\n    \"state. Thus, we learned how to perform the policy iteration method to compute the optimal\\n\",\n    \"policy. 
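To make the printed policy easier to read, we can map each action number to a direction and lay the policy out on the 4 x 4 lake grid. This is an illustrative snippet, not part of the original notebook: the action numbering 0 = Left, 1 = Down, 2 = Right, 3 = Up follows the FrozenLake-v0 convention, and the policy array below is the one printed above:

```python
import numpy as np

# the optimal policy printed above; actions follow the FrozenLake-v0
# convention: 0 = Left, 1 = Down, 2 = Right, 3 = Up
optimal_policy = np.array([0., 3., 3., 3., 0., 0., 0., 0.,
                           3., 1., 0., 0., 0., 2., 1., 0.])

# map each action number to an arrow symbol
arrows = np.array(['<', 'v', '>', '^'])

# reshape the 16 actions into the 4 x 4 lake layout
grid = arrows[optimal_policy.astype(int)].reshape(4, 4)

for row in grid:
    print(' '.join(row))
```

Each row of arrows corresponds to one row of the lake, so we can visually check that the policy steers around the holes toward the goal.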
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "03. Bellman Equation and Dynamic Programming/README.md",
"content": "# 3. Bellman Equation and Dynamic Programming\n* 3.1. The Bellman Equation\n   * 3.1.1. Bellman Equation of the Value Function\n   * 3.1.2. Bellman Equation of the Q Function\n* 3.2. Bellman Optimality Equation\n* 3.3. Relation Between Value and Q Function\n* 3.4. Dynamic Programming\n* 3.5. Value Iteration\n   * 3.5.1. Algorithm - Value Iteration\n* 3.6. Solving the Frozen Lake Problem with Value Iteration\n* 3.7. Policy Iteration\n   * 3.7.1. Algorithm - Policy Iteration\n* 3.8. Solving the Frozen Lake Problem with Policy Iteration\n* 3.9. Is DP Applicable to all Environments?"
  },
  {
    "path": "04. Monte Carlo Methods/.ipynb_checkpoints/4.01. Understanding the Monte Carlo Method-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Understanding the Monte Carlo method\\n\",\n    \"\\n\",\n    \"Before understanding how the Monte Carlo method is useful in reinforcement learning, first, let's understand what the Monte Carlo method is and how it works. The Monte Carlo method is a statistical technique used to find an approximate solution through sampling. \\n\",\n    \"\\n\",\n    \"For instance, the Monte Carlo method approximates the expectation of a random variable by sampling, and the larger the sample size, the better the approximation. Let's suppose we have a random variable X and we need to compute its expected value E[X]; we can compute it by taking the sum of the values of X multiplied by their respective probabilities, as shown below:\\n\",\n    \"\\n\",\n    \"$$ \\\\mathbb{E}[X] = \\\\sum_{i=1}^N x_i p(x_i) $$\\n\",\n    \"\\n\",\n    \"But instead of computing the expectation like this, can we approximate it with the Monte Carlo method? Yes! We can estimate the expected value of X by just sampling the values of X some N times and computing the average value of X as the expected value of X, as shown below:\\n\",\n    \"\\n\",\n    \"$$ \\\\mathbb{E}_{x \\\\sim p(x)}[X]  \\\\approx \\\\frac{1}{N} \\\\sum_i x_i $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"When N is larger, our approximation will be better. Thus, with the Monte Carlo method, we can approximate a solution through sampling, and our approximation will be better when the sample size is large.\\n\",\n    \"\\n\",\n    \"In the upcoming sections, we will learn how exactly the Monte Carlo method is used in reinforcement learning. 
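To make this concrete, here is a minimal sketch (not from the original notebook) that estimates the expectation of a fair six-sided die roll by sampling. The true value is $\sum_x x \cdot \frac{1}{6} = 3.5$, and the sample average approaches it as N grows:

```python
import random

random.seed(0)

# exact expectation of a fair six-sided die: sum of x * p(x); equals 3.5
true_expectation = sum(x * (1 / 6) for x in range(1, 7))

# Monte Carlo estimate: sample the die N times and average the outcomes
N = 100_000
samples = [random.randint(1, 6) for _ in range(N)]
estimate = sum(samples) / N

print(true_expectation, round(estimate, 3))
```

Rerunning with a larger N (or averaging over repeated runs) pulls the estimate closer to 3.5, which is exactly the behavior the text describes.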
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "04. Monte Carlo Methods/.ipynb_checkpoints/4.02.  Prediction and control tasks-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Prediction and control tasks\\n\",\n    \"\\n\",\n    \"In reinforcement learning, we perform two important tasks, and they are:\\n\",\n    \"* The prediction task\\n\",\n    \"* The control task\\n\",\n    \"\\n\",\n    \"## Prediction task\\n\",\n    \"In the prediction task, a policy π is given as an input and we try to predict the value\\n\",\n    \"function or Q function using the given policy. But what is the use of doing this? Our\\n\",\n    \"goal is to evaluate the given policy. That is, we need to determine whether the given policy is good or bad. How can we determine that? If the agent obtains\\n\",\n    \"a good return using the given policy, then we can say that our policy is good. Thus,\\n\",\n    \"to evaluate the given policy, we need to understand what return the agent would\\n\",\n    \"obtain if it uses the given policy. To obtain the return, we predict the value function\\n\",\n    \"or Q function using the given policy.\\n\",\n    \"\\n\",\n    \"That is, we learned that the value function or value of a state denotes the expected\\n\",\n    \"return an agent would obtain starting from that state following some policy π. Thus,\\n\",\n    \"by predicting the value function using the given policy π, we can understand what\\n\",\n    \"expected return the agent would obtain in each state if it uses the given\\n\",\n    \"policy π. If the return is good, then we can say that the given policy is good.\\n\",\n    \"\\n\",\n    \"Similarly, we learned that the Q function or Q value denotes the expected return the\\n\",\n    \"agent would obtain starting from the state s and an action a following the policy π.\\n\",\n    \"Thus, by predicting the Q function using the given policy π, we can understand what\\n\",\n    \"expected return the agent would obtain in each state-action pair if it uses the given\\n\",\n    \"policy. 
If the return is good then we can say that the given policy is good.\\n\",\n    \"\\n\",\n    \"Thus, we can evaluate the given policy π by computing the value and Q functions.\\n\",\n    \"Note that, in the prediction task, we don't make any change to the given input policy.\\n\",\n    \"We keep the given policy as fixed and predict the value function or Q function using\\n\",\n    \"the given policy and obtain the expected return. Based on the expected return, we\\n\",\n    \"evaluate the given policy.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Control task\\n\",\n    \"\\n\",\n    \"Unlike the prediction task, in the control task, we will not be given any policy as\\n\",\n    \"an input. In the control task, our goal is to find the optimal policy. So, we will start\\n\",\n    \"off by initializing a random policy and we try to find the optimal policy iteratively.\\n\",\n    \"That is, we try to find an optimal policy that gives the maximum return.\\n\",\n    \"\\n\",\n    \"Thus, in a nutshell, in the prediction task, we evaluate the given input policy by\\n\",\n    \"predicting the value function or Q function, which helps us to understand the\\n\",\n    \"expected return an agent would get if it uses the given policy, while in the control\\n\",\n    \"task our goal is to find the optimal policy and we will not be given any policy as\\n\",\n    \"input; so we will start off by initializing a random policy and we try to find the\\n\",\n    \"optimal policy iteratively.\\n\",\n    \"\\n\",\n    \"Now that we have understood what prediction and control tasks are, in the next\\n\",\n    \"section, we will learn how to use the Monte Carlo method for performing the\\n\",\n    \"prediction and control tasks.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n  
 \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "04. Monte Carlo Methods/.ipynb_checkpoints/4.05. Every-visit MC Prediction with Blackjack Game-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Every-visit MC prediction with blackjack game\\n\",\n    \"\\n\",\n    \"To understand this section clearly, you can recap the every-visit Monte Carlo method we\\n\",\n    \"learned about earlier. Let's now understand how to implement every-visit MC prediction with\\n\",\n    \"the blackjack game step by step:\\n\",\n    \"\\n\",\n    \"Import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import pandas as pd\\n\",\n    \"from collections import defaultdict\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a blackjack environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Blackjack-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining a policy\\n\",\n    \"\\n\",\n    \"We learned that in the prediction method, we will be given an input policy and we predict\\n\",\n    \"the value function of the given input policy. So, now, we first define a policy function\\n\",\n    \"which acts as an input policy. 
That is, we define the input policy whose value function will\\n\",\n    \"be predicted in the upcoming steps.\\n\",\n    \"\\n\",\n    \"As shown below, our policy function takes the state as an input, and if `state[0]`, the sum of\\n\",\n    \"our card values, is greater than `19`, it will return action 0 (stand); else it will return\\n\",\n    \"action 1 (hit):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def policy(state):\\n\",\n    \"    return 0 if state[0] > 19 else 1\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We defined a sensible policy: it makes more sense to perform action 0 (stand)\\n\",\n    \"when our sum value is already greater than 19. That is, when the sum value is greater than\\n\",\n    \"19, we don't have to perform the 1 (hit) action and receive a new card, which may cause us to\\n\",\n    \"go bust and lose the game.\\n\",\n    \"\\n\",\n    \"For example, let's generate an initial state by resetting the environment as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"(20, 7, False)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"print(state)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you can notice, `state[0] = 20`, that is, our sum of card values is 20, so in this case, our\\n\",\n    \"policy will return action 0 (stand) as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    
\"print(policy(state))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have defined the policy, in the next section, we will predict the value\\n\",\n    \"function (state values) of this policy. \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Generating an episode\\n\",\n    \"Next, we generate an episode using the given policy. So, we define a function\\n\",\n    \"called `generate_episode` which takes the policy as an input and generates an episode\\n\",\n    \"using the given policy.\\n\",\n    \"\\n\",\n    \"First, let's set the number of time steps:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timestep = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def generate_episode(policy):\\n\",\n    \"    \\n\",\n    \"    #let's define a list called episode for storing the episode\\n\",\n    \"    episode = []\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #then for each time step\\n\",\n    \"    for i in range(num_timestep):\\n\",\n    \"        \\n\",\n    \"        #select the action according to the given policy\\n\",\n    \"        action = policy(state)\\n\",\n    \"        \\n\",\n    \"        #perform the action and store the next state information\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the state, action, reward into our episode list\\n\",\n    \"        episode.append((state, action, reward))\\n\",\n    \"        \\n\",\n    \"        #if the next state is a final state, break the loop; else, update the current state to the next state\\n\",\n    
\"        if done:\\n\",\n    \"            break\\n\",\n    \"            \\n\",\n    \"        state = next_state\\n\",\n    \"\\n\",\n    \"    return episode\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's take a look at what the output of our `generate_episode` function looks like. Note\\n\",\n    \"that we generate the episode using the policy we defined earlier:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[((12, 10, False), 1, 0), ((15, 10, False), 1, -1)]\"\n      ]\n     },\n     \"execution_count\": 8,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"generate_episode(policy)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe, our output is in the form of `[(state, action, reward)]`. As shown above,\\n\",\n    \"we have two states in our episode. We performed action 1 (hit) in the state `(12, 10,\\n\",\n    \"False)` and received a 0 reward, and again action 1 (hit) in the state `(15, 10, False)` and\\n\",\n    \"received a -1 reward. (Since the environment is initialized randomly, the episode you get\\n\",\n    \"will differ from run to run.)\\n\",\n    \"\\n\",\n    \"Now that we have learned how to generate an episode using the given policy, next, we will\\n\",\n    \"look at how to compute the value of the state (value function) using the every-visit MC\\n\",\n    \"method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Computing the value function\\n\",\n    \"\\n\",\n    \"We learned that in order to predict the value function, we generate several episodes using\\n\",\n    \"the given policy and compute the value of the state as the average return across several\\n\",\n    \"episodes. 
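In symbols (an added note), the every-visit MC estimate of a state's value is simply the accumulated return of that state divided by the number of times the state was visited:

$$V(s) \approx \frac{\text{total\_return}(s)}{N(s)}$$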
Let's see how to implement that.\\n\",\n    \"\\n\",\n    \"First, we define `total_return` and `N` as dictionaries for storing the total return and the\\n\",\n    \"number of times the state is visited across episodes, respectively. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"total_return = defaultdict(float)\\n\",\n    \"N = defaultdict(int)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the number of iterations, that is, the number of episodes we want to generate:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_iterations = 500000\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#then, for every iteration\\n\",\n    \"for i in range(num_iterations):\\n\",\n    \"    \\n\",\n    \"    #generate the episode using the given policy, that is, generate an episode using the policy\\n\",\n    \"    #function we defined earlier\\n\",\n    \"    episode = generate_episode(policy)\\n\",\n    \"    \\n\",\n    \"    #store all the states, actions, rewards obtained from the episode\\n\",\n    \"    states, actions, rewards = zip(*episode)\\n\",\n    \"    \\n\",\n    \"    #then for each step in the episode \\n\",\n    \"    for t, state in enumerate(states):\\n\",\n    \"        \\n\",\n    \"        #compute the return R of the state as the sum of rewards\\n\",\n    \"        R = (sum(rewards[t:]))\\n\",\n    \"        \\n\",\n    \"        #update the total_return of the state\\n\",\n    \"        total_return[state] =  total_return[state] + R\\n\",\n    \"        \\n\",\n    \"        #update the number of times the state is visited in the episode\\n\",\n    \"        N[state] =  
N[state] + 1\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After computing `total_return` and `N`, we can convert them into a pandas data\\n\",\n    \"frame for a better understanding. (Note that this is just to give a clear understanding of the\\n\",\n    \"algorithm; we don't necessarily have to convert to a pandas data frame, and we can also\\n\",\n    \"implement this efficiently using just the dictionaries.)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Convert the `total_return` dictionary to a data frame:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"total_return = pd.DataFrame(total_return.items(),columns=['state', 'total_return'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Convert the counter `N` dictionary to a data frame:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"N = pd.DataFrame(N.items(),columns=['state', 'N'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Merge the two data frames on states:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = pd.merge(total_return, N, on=\\\"state\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Look at the first few rows of the data frame:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    
.dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>state</th>\\n\",\n       \"      <th>total_return</th>\\n\",\n       \"      <th>N</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>(7, 7, False)</td>\\n\",\n       \"      <td>-4.0</td>\\n\",\n       \"      <td>16</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>(11, 7, False)</td>\\n\",\n       \"      <td>19.0</td>\\n\",\n       \"      <td>43</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>(16, 7, False)</td>\\n\",\n       \"      <td>-38.0</td>\\n\",\n       \"      <td>104</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>(19, 7, False)</td>\\n\",\n       \"      <td>55.0</td>\\n\",\n       \"      <td>113</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>(20, 8, False)</td>\\n\",\n       \"      <td>96.0</td>\\n\",\n       \"      <td>129</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>5</td>\\n\",\n       \"      <td>(20, 2, False)</td>\\n\",\n       \"      <td>94.0</td>\\n\",\n       \"      <td>142</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>6</td>\\n\",\n       \"      <td>(15, 5, False)</td>\\n\",\n       \"      <td>-42.0</td>\\n\",\n       \"      
<td>93</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>7</td>\\n\",\n       \"      <td>(20, 5, False)</td>\\n\",\n       \"      <td>62.0</td>\\n\",\n       \"      <td>115</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>8</td>\\n\",\n       \"      <td>(12, 3, False)</td>\\n\",\n       \"      <td>-55.0</td>\\n\",\n       \"      <td>91</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>9</td>\\n\",\n       \"      <td>(15, 3, False)</td>\\n\",\n       \"      <td>-36.0</td>\\n\",\n       \"      <td>96</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"            state  total_return    N\\n\",\n       \"0   (7, 7, False)          -4.0   16\\n\",\n       \"1  (11, 7, False)          19.0   43\\n\",\n       \"2  (16, 7, False)         -38.0  104\\n\",\n       \"3  (19, 7, False)          55.0  113\\n\",\n       \"4  (20, 8, False)          96.0  129\\n\",\n       \"5  (20, 2, False)          94.0  142\\n\",\n       \"6  (15, 5, False)         -42.0   93\\n\",\n       \"7  (20, 5, False)          62.0  115\\n\",\n       \"8  (12, 3, False)         -55.0   91\\n\",\n       \"9  (15, 3, False)         -36.0   96\"\n      ]\n     },\n     \"execution_count\": 15,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.head(10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe from above, we have the total return and\\n\",\n    \"the number of times the state is visited.\\n\",\n    \"\\n\",\n    \"Next, we can compute the value of the state as the average return, thus, we can write:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df['value'] = 
df['total_return']/df['N']\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's look at the first few rows of the data frame:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>state</th>\\n\",\n       \"      <th>total_return</th>\\n\",\n       \"      <th>N</th>\\n\",\n       \"      <th>value</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>(7, 7, False)</td>\\n\",\n       \"      <td>-4.0</td>\\n\",\n       \"      <td>16</td>\\n\",\n       \"      <td>-0.250000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>(11, 7, False)</td>\\n\",\n       \"      <td>19.0</td>\\n\",\n       \"      <td>43</td>\\n\",\n       \"      <td>0.441860</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>(16, 7, False)</td>\\n\",\n       \"      <td>-38.0</td>\\n\",\n       \"      <td>104</td>\\n\",\n       \"      <td>-0.365385</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>3</td>\\n\",\n   
    \"      <td>(19, 7, False)</td>\\n\",\n       \"      <td>55.0</td>\\n\",\n       \"      <td>113</td>\\n\",\n       \"      <td>0.486726</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>(20, 8, False)</td>\\n\",\n       \"      <td>96.0</td>\\n\",\n       \"      <td>129</td>\\n\",\n       \"      <td>0.744186</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>5</td>\\n\",\n       \"      <td>(20, 2, False)</td>\\n\",\n       \"      <td>94.0</td>\\n\",\n       \"      <td>142</td>\\n\",\n       \"      <td>0.661972</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>6</td>\\n\",\n       \"      <td>(15, 5, False)</td>\\n\",\n       \"      <td>-42.0</td>\\n\",\n       \"      <td>93</td>\\n\",\n       \"      <td>-0.451613</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>7</td>\\n\",\n       \"      <td>(20, 5, False)</td>\\n\",\n       \"      <td>62.0</td>\\n\",\n       \"      <td>115</td>\\n\",\n       \"      <td>0.539130</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>8</td>\\n\",\n       \"      <td>(12, 3, False)</td>\\n\",\n       \"      <td>-55.0</td>\\n\",\n       \"      <td>91</td>\\n\",\n       \"      <td>-0.604396</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>9</td>\\n\",\n       \"      <td>(15, 3, False)</td>\\n\",\n       \"      <td>-36.0</td>\\n\",\n       \"      <td>96</td>\\n\",\n       \"      <td>-0.375000</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"            state  total_return    N     value\\n\",\n       \"0   (7, 7, False)          -4.0   16 -0.250000\\n\",\n       \"1  (11, 7, False)          19.0   43  0.441860\\n\",\n       \"2  (16, 7, False)         -38.0  104 -0.365385\\n\",\n       \"3  
(19, 7, False)          55.0  113  0.486726\\n\",\n       \"4  (20, 8, False)          96.0  129  0.744186\\n\",\n       \"5  (20, 2, False)          94.0  142  0.661972\\n\",\n       \"6  (15, 5, False)         -42.0   93 -0.451613\\n\",\n       \"7  (20, 5, False)          62.0  115  0.539130\\n\",\n       \"8  (12, 3, False)         -55.0   91 -0.604396\\n\",\n       \"9  (15, 3, False)         -36.0   96 -0.375000\"\n      ]\n     },\n     \"execution_count\": 17,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.head(10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"As we can observe, we now have the value of each state, which is just the average return\\n\",\n    \"of the state across several episodes. Thus, we have successfully predicted the value function\\n\",\n    \"of the given policy using the every-visit MC method.\\n\",\n    \"\\n\",\n    \"Okay, let's check the value of some states and understand how accurately our value\\n\",\n    \"function is estimated according to the given policy. Recall that when we started off, to\\n\",\n    \"generate episodes, we used the optimal policy, which selects action 0 (stand) when the sum\\n\",\n    \"value is greater than 19 and action 1 (hit) otherwise.\\n\",\n    \"\\n\",\n    \"Let's evaluate the value of the state `(21,9,False)`. Our sum of cards is already 21,\\n\",\n    \"so this is a good state and it should have a high value. Let's see what\\n\",\n    \"our estimated value of this state is:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([0.90163934])\"\n      ]\n     },\n     \"execution_count\": 18,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df[df['state']==(21,9,False)]['value'].values\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe above, the value of this state is high.\\n\",\n    \"Now, let's check the value of the state `(5,8,False)`. Our sum of cards is just 5,\\n\",\n    \"while the dealer's face-up card has a high value, 8, so in this case\\n\",\n    \"the value of the state should be low. Let's see what our estimated value of this state is:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([0.08333333])\"\n      ]\n     },\n     \"execution_count\": 19,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df[df['state']==(5,8,False)]['value'].values\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can notice, the value of this state is low.\\n\",\n    \"Thus, we learned how to predict the value function of the given policy using the every-visit\\n\",\n    \"MC prediction method. In the next section, we will look at how to compute the value of the\\n\",\n    \"state using the first-visit MC method.
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "04. Monte Carlo Methods/.ipynb_checkpoints/4.06. First-visit MC Prediction with Blackjack Game-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# First-visit MC prediction with blackjack game\\n\",\n    \"\\n\",\n    \"To understand this section clearly, you can recap the first-visit Monte Carlo method we\\n\",\n    \"learned earlier. Let's now understand how to implement first-visit MC prediction with\\n\",\n    \"the blackjack game step by step.\\n\",\n    \"\\n\",\n    \"Import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import pandas as pd\\n\",\n    \"from collections import defaultdict\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a blackjack environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Blackjack-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining a policy\\n\",\n    \"\\n\",\n    \"We learned that in the prediction method, we will be given an input policy and we predict\\n\",\n    \"the value function of the given input policy. So, we first define a policy function\\n\",\n    \"which acts as an input policy. 
That is, we define the input policy whose value function will\\n\",\n    \"be predicted in the upcoming steps.\\n\",\n    \"\\n\",\n    \"As shown below, our policy function takes the state as an input, and if `state[0]`, the sum of\\n\",\n    \"our cards, is greater than 19, it will return action 0 (stand); else it will return\\n\",\n    \"action 1 (hit):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def policy(state):\\n\",\n    \"    return 0 if state[0] > 19 else 1\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We defined an optimal policy: it makes sense to perform action 0 (stand)\\n\",\n    \"when our sum value is already greater than 19, since performing action 1 (hit) and\\n\",\n    \"receiving a new card may cause us to go bust and lose the game.\\n\",\n    \"\\n\",\n    \"For example, let's generate an initial state by resetting the environment as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"(11, 6, False)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"print(state)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you can notice, `state[0] = 11`, that is, our sum of cards value is 11, so in this case our\\n\",\n    \"policy will return action 1 (hit) as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    
\"print(policy(state))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have defined the policy, in the next section we will predict the value\\n\",\n    \"function (state values) of this policy. \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Generating an episode\\n\",\n    \"Next, we generate an episode using the given policy. We define a function\\n\",\n    \"called `generate_episode` which takes the policy as an input and generates an episode\\n\",\n    \"by following the given policy.\\n\",\n    \"\\n\",\n    \"First, let's set the maximum number of time steps:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timestep = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def generate_episode(policy):\\n\",\n    \"    \\n\",\n    \"    #let's define a list called episode for storing the episode\\n\",\n    \"    episode = []\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #then for each time step\\n\",\n    \"    for i in range(num_timestep):\\n\",\n    \"        \\n\",\n    \"        #select the action according to the given policy\\n\",\n    \"        action = policy(state)\\n\",\n    \"        \\n\",\n    \"        #perform the action and store the next state information\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the state, action, reward into our episode list\\n\",\n    \"        episode.append((state, action, reward))\\n\",\n    \"        \\n\",\n    \"        #If the next state is a final state then break the loop else update the next state to the current state\\n\",\n    
\"        if done:\\n\",\n    \"            break\\n\",\n    \"            \\n\",\n    \"        state = next_state\\n\",\n    \"\\n\",\n    \"    return episode\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's take a look at what the output of our `generate_episode` function looks like. Note\\n\",\n    \"that we generate the episode using the policy we defined earlier:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[((15, 10, False), 1, -1)]\"\n      ]\n     },\n     \"execution_count\": 8,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"generate_episode(policy)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe, our output is a list of `(state, action, reward)` tuples. In the episode\\n\",\n    \"shown above, we performed action 1 (hit) in the state `(15, 10, False)` and received a\\n\",\n    \"reward of -1, meaning we lost this round. Since the initial state and the drawn cards are\\n\",\n    \"random, the episode will differ on every run.\\n\",\n    \"\\n\",\n    \"Now that we have learned how to generate an episode using the given policy, next, we will\\n\",\n    \"look at how to compute the value of the state (value function) using the first-visit MC\\n\",\n    \"method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Computing the value function\\n\",\n    \"\\n\",\n    \"We learned that in order to predict the value function, we generate several episodes using\\n\",\n    \"the given policy and compute the value of the state as the average return across several\\n\",\n    \"episodes. 
Let's see how to implement that.\\n\",\n    \"\\n\",\n    \"First, we define `total_return` and `N` as dictionaries for storing the total return and the\\n\",\n    \"number of times the state is visited across episodes, respectively. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"total_return = defaultdict(float)\\n\",\n    \"N = defaultdict(int)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the number of iterations, that is, the number of episodes we want to generate:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_iterations = 10000\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#then, for every iteration\\n\",\n    \"for i in range(num_iterations):\\n\",\n    \"    \\n\",\n    \"    #generate an episode using the given policy, that is, using the policy\\n\",\n    \"    #function we defined earlier\\n\",\n    \"    episode = generate_episode(policy)\\n\",\n    \"    \\n\",\n    \"    #store all the states, actions, rewards obtained from the episode\\n\",\n    \"    states, actions, rewards = zip(*episode)\\n\",\n    \"    \\n\",\n    \"    #then, for each step in the episode\\n\",\n    \"    for t, state in enumerate(states):\\n\",\n    \"        \\n\",\n    \"        #process the state only on its first visit in this episode\\n\",\n    \"        if state not in states[0:t]:\\n\",\n    \"                \\n\",\n    \"            #compute the return R of the state as the sum of rewards from time step t onwards\\n\",\n    \"            R = (sum(rewards[t:]))\\n\",\n    \"            \\n\",\n    \"            #update the total_return of the state\\n\",\n    \"            total_return[state] =  total_return[state] + R\\n\",\n    \"         
   \\n\",\n    \"            #update the number of times the state is visited across episodes\\n\",\n    \"            N[state] =  N[state] + 1\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After computing the `total_return` and `N`, we can convert them into a pandas data\\n\",\n    \"frame for better readability. [Note that this is just to give a clear understanding of the\\n\",\n    \"algorithm; we don't necessarily have to convert to a pandas data frame, we can also\\n\",\n    \"implement this efficiently using just the dictionaries.]\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Convert the `total_return` dictionary to a data frame:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"total_return = pd.DataFrame(total_return.items(),columns=['state', 'total_return'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Convert the counter `N` dictionary to a data frame:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"N = pd.DataFrame(N.items(),columns=['state', 'N'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Merge the two data frames on states:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = pd.merge(total_return, N, on=\\\"state\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Look at the first few rows of the data frame:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr 
th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>state</th>\\n\",\n       \"      <th>total_return</th>\\n\",\n       \"      <th>N</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>(16, 3, False)</td>\\n\",\n       \"      <td>-53.0</td>\\n\",\n       \"      <td>98</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>(11, 9, False)</td>\\n\",\n       \"      <td>6.0</td>\\n\",\n       \"      <td>49</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>(21, 9, False)</td>\\n\",\n       \"      <td>67.0</td>\\n\",\n       \"      <td>68</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>(9, 8, False)</td>\\n\",\n       \"      <td>-3.0</td>\\n\",\n       \"      <td>27</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>(16, 8, False)</td>\\n\",\n       \"      <td>-60.0</td>\\n\",\n       \"      <td>95</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>5</td>\\n\",\n       \"      <td>(17, 8, False)</td>\\n\",\n       \"      <td>-69.0</td>\\n\",\n       \"      <td>117</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<td>6</td>\\n\",\n       \"      <td>(12, 5, False)</td>\\n\",\n       \"      <td>-38.0</td>\\n\",\n       \"      <td>91</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>7</td>\\n\",\n       \"      <td>(20, 5, False)</td>\\n\",\n       \"      <td>95.0</td>\\n\",\n       \"      <td>122</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>8</td>\\n\",\n       \"      <td>(9, 6, False)</td>\\n\",\n       \"      <td>2.0</td>\\n\",\n       \"      <td>39</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>9</td>\\n\",\n       \"      <td>(14, 6, False)</td>\\n\",\n       \"      <td>-53.0</td>\\n\",\n       \"      <td>115</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"            state  total_return    N\\n\",\n       \"0  (16, 3, False)         -53.0   98\\n\",\n       \"1  (11, 9, False)           6.0   49\\n\",\n       \"2  (21, 9, False)          67.0   68\\n\",\n       \"3   (9, 8, False)          -3.0   27\\n\",\n       \"4  (16, 8, False)         -60.0   95\\n\",\n       \"5  (17, 8, False)         -69.0  117\\n\",\n       \"6  (12, 5, False)         -38.0   91\\n\",\n       \"7  (20, 5, False)          95.0  122\\n\",\n       \"8   (9, 6, False)           2.0   39\\n\",\n       \"9  (14, 6, False)         -53.0  115\"\n      ]\n     },\n     \"execution_count\": 15,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.head(10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe from above, we have the total return and\\n\",\n    \"the number of times the state is visited.\\n\",\n    \"\\n\",\n    \"Next, we can compute the value of the state as the average return, thus, we can write:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df['value'] = df['total_return']/df['N']\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's look at the first few rows of the data frame:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>state</th>\\n\",\n       \"      <th>total_return</th>\\n\",\n       \"      <th>N</th>\\n\",\n       \"      <th>value</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>(16, 3, False)</td>\\n\",\n       \"      <td>-53.0</td>\\n\",\n       \"      <td>98</td>\\n\",\n       \"      <td>-0.540816</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>(11, 9, False)</td>\\n\",\n       \"      <td>6.0</td>\\n\",\n       \"      <td>49</td>\\n\",\n       \"      <td>0.122449</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>(21, 9, False)</td>\\n\",\n       \"      <td>67.0</td>\\n\",\n       \"      <td>68</td>\\n\",\n       \"      
<td>0.985294</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>(9, 8, False)</td>\\n\",\n       \"      <td>-3.0</td>\\n\",\n       \"      <td>27</td>\\n\",\n       \"      <td>-0.111111</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>(16, 8, False)</td>\\n\",\n       \"      <td>-60.0</td>\\n\",\n       \"      <td>95</td>\\n\",\n       \"      <td>-0.631579</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>5</td>\\n\",\n       \"      <td>(17, 8, False)</td>\\n\",\n       \"      <td>-69.0</td>\\n\",\n       \"      <td>117</td>\\n\",\n       \"      <td>-0.589744</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>6</td>\\n\",\n       \"      <td>(12, 5, False)</td>\\n\",\n       \"      <td>-38.0</td>\\n\",\n       \"      <td>91</td>\\n\",\n       \"      <td>-0.417582</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>7</td>\\n\",\n       \"      <td>(20, 5, False)</td>\\n\",\n       \"      <td>95.0</td>\\n\",\n       \"      <td>122</td>\\n\",\n       \"      <td>0.778689</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>8</td>\\n\",\n       \"      <td>(9, 6, False)</td>\\n\",\n       \"      <td>2.0</td>\\n\",\n       \"      <td>39</td>\\n\",\n       \"      <td>0.051282</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>9</td>\\n\",\n       \"      <td>(14, 6, False)</td>\\n\",\n       \"      <td>-53.0</td>\\n\",\n       \"      <td>115</td>\\n\",\n       \"      <td>-0.460870</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"            state  total_return    N     value\\n\",\n       \"0  (16, 3, False)         -53.0   98 -0.540816\\n\",\n       \"1  (11, 9, False)  
         6.0   49  0.122449\\n\",\n       \"2  (21, 9, False)          67.0   68  0.985294\\n\",\n       \"3   (9, 8, False)          -3.0   27 -0.111111\\n\",\n       \"4  (16, 8, False)         -60.0   95 -0.631579\\n\",\n       \"5  (17, 8, False)         -69.0  117 -0.589744\\n\",\n       \"6  (12, 5, False)         -38.0   91 -0.417582\\n\",\n       \"7  (20, 5, False)          95.0  122  0.778689\\n\",\n       \"8   (9, 6, False)           2.0   39  0.051282\\n\",\n       \"9  (14, 6, False)         -53.0  115 -0.460870\"\n      ]\n     },\n     \"execution_count\": 17,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.head(10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"As we can observe, we now have the value of each state, which is just the average return\\n\",\n    \"of the state across several episodes. Thus, we have successfully predicted the value function\\n\",\n    \"of the given policy using the first-visit MC method.\\n\",\n    \"\\n\",\n    \"Okay, let's check the value of some states and understand how accurately our value\\n\",\n    \"function is estimated according to the given policy. Recall that when we started off, to\\n\",\n    \"generate episodes, we used the optimal policy, which selects action 0 (stand) when the sum\\n\",\n    \"value is greater than 19 and action 1 (hit) otherwise.\\n\",\n    \"\\n\",\n    \"Let's evaluate the value of the state `(21,9,False)`. Our sum of cards is already 21,\\n\",\n    \"so this is a good state and it should have a high value. Let's see what\\n\",\n    \"our estimated value of this state is:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([0.98529412])\"\n      ]\n     },\n     \"execution_count\": 21,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df[df['state']==(21,9,False)]['value'].values\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe above, the value of this state is high.\\n\",\n    \"Now, let's check the value of the state `(5,8,False)`. Our sum of cards is just 5,\\n\",\n    \"while the dealer's face-up card has a high value, 8, so in this case\\n\",\n    \"the value of the state should be low. Let's see what our estimated value of this state is:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([-0.55555556])\"\n      ]\n     },\n     \"execution_count\": 22,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df[df['state']==(5,8,False)]['value'].values\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can notice, the value of this state is low.\\n\",\n    \"Thus, we learned how to predict the value function of the given policy using the first-visit\\n\",\n    \"MC prediction method.
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "04. Monte Carlo Methods/4.13. Implementing On-Policy MC Control.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing On-policy MC control\\n\",\n    \"\\n\",\n    \"Now, let's learn how to implement the MC control method with the epsilon-greedy policy for playing the blackjack game; that is, we will see how we can use the MC control method for\\n\",\n    \"finding the optimal policy in the blackjack game.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import pandas as pd\\n\",\n    \"from collections import defaultdict\\n\",\n    \"import random\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a blackjack environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Blackjack-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the dictionary for storing the Q values:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = defaultdict(float)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the dictionary for storing the total return of the state-action pair:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"total_return = defaultdict(float)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the dictionary for storing the count of the number of times a state-action pair is\\n\",\n    \"visited:\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"N = defaultdict(int)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Define the epsilon-greedy policy\\n\",\n    \"\\n\",\n    \"We learned that we select actions based on the epsilon-greedy policy, so we define a\\n\",\n    \"function called `epsilon_greedy_policy` which takes the state and the Q function as inputs\\n\",\n    \"and returns the action to be performed in the given state:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def epsilon_greedy_policy(state,Q):\\n\",\n    \"    \\n\",\n    \"    #set the epsilon value to 0.5\\n\",\n    \"    epsilon = 0.5\\n\",\n    \"    \\n\",\n    \"    #sample a random value from the uniform distribution; if the sampled value is less than\\n\",\n    \"    #epsilon, we select a random action, otherwise we select the best action, that is, the one\\n\",\n    \"    #with the maximum Q value, as shown below\\n\",\n    \"    \\n\",\n    \"    if random.uniform(0,1) < epsilon:\\n\",\n    \"        return env.action_space.sample()\\n\",\n    \"    else:\\n\",\n    \"        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Generating an episode\\n\",\n    \"\\n\",\n    \"Now, let's generate an episode using the epsilon-greedy policy. 
We define a function called\\n\",\n    \"`generate_episode` which takes the Q value as an input and returns the episode.\\n\",\n    \"\\n\",\n    \"First, let's set the number of time steps:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def generate_episode(Q):\\n\",\n    \"    \\n\",\n    \"    #initialize a list for storing the episode\\n\",\n    \"    episode = []\\n\",\n    \"    \\n\",\n    \"    #initialize the state using the reset function\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #then for each time step\\n\",\n    \"    for t in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #select the action according to the epsilon-greedy policy\\n\",\n    \"        action = epsilon_greedy_policy(state,Q)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action and store the next state information\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the state, action, reward in the episode list\\n\",\n    \"        episode.append((state, action, reward))\\n\",\n    \"        \\n\",\n    \"        #if the next state is a final state then break the loop else update the next state to the current\\n\",\n    \"        #state\\n\",\n    \"        if done:\\n\",\n    \"            break\\n\",\n    \"            \\n\",\n    \"        state = next_state\\n\",\n    \"\\n\",\n    \"    return episode\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Computing the optimal policy\\n\",\n    \"\\n\",\n    \"Now, let's learn how to compute the optimal policy. 
First, let's set the number of iterations, that is, the number of episodes we want to generate:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_iterations = 50000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We learned that in the on-policy control method, we will not be given any policy as an\\n\",\n    \"input. So, we initialize a random policy in the first iteration and improve the policy\\n\",\n    \"iteratively by computing the Q value. Since we extract the policy from the Q function, we don't\\n\",\n    \"have to explicitly define the policy. As the Q value improves, the policy also improves\\n\",\n    \"implicitly. That is, in the first iteration we generate an episode by extracting the policy\\n\",\n    \"(epsilon-greedy) from the initialized Q function. Over a series of iterations, we will find the\\n\",\n    \"optimal Q function and hence the optimal policy. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each iteration\\n\",\n    \"for i in range(num_iterations):\\n\",\n    \"    \\n\",\n    \"    #generate an episode by passing our current Q function\\n\",\n    \"    episode = generate_episode(Q)\\n\",\n    \"    \\n\",\n    \"    #get all the state-action pairs in the episode\\n\",\n    \"    all_state_action_pairs = [(s, a) for (s,a,r) in episode]\\n\",\n    \"    \\n\",\n    \"    #store all the rewards obtained in the episode in the rewards list\\n\",\n    \"    rewards = [r for (s,a,r) in episode]\\n\",\n    \"\\n\",\n    \"    #for each step in the episode \\n\",\n    \"    for t, (state, action, reward) in enumerate(episode):\\n\",\n    \"\\n\",\n    \"        #if the state-action pair is occurring for the first time in the episode\\n\",\n    \"        if not (state, action) in all_state_action_pairs[0:t]:\\n\",\n    \"            \\n\",\n    \"            #compute the return R of the state-action pair as the sum of rewards\\n\",\n    \"            R = sum(rewards[t:])\\n\",\n    \"            \\n\",\n    \"            #update total return of the state-action pair\\n\",\n    \"            total_return[(state,action)] = total_return[(state,action)] + R\\n\",\n    \"            \\n\",\n    \"            #update the number of times the state-action pair is visited\\n\",\n    \"            N[(state, action)] += 1\\n\",\n    \"\\n\",\n    \"            #compute the Q value by just taking the average\\n\",\n    \"            Q[(state,action)] = total_return[(state, action)] / N[(state, action)]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Thus, on every iteration, the Q value improves and so does the policy.\\n\",\n    \"After all the iterations, we can look at the Q value of each state-action pair in a pandas\\n\",\n    \"data frame for more 
clarity.\\n\",\n    \"\\n\",\n    \"First, let's convert the Q value dictionary to a pandas data\\n\",\n    \"frame:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = pd.DataFrame(Q.items(),columns=['state_action pair','value'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's look at the first few rows of the data frame:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>state_action pair</th>\\n\",\n       \"      <th>value</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>((14, 10, False), 0)</td>\\n\",\n       \"      <td>-0.641944</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>((14, 10, False), 1)</td>\\n\",\n       \"      <td>-0.617698</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>((11, 10, False), 1)</td>\\n\",\n       \"      <td>-0.170015</td>\\n\",\n       \"    </tr>\\n\",\n       \"    
<tr>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>((12, 3, False), 0)</td>\\n\",\n       \"      <td>-0.180328</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>((12, 3, False), 1)</td>\\n\",\n       \"      <td>-0.320388</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>5</td>\\n\",\n       \"      <td>((13, 1, False), 0)</td>\\n\",\n       \"      <td>-0.752381</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>6</td>\\n\",\n       \"      <td>((11, 6, False), 1)</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>7</td>\\n\",\n       \"      <td>((17, 6, False), 0)</td>\\n\",\n       \"      <td>-0.118644</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>8</td>\\n\",\n       \"      <td>((10, 9, False), 0)</td>\\n\",\n       \"      <td>-0.714286</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>9</td>\\n\",\n       \"      <td>((10, 9, False), 1)</td>\\n\",\n       \"      <td>-0.041322</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>10</td>\\n\",\n       \"      <td>((14, 4, False), 0)</td>\\n\",\n       \"      <td>-0.148289</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"       state_action pair     value\\n\",\n       \"0   ((14, 10, False), 0) -0.641944\\n\",\n       \"1   ((14, 10, False), 1) -0.617698\\n\",\n       \"2   ((11, 10, False), 1) -0.170015\\n\",\n       \"3    ((12, 3, False), 0) -0.180328\\n\",\n       \"4    ((12, 3, False), 1) -0.320388\\n\",\n       \"5    ((13, 1, False), 0) -0.752381\\n\",\n       \"6    ((11, 6, False), 1)  0.000000\\n\",\n       \"7    ((17, 6, False), 0) -0.118644\\n\",\n       \"8    ((10, 9, False), 0) 
-0.714286\\n\",\n       \"9    ((10, 9, False), 1) -0.041322\\n\",\n       \"10   ((14, 4, False), 0) -0.148289\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.head(11)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can observe, we have the Q values for all the state-action pairs. Now we can extract\\n\",\n    \"the policy by selecting, in each state, the action that has the maximum Q value. \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To learn more about how to select an action based on these Q values, check the book under the section Implementing on-policy MC control.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "04. Monte Carlo Methods/README.md",
    "content": "### 4. Monte Carlo Methods\n* 4.1. Understanding the Monte Carlo Method\n* 4.2. Prediction and Control Tasks\n   * 4.2.1. Prediction Task\n   * 4.2.2. Control Task\n* 4.3. Monte Carlo Prediction\n   * 4.3.1. MC Prediction Algorithm\n   * 4.3.2. Types of MC Prediction\n   * 4.3.3. First-visit Monte Carlo\n   * 4.3.4. Every-visit Monte Carlo\n* 4.4. Understanding the BlackJack Game\n   * 4.4.1. Blackjack Environment in the Gym\n* 4.5. Every-visit MC Prediction with Blackjack Game\n* 4.6. First-visit MC Prediction with Blackjack Game\n* 4.7. Incremental Mean Updates\n* 4.8. MC Prediction (Q Function)\n* 4.9. Monte Carlo Control\n* 4.10. On-Policy Monte Carlo Control\n* 4.11. Monte Carlo Exploring Starts\n* 4.12. Monte Carlo with Epsilon-Greedy Policy\n   * 4.12.1. Algorithm: MC Control with Epsilon-Greedy Policy\n* 4.13. Implementing On-Policy MC Control\n* 4.14. Off-Policy Monte Carlo Control\n* 4.15. Is MC Method Applicable to all Tasks?"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/.ipynb_checkpoints/5.03. Predicting the Value of States in a Frozen Lake Environment-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Predicting the value of states in a frozen lake environment\\n\",\n    \"We learned that in the prediction method, the policy is given as an input and we predict the\\n\",\n    \"value function using the given policy. So let's initialize a random policy and predict the\\n\",\n    \"value function (state values) of the frozen lake environment using the random policy.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import pandas as pd\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we create the frozen lake environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('FrozenLake-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true,\n    \"scrolled\": true\n   },\n   \"source\": [\n    \"Define the random policy, which returns a random action by sampling from the action\\n\",\n    \"space:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def random_policy():\\n\",\n    \"    return env.action_space.sample()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's define the dictionary for storing the value of the states and initialize the value of all the\\n\",\n    \"states to 0.0:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"V = {}\\n\",\n    \"for s in range(env.observation_space.n):\\n\",\n    
V[s] = 0.0\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the discount factor $\\\\gamma$ and the learning rate $\\\\alpha$: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"alpha = 0.85\\n\",\n    \"gamma = 0.90\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes and number of time steps in the episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 5000\\n\",\n    \"num_timesteps = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Computing the value of states\\n\",\n    \"Now, let's compute the value function (state values) using the given random policy as:\\n\",\n    \"\\n\",\n    \"$$V(s) = V(s) + \\\\alpha (r + \\\\gamma V(s') - V(s)) $$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    s = env.reset()\\n\",\n    \"    \\n\",\n    \"    #for every step in the episode\\n\",\n    \"    for t in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #select an action according to random policy\\n\",\n    \"        a = random_policy()\\n\",\n    \"        \\n\",\n    \"        #perform the selected action and store the next state information\\n\",\n    \"        s_, r, done, _ = env.step(a)\\n\",\n    \"        \\n\",\n    \"        #compute the value of the state\\n\",\n    \"        V[s] += alpha * (r + gamma * V[s_]-V[s])\\n\",\n    \"        \\n\",\n    \"        #update next state to the current 
state\\n\",\n    \"        s = s_\\n\",\n    \"        \\n\",\n    \"        #if the current state is the terminal state then break\\n\",\n    \"        if done:\\n\",\n    \"            break\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all the iterations, we will have a value of all the states according to the given random\\n\",\n    \"policy. \\n\",\n    \"\\n\",\n    \"## Evaluating the value of states \\n\",\n    \"\\n\",\n    \"Now, let's evaluate our value function (state values). First, let's convert our value dictionary\\n\",\n    \"to a pandas data frame for more clarity:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = pd.DataFrame(list(V.items()), columns=['state', 'value'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Before checking the value of the states, let's recollect that in the gym all the states in the\\n\",\n    \"frozen lake environment will be encoded into numbers. 
Since we have 16 states, all the\\n\",\n    \"states will be encoded into numbers from 0 to 15 as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"![title](Images/1.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"Now, Let's check the value of the states:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>state</th>\\n\",\n       \"      <th>value</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0.004119</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0.000092</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>0.000428</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>0.000088</td>\\n\",\n    
   \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>0.001522</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>5</td>\\n\",\n       \"      <td>5</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>6</td>\\n\",\n       \"      <td>6</td>\\n\",\n       \"      <td>0.000089</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>7</td>\\n\",\n       \"      <td>7</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>8</td>\\n\",\n       \"      <td>8</td>\\n\",\n       \"      <td>0.000141</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>9</td>\\n\",\n       \"      <td>9</td>\\n\",\n       \"      <td>0.526370</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>10</td>\\n\",\n       \"      <td>10</td>\\n\",\n       \"      <td>0.009634</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>11</td>\\n\",\n       \"      <td>11</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>13</td>\\n\",\n       \"      <td>13</td>\\n\",\n       \"      <td>0.273292</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>14</td>\\n\",\n       \"      <td>14</td>\\n\",\n       \"      <td>0.081485</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>15</td>\\n\",\n       \"      <td>15</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n 
     \"text/plain\": [\n       \"    state     value\\n\",\n       \"0       0  0.004119\\n\",\n       \"1       1  0.000092\\n\",\n       \"2       2  0.000428\\n\",\n       \"3       3  0.000088\\n\",\n       \"4       4  0.001522\\n\",\n       \"5       5  0.000000\\n\",\n       \"6       6  0.000089\\n\",\n       \"7       7  0.000000\\n\",\n       \"8       8  0.000141\\n\",\n       \"9       9  0.526370\\n\",\n       \"10     10  0.009634\\n\",\n       \"11     11  0.000000\\n\",\n       \"12     12  0.000000\\n\",\n       \"13     13  0.273292\\n\",\n       \"14     14  0.081485\\n\",\n       \"15     15  0.000000\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"As we can observe, now we have the value of all the states and also we can notice that\\n\",\n    \"the value of all the terminal states (hole states and goal state) is zero.\\n\",\n    \"\\n\",\n    \"Now that we have understood how TD learning can be used for the prediction task, in the\\n\",\n    \"next section, we will learn how to use TD learning for the control task. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/.ipynb_checkpoints/5.08. Computing the Optimal Policy using Q Learning-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Computing the optimal policy using Q Learning\\n\",\n    \"\\n\",\n    \"Now, let's implement Q learning to find the optimal policy in the frozen lake environment:\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import pandas as pd\\n\",\n    \"import random\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we create the frozen lake environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('FrozenLake-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's define the dictionary for storing the Q value of the state-action pair and we initialize\\n\",\n    \"the Q value of all the state-action pair to 0.0:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"\\n\",\n    \"Q = {}\\n\",\n    \"for s in range(env.observation_space.n):\\n\",\n    \"    for a in range(env.action_space.n):\\n\",\n    \"        Q[(s,a)] = 0.0\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's define the epsilon-greedy policy. 
We draw a random number from the\\n\",\n    \"uniform distribution; if the random number is less than epsilon, we select a random\\n\",\n    \"action, else we select the best action, the one with the maximum Q value: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def epsilon_greedy(state, epsilon):\\n\",\n    \"    if random.uniform(0,1) < epsilon:\\n\",\n    \"        return env.action_space.sample()\\n\",\n    \"    else:\\n\",\n    \"        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the discount factor $\\\\gamma$, the learning rate $\\\\alpha$, and the epsilon value:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"alpha = 0.85\\n\",\n    \"gamma = 0.90\\n\",\n    \"epsilon = 0.8\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the number of episodes and the number of time steps per episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 5000\\n\",\n    \"num_steps = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Compute the optimal policy using the Q learning update rule as:\\n\",\n    \"\\n\",\n    \"$$ Q(s,a) = Q(s,a) + \\\\alpha (r + \\\\gamma \\\\max_{a'} Q(s',a') - Q(s,a)) $$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode:\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    s = env.reset()\\n\",\n    \"  
  \\n\",\n    \"    #for each step in the episode\\n\",\n    \"    for t in range(num_steps):\\n\",\n    \"        \\n\",\n    \"        #select the action using the epsilon-greedy policy\\n\",\n    \"        a = epsilon_greedy(s,epsilon)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action and store the next state information\\n\",\n    \"        s_, r, done, _ = env.step(a)\\n\",\n    \"        \\n\",\n    \"        #first, select the action a dash which has a maximum Q value in the next state\\n\",\n    \"        a_ = np.argmax([Q[(s_, a)] for a in range(env.action_space.n)])\\n\",\n    \"    \\n\",\n    \"        # we calculate the Q value of previous state using our update rule\\n\",\n    \"        Q[(s,a)] += alpha * (r + gamma * Q[(s_,a_)]-Q[(s,a)])\\n\",\n    \"    \\n\",\n    \"        #update current state to next state\\n\",\n    \"        s = s_\\n\",\n    \"        \\n\",\n    \"        #if the current state is the terminal state then break  \\n\",\n    \"        if done:\\n\",\n    \"            break\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all the iterations, we will have the optimal Q function. Then we can extract the\\n\",\n    \"optimal policy by selecting the action which has maximum Q value in each state. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/5.03. Predicting the Value of States in a Frozen Lake Environment.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Predicting the value of states in a frozen lake environment\\n\",\n    \"We learned that in the prediction method, the policy is given as an input and we predict\\n\",\n    \"value function using the given policy. So let's initialize a random policy and predict the\\n\",\n    \"value function (state values) of the frozen lake environment using the random policy.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import pandas as pd\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we create the frozen lake environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('FrozenLake-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true,\n    \"scrolled\": true\n   },\n   \"source\": [\n    \"Define the random policy which returns the random action by sampling from the action\\n\",\n    \"space:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def random_policy():\\n\",\n    \"    return env.action_space.sample()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's define the dictionary for storing the value of states and we initialize the value of all the\\n\",\n    \"states to 0.0:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"V = {}\\n\",\n    \"for s in range(env.observation_space.n):\\n\",\n    \"    
V[s] = 0.0\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the discount factor $\\\\gamma$ and the learning rate $\\\\alpha$: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"alpha = 0.85\\n\",\n    \"gamma = 0.90\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes and the number of time steps per episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 5000\\n\",\n    \"num_timesteps = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Computing the value of states\\n\",\n    \"Now, let's compute the value function (state values) using the given random policy as:\\n\",\n    \"\\n\",\n    \"$$V(s) = V(s) + \\\\alpha (r + \\\\gamma V(s') - V(s)) $$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    s = env.reset()\\n\",\n    \"    \\n\",\n    \"    #for every step in the episode\\n\",\n    \"    for t in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #select an action according to the random policy\\n\",\n    \"        a = random_policy()\\n\",\n    \"        \\n\",\n    \"        #perform the selected action and store the next state information\\n\",\n    \"        s_, r, done, _ = env.step(a)\\n\",\n    \"        \\n\",\n    \"        #compute the value of the state\\n\",\n    \"        V[s] += alpha * (r + gamma * V[s_]-V[s])\\n\",\n    \"        \\n\",\n    \"        #update the current state to the next 
state\\n\",\n    \"        s = s_\\n\",\n    \"        \\n\",\n    \"        #if the current state is the terminal state then break\\n\",\n    \"        if done:\\n\",\n    \"            break\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all the iterations, we will have the values of all the states according to the given random\\n\",\n    \"policy. \\n\",\n    \"\\n\",\n    \"## Evaluating the value of states \\n\",\n    \"\\n\",\n    \"Now, let's evaluate our value function (state values). First, let's convert our value dictionary\\n\",\n    \"to a pandas data frame for more clarity:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = pd.DataFrame(list(V.items()), columns=['state', 'value'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Before checking the value of the states, let's recollect that in Gym all the states in the\\n\",\n    \"frozen lake environment are encoded into numbers. 
Since we have 16 states, all the\\n\",\n    \"states will be encoded into numbers from 0 to 15 as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"![title](Images/1.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"Now, Let's check the value of the states:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>state</th>\\n\",\n       \"      <th>value</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0.004119</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0.000092</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>0.000428</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>0.000088</td>\\n\",\n    
   \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>0.001522</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>5</td>\\n\",\n       \"      <td>5</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>6</td>\\n\",\n       \"      <td>6</td>\\n\",\n       \"      <td>0.000089</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>7</td>\\n\",\n       \"      <td>7</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>8</td>\\n\",\n       \"      <td>8</td>\\n\",\n       \"      <td>0.000141</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>9</td>\\n\",\n       \"      <td>9</td>\\n\",\n       \"      <td>0.526370</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>10</td>\\n\",\n       \"      <td>10</td>\\n\",\n       \"      <td>0.009634</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>11</td>\\n\",\n       \"      <td>11</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>13</td>\\n\",\n       \"      <td>13</td>\\n\",\n       \"      <td>0.273292</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>14</td>\\n\",\n       \"      <td>14</td>\\n\",\n       \"      <td>0.081485</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>15</td>\\n\",\n       \"      <td>15</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n 
     \"text/plain\": [\n       \"    state     value\\n\",\n       \"0       0  0.004119\\n\",\n       \"1       1  0.000092\\n\",\n       \"2       2  0.000428\\n\",\n       \"3       3  0.000088\\n\",\n       \"4       4  0.001522\\n\",\n       \"5       5  0.000000\\n\",\n       \"6       6  0.000089\\n\",\n       \"7       7  0.000000\\n\",\n       \"8       8  0.000141\\n\",\n       \"9       9  0.526370\\n\",\n       \"10     10  0.009634\\n\",\n       \"11     11  0.000000\\n\",\n       \"12     12  0.000000\\n\",\n       \"13     13  0.273292\\n\",\n       \"14     14  0.081485\\n\",\n       \"15     15  0.000000\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"As we can observe, now we have the value of all the states and also we can notice that\\n\",\n    \"the value of all the terminal states (hole states and goal state) is zero.\\n\",\n    \"\\n\",\n    \"Now that we have understood how TD learning can be used for the prediction task, in the\\n\",\n    \"next section, we will learn how to use TD learning for the control task. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/5.06. Computing Optimal Policy using SARSA.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Computing optimal policy using SARSA\\n\",\n    \"\\n\",\n    \"Now, let's implement SARSA to find the optimal policy in the frozen lake environment.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import pandas as pd\\n\",\n    \"import random\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we create the frozen lake environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('FrozenLake-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's define the dictionary for storing the Q value of the state-action pair and we initialize\\n\",\n    \"the Q value of all the state-action pair to 0.0:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = {}\\n\",\n    \"for s in range(env.observation_space.n):\\n\",\n    \"    for a in range(env.action_space.n):\\n\",\n    \"        Q[(s,a)] = 0.0\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's define the epsilon-greedy policy. 
We draw a random number from the\\n\",\n    \"uniform distribution; if the random number is less than epsilon, we select a random\\n\",\n    \"action, else we select the best action, the one with the maximum Q value: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def epsilon_greedy(state, epsilon):\\n\",\n    \"    if random.uniform(0,1) < epsilon:\\n\",\n    \"        return env.action_space.sample()\\n\",\n    \"    else:\\n\",\n    \"        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the discount factor $\\\\gamma$, the learning rate $\\\\alpha$, and the epsilon value:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"alpha = 0.85\\n\",\n    \"gamma = 0.90\\n\",\n    \"epsilon = 0.8\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes and the number of time steps per episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 5000\\n\",\n    \"num_timesteps = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Compute the optimal policy using the SARSA update rule as:\\n\",\n    \"\\n\",\n    \"$$ Q(s,a) = Q(s,a) + \\\\alpha (r + \\\\gamma Q(s',a') - Q(s,a)) $$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"       \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    s = env.reset()\\n\",\n   
 \\n\",\n    \"    #select the action using the epsilon-greedy policy\\n\",\n    \"    a = epsilon_greedy(s,epsilon)\\n\",\n    \"    \\n\",\n    \"    #for each step in the episode:\\n\",\n    \"    for t in range(num_timesteps):\\n\",\n    \"\\n\",\n    \"        #perform the selected action and store the next state information: \\n\",\n    \"        s_, r, done, _ = env.step(a)\\n\",\n    \"        \\n\",\n    \"        #select the action a dash in the next state using the epsilon greedy policy:\\n\",\n    \"        a_ = epsilon_greedy(s_,epsilon) \\n\",\n    \"        \\n\",\n    \"        #compute the Q value of the state-action pair\\n\",\n    \"        Q[(s,a)] += alpha * (r + gamma * Q[(s_,a_)]-Q[(s,a)])\\n\",\n    \"        \\n\",\n    \"        #update next state to current state\\n\",\n    \"        s = s_\\n\",\n    \"        \\n\",\n    \"        #update next action to current action\\n\",\n    \"        a = a_\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"        #if the current state is the terminal state then break:\\n\",\n    \"        if done:\\n\",\n    \"            break\\n\",\n    \"     \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that on every iteration we update the Q function. After all the iterations, we will have\\n\",\n    \"the optimal Q function. Once we have the optimal Q function then we can extract the\\n\",\n    \"optimal policy by selecting the action which has maximum Q value in each state. 
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/5.08. Computing the Optimal Policy using Q Learning.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Computing the optimal policy using Q Learning\\n\",\n    \"\\n\",\n    \"Now, let's implement Q learning to find the optimal policy in the frozen lake environment.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import pandas as pd\\n\",\n    \"import random\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we create the frozen lake environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('FrozenLake-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's define the dictionary for storing the Q values of the state-action pairs and\\n\",\n    \"initialize the Q value of every state-action pair to 0.0:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"\\n\",\n    \"Q = {}\\n\",\n    \"for s in range(env.observation_space.n):\\n\",\n    \"    for a in range(env.action_space.n):\\n\",\n    \"        Q[(s,a)] = 0.0\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's define the epsilon-greedy policy. 
We generate a random number from the\n",\n    "uniform distribution; if it is less than epsilon, we select a random action,\n",\n    "otherwise we select the best action, that is, the one with the maximum Q value: "\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def epsilon_greedy(state, epsilon):\\n\",\n    \"    if random.uniform(0,1) < epsilon:\\n\",\n    \"        return env.action_space.sample()\\n\",\n    \"    else:\\n\",\n    \"        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the discount factor $\\\gamma$, the learning rate $\\\alpha$, and the epsilon value:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"alpha = 0.85\\n\",\n    \"gamma = 0.90\\n\",\n    \"epsilon = 0.8\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the number of episodes and the number of time steps per episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 5000\\n\",\n    \"num_steps = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Compute the optimal policy using the Q learning update rule as:\\n\",\n    \"\\n\",\n    \"$$ Q(s,a) = Q(s,a) + \\\alpha (r + \\\gamma \\\max_{a'} Q(s',a') - Q(s,a)) $$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode:\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    s = env.reset()\\n\",\n    \"
  \\n\",\n    \"    #for each step in the episode\\n\",\n    \"    for t in range(num_steps):\\n\",\n    \"        \\n\",\n    \"        #select the action using the epsilon-greedy policy\\n\",\n    \"        a = epsilon_greedy(s,epsilon)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action and store the next state information\\n\",\n    \"        s_, r, done, _ = env.step(a)\\n\",\n    \"        \\n\",\n    \"        #first, select the action a dash which has a maximum Q value in the next state\\n\",\n    \"        a_ = np.argmax([Q[(s_, a)] for a in range(env.action_space.n)])\\n\",\n    \"    \\n\",\n    \"        # we calculate the Q value of previous state using our update rule\\n\",\n    \"        Q[(s,a)] += alpha * (r + gamma * Q[(s_,a_)]-Q[(s,a)])\\n\",\n    \"    \\n\",\n    \"        #update current state to next state\\n\",\n    \"        s = s_\\n\",\n    \"        \\n\",\n    \"        #if the current state is the terminal state then break  \\n\",\n    \"        if done:\\n\",\n    \"            break\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all the iterations, we will have the optimal Q function. Then we can extract the\\n\",\n    \"optimal policy by selecting the action which has maximum Q value in each state. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "05. Understanding Temporal Difference Learning/README.md",
    "content": "### 5. Understanding Temporal Difference Learning\n* 5.1. TD Learning\n* 5.2. TD Prediction\n   * 5.2.1. TD Prediction Algorithm\n* 5.3. Predicting the Value of States in a Frozen Lake Environment\n* 5.4. TD Control\n* 5.5. On-Policy TD Control - SARSA\n* 5.6. Computing Optimal Policy using SARSA\n* 5.7. Off-Policy TD Control - Q Learning\n* 5.8. Computing the Optimal Policy using Q Learning\n* 5.9. The Difference Between Q Learning and SARSA\n* 5.10. Comparing DP, MC, and TD Methods"
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.01 .The MAB Problem-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# The MAB problem\\n\",\n    \"\\n\",\n    \"The MAB problem is one of the classic problems in reinforcement learning. A MAB\\n\",\n    \"is a slot machine where we pull the arm (lever) and get a payout (reward) based on\\n\",\n    \"some probability distribution. A single slot machine is called a one-armed bandit and\\n\",\n    \"when there are multiple slot machines it is called a MAB or k-armed bandit, where k\\n\",\n    \"denotes the number of slot machines.\\n\",\n    \"\\n\",\n    \"The following figure shows a 3-armed bandit:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/7.png)\\n\",\n    \"\\n\",\n    \"Slot machines are one of the most popular games in the casino, where we pull the\\n\",\n    \"arm and get a reward. If we get 0 reward then we lose the game, and if we get +1\\n\",\n    \"reward then we win the game. There can be several slot machines, and each slot\\n\",\n    \"machine is referred to as an arm. For instance, slot machine 1 is referred to as arm\\n\",\n    \"1, slot machine 2 is referred to as arm 2, and so on. Thus, whenever we say arm n,\\n\",\n    \"it actually means that we are referring to slot machine n.\\n\",\n    \"\\n\",\n    \"Each arm has its own probability distribution indicating the probability of winning\\n\",\n    \"and losing the game. For example, let's suppose we have two arms. Let the\\n\",\n    \"probability of winning if we pull arm 1 (slot machine 1) be 0.7 and the probability\\n\",\n    \"of winning if we pull arm 2 (slot machine 2) be 0.5.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Then, if we pull arm 1, 70% of the time we win the game and get the +1 reward, and\\n\",\n    \"if we pull arm 2, then 50% of the time we win the game and get the +1 reward.\\n\",\n    \"Thus, we can say that pulling arm 1 is desirable as it makes us win the game 70% of\\n\",\n    \"the time. 
However, this probability distribution of the arm (slot machine) will not\n",\n    "be given to us. We need to find out which arm helps us to win the game most of the\n",\n    "time and gives us a good reward.\n",\n    "\n",\n    "Okay, how can we find this?\n",\n    "\n",\n    "Say we pulled arm 1 once and received a +1 reward, and we pulled arm 2 once\n",\n    "and received a 0 reward. Since arm 1 gave a +1 reward, we cannot come to the\n",\n    "conclusion that arm 1 is the best arm immediately after pulling it only once. We need\n",\n    "to pull both of the arms many times and compute the average reward we obtain from\n",\n    "each of the arms, and then we can select the arm that gives the maximum average\n",\n    "reward as the best arm.\n",\n    "\n",\n    "Let's denote the arm by $a$ and define the average reward obtained by pulling arm $a$ as:\n",\n    "\n",\n    "$$ Q(a) = \\\frac{\\\text{Sum of rewards obtained from the arm}}{\\\text{Number of times the arm was pulled}}$$\n",\n    "\n",\n    "where $Q(a)$ denotes the average reward of arm $a$.\n",\n    "The optimal arm $a^*$ is the one that gives us the maximum average reward, that is:\n",\n    "\n",\n    "$$ a^* = \\\text{arg} \\\max_a Q(a) $$\n",\n    "\n",\n    "Okay, we have learned that the arm that gives the maximum average reward is the\n",\n    "optimal arm. But how can we find this?\n",\n    "\n",\n    "We play the game for several rounds and we can pull only one arm in each round.\n",\n    "Say in the first round we pull arm 1 and observe the reward, and in the second round\n",\n    "we pull arm 2 and observe the reward. Similarly, in every round, we keep pulling\n",\n    "arm 1 or arm 2 and observe the reward. 
After completing several rounds of the\\n\",\n    \"game, we compute the average reward of each of the arms, and then we select the\\n\",\n    \"arm that has the maximum average reward as the best arm.\\n\",\n    \"\\n\",\n    \"But this is not a good approach to find the best arm. Say we have 20 arms; if we keep\\n\",\n    \"pulling a different arm in each round, then in most of the rounds we will lose the\\n\",\n    \"game and get a 0 reward. Along with finding the best arm, our goal should be to\\n\",\n    \"minimize the cost of identifying the best arm, and this is usually referred to as regret.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, we need to find the best arm while minimizing regret. That is, we need to find\\n\",\n    \"the best arm, but we don't want to end up selecting the arms that make us lose the\\n\",\n    \"game in most of the rounds.\\n\",\n    \"\\n\",\n    \"So, should we explore a different arm in each round, or should we select only the\\n\",\n    \"arm that got us a good reward in the previous rounds? This leads to a situation\\n\",\n    \"called the exploration-exploitation dilemma, which we learned about in Chapter\\n\",\n    \"4, Monte Carlo Methods. So, to resolve this, we use the epsilon-greedy method and\\n\",\n    \"select the arm that got us a good reward in the previous rounds with probability\\n\",\n    \"1-epsilon and select the random arm with probability epsilon. After completing\\n\",\n    \"several rounds, we select the best arm as the one that has the maximum average\\n\",\n    \"reward.\\n\",\n    \"\\n\",\n    \"Similar to the epsilon-greedy method, there are several different exploration\\n\",\n    \"strategies that help us to overcome the exploration-exploitation dilemma. 
In the\\n\",\n    \"upcoming section, we will learn more about several different exploration strategies\\n\",\n    \"in detail and how they help us to find the optimal arm, but first let's look at\\n\",\n    \"creating a bandit.\\n\",\n    \"\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.04. Implementing epsilon-greedy -checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing epsilon-greedy \\n\",\n    \"\\n\",\n    \"Now, let's learn how to implement the epsilon-greedy method to find the best arm.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import gym_bandits\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the bandit environment\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For better understanding, let's create the bandit with only two arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"BanditTwoArmedHighLowFixed-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check the probability distribution of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.8, 0.2]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.p_dist)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can observe that with arm 1 we win the game with 80% probability and with arm 2 we\\n\",\n    \"win the game with 20% probability. Here, the best arm is arm 1, as with arm 1 we win the\\n\",\n    \"game with 80% probability. Now, let's see how to find this best arm using the epsilon-greedy\\n\",\n    \"method. 
"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Initialize the variables\\n\",\n    \"\\n\",\n    \"First, let's initialize the variables:\\n\",\n    \"\\n\",\n    \"Initialize the `count` for storing the number of times an arm is pulled:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"count = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `sum_rewards` for storing the sum of rewards of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sum_rewards = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `Q` for storing the average reward of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define `num_rounds` - number of rounds (iterations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"num_rounds = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the epsilon-greedy method\\n\",\n    \"\\n\",\n    \"First, we generate a random number from a uniform distribution; if the random number is\\n\",\n    \"less than epsilon, we pull a random arm, otherwise we pull the best arm, the one with the\\n\",\n    \"maximum average reward, as shown below: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": 
[\n    \"def epsilon_greedy(epsilon):\\n\",\n    \"    \\n\",\n    \"    if np.random.uniform(0,1) < epsilon:\\n\",\n    \"        return env.action_space.sample()\\n\",\n    \"    else:\\n\",\n    \"        return np.argmax(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Start pulling the arm\\n\",\n    \"\\n\",\n    \"Now, let's play the game and try to find the best arm using the epsilon-greedy method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(num_rounds):\\n\",\n    \"    \\n\",\n    \"    #select the arm based on the epsilon-greedy method\\n\",\n    \"    arm = epsilon_greedy(0.5)\\n\",\n    \"\\n\",\n    \"    #pull the arm and store the reward and next state information\\n\",\n    \"    next_state, reward, done, info = env.step(arm) \\n\",\n    \"\\n\",\n    \"    #increment the count of the arm by 1\\n\",\n    \"    count[arm] += 1\\n\",\n    \"    \\n\",\n    \"    #update the sum of rewards of the arm\\n\",\n    \"    sum_rewards[arm]+=reward\\n\",\n    \"\\n\",\n    \"    #update the average reward of the arm\\n\",\n    \"    Q[arm] = sum_rewards[arm]/count[arm]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all the rounds, we look at the average reward obtained from each of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.77631579 0.20833333]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can select the optimal arm as the one which has a maximum average reward. 
Since arm 1 has a higher average reward than arm 2, our optimal arm is\n",\n    "arm 1. "\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"The optimal arm is arm 1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('The optimal arm is arm {}'.format(np.argmax(Q)+1))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.06. Implementing Softmax Exploration-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing Softmax Exploration\\n\",\n    \"\\n\",\n    \"Now, let's learn how to implement softmax exploration to find the best arm.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import gym_bandits\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the bandit environment\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's take the same two-armed bandit we saw in the epsilon-greedy section: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"BanditTwoArmedHighLowFixed-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check the probability distribution of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.8, 0.2]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.p_dist)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can observe that with arm 1 we win the game with 80% probability and with arm 2 we\\n\",\n    \"win the game with 20% probability. Here, the best arm is arm 1, as with arm 1 we win the\\n\",\n    \"game with 80% probability. Now, let's see how to find this best arm using the softmax exploration\\n\",\n    \"method. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Initialize the variables\\n\",\n    \"\\n\",\n    \"First, let's initialize the variables:\\n\",\n    \"\\n\",\n    \"Initialize the `count` for storing the number of times an arm is pulled:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"count = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `sum_rewards` for storing the sum of rewards of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sum_rewards = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `Q` for storing the average reward of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define `num_rounds` - number of rounds (iterations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"num_rounds = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the softmax exploration function\\n\",\n    \"\\n\",\n    \"Now, let's define the softmax function with temperature `T` as:\\n\",\n    \"\\n\",\n    \"$$P_t(a) = \\\\frac{\\\\text{exp}(Q_t(a)/T)} {\\\\sum_{i=1}^n \\\\text{exp}(Q_t(i)/T)} $$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def softmax(T):\\n\",\n    \"    
\\n\",\n    \"    #compute the probability of each arm based on the above equation\\n\",\n    \"    denom = sum([np.exp(i/T) for i in Q]) \\n\",\n    \"    probs = [np.exp(i/T)/denom for i in Q]\\n\",\n    \"    \\n\",\n    \"    #select the arm based on the computed probability distribution of arms\\n\",\n    \"    arm = np.random.choice(env.action_space.n, p=probs)\\n\",\n    \"    \\n\",\n    \"    return arm\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Start pulling the arm\\n\",\n    \"\\n\",\n    \"Now, let's play the game and try to find the best arm using the softmax exploration method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's begin by setting the temperature `T` to a high number, say 50:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"T = 50\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(num_rounds):\\n\",\n    \"    \\n\",\n    \"    #select the arm based on the softmax exploration method\\n\",\n    \"    arm = softmax(T)\\n\",\n    \"\\n\",\n    \"    #pull the arm and store the reward and next state information\\n\",\n    \"    next_state, reward, done, info = env.step(arm) \\n\",\n    \"\\n\",\n    \"    #increment the count of the arm by 1\\n\",\n    \"    count[arm] += 1\\n\",\n    \"    \\n\",\n    \"    #update the sum of rewards of the arm\\n\",\n    \"    sum_rewards[arm]+=reward\\n\",\n    \"\\n\",\n    \"    #update the average reward of the arm\\n\",\n    \"    Q[arm] = sum_rewards[arm]/count[arm]\\n\",\n    \"    \\n\",\n    \"    #reduce the temperature\\n\",\n    \"    T = T*0.99\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all 
the rounds, we look at the average reward obtained from each of the arms:"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.84090909 0.17857143]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can select the optimal arm as the one with the maximum average reward. Since arm 1 has a higher average reward than arm 2, our optimal arm is\\n\",\n    \"arm 1. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"The optimal arm is arm 1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('The optimal arm is arm {}'.format(np.argmax(Q)+1))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.08. Implementing UCB-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing UCB\\n\",\n    \"\\n\",\n    \"Now, let's learn how to implement the UCB algorithm to find the best arm.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import gym_bandits\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the bandit environment\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's take the same two-armed bandit we saw in the epsilon-greedy section: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"BanditTwoArmedHighLowFixed-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check the probability distribution of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.8, 0.2]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.p_dist)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can observe that with arm 1 we win the game with 80% probability and with arm 2 we\\n\",\n    \"win the game with 20% probability. Here, the best arm is arm 1, as with arm 1 we win the\\n\",\n    \"game with 80% probability. 
Now, let's see how to find this best arm using the UCB method."\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Initialize the variables\\n\",\n    \"\\n\",\n    \"First, let's initialize the variables:\\n\",\n    \"\\n\",\n    \"Initialize the `count` for storing the number of times an arm is pulled:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"count = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `sum_rewards` for storing the sum of rewards of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sum_rewards = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize `Q` for storing the average reward of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define `num_rounds` - number of rounds (iterations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"num_rounds = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the UCB function\\n\",\n    \"\\n\",\n    \"Now, we define the `UCB` function, which returns the arm that has the\\n\",\n    \"highest upper confidence bound (UCB): \\n\",\n    \"\\n\",\n    \"$$ \\\text{UCB}(a) = Q(a) + \\\sqrt{\\\frac{2 \\\log(t)}{N(a)}}  --- (1) $$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   
\"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def UCB(i):\\n\",\n    \"    \\n\",\n    \"    #initialize the numpy array for storing the UCB of all the arms\\n\",\n    \"    ucb = np.zeros(2)\\n\",\n    \"    \\n\",\n    \"    #before computing the UCB, we explore all the arms at least once, so for the first 2 rounds,\\n\",\n    \"    #we directly select the arm corresponding to the round number\\n\",\n    \"    if i < 2:\\n\",\n    \"        return i\\n\",\n    \"    \\n\",\n    \"    #if the round is greater than 10 then, we compute the UCB of all the arms as specified in the\\n\",\n    \"    #equation (1) and return the arm which has the highest UCB:\\n\",\n    \"    else:\\n\",\n    \"        for arm in range(2):\\n\",\n    \"            ucb[arm] = Q[arm] + np.sqrt((2*np.log(sum(count))) / count[arm])\\n\",\n    \"        return (np.argmax(ucb))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Start pulling the arm\\n\",\n    \"\\n\",\n    \"Now, let's play the game and try to find the best arm using the UCB method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(num_rounds):\\n\",\n    \"    \\n\",\n    \"    #select the arm based on the UCB method\\n\",\n    \"    arm = UCB(i)\\n\",\n    \"\\n\",\n    \"    #pull the arm and store the reward and next state information\\n\",\n    \"    next_state, reward, done, info = env.step(arm) \\n\",\n    \"\\n\",\n    \"    #increment the count of the arm by 1\\n\",\n    \"    count[arm] += 1\\n\",\n    \"    \\n\",\n    \"    #update the sum of rewards of the arm\\n\",\n    \"    sum_rewards[arm]+=reward\\n\",\n    \"\\n\",\n    \"    #update the average reward of the arm\\n\",\n    \"    Q[arm] = sum_rewards[arm]/count[arm]\\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": 
{},\n   \"source\": [\n    \"After all the rounds, we look at the average reward obtained from each of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.72289157 0.33333333]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can select the optimal arm as the one which has a maximum average reward. Since the arm 1 has a maximum average reward than the arm 2, our optimal arm will be\\n\",\n    \"arm 1. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"The optimal arm is arm 1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('The optimal arm is arm {}'.format(np.argmax(Q)+1))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.1-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# The MAB problem\\n\",\n    \"\\n\",\n    \"The MAB problem is one of the classic problems in reinforcement learning. A MAB\\n\",\n    \"is a slot machine where we pull the arm (lever) and get a payout (reward) based on\\n\",\n    \"some probability distribution. A single slot machine is called a one-armed bandit and\\n\",\n    \"when there are multiple slot machines it is called a MAB or k-armed bandit, where k\\n\",\n    \"denotes the number of slot machines.\\n\",\n    \"\\n\",\n    \"The following figure shows a 3-armed bandit:\\n\",\n    \"\\n\",\n    \"[IMAGE]\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Slot machines are one of the most popular games in the casino, where we pull the\\n\",\n    \"arm and get a reward. If we get 0 reward then we lose the game, and if we get +1\\n\",\n    \"reward then we win the game. There can be several slot machines, and each slot\\n\",\n    \"machine is referred to as an arm. For instance, slot machine 1 is referred to as arm\\n\",\n    \"1, slot machine 2 is referred to as arm 2, and so on. Thus, whenever we say arm n,\\n\",\n    \"it actually means that we are referring to slot machine n.\\n\",\n    \"\\n\",\n    \"Each arm has its own probability distribution indicating the probability of winning\\n\",\n    \"and losing the game. For example, let's suppose we have two arms. Let the\\n\",\n    \"probability of winning if we pull arm 1 (slot machine 1) be 0.7 and the probability\\n\",\n    \"of winning if we pull arm 2 (slot machine 2) be 0.5.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Then, if we pull arm 1, 70% of the time we win the game and get the +1 reward, and\\n\",\n    \"if we pull arm 2, then 50% of the time we win the game and get the +1 reward.\\n\",\n    \"Thus, we can say that pulling arm 1 is desirable as it makes us win the game 70% of\\n\",\n    \"the time. 
However, this probability distribution of the arm (slot machine) will not\\n\",\n    \"be given to us. We need to find out which arm helps us to win the game most of the\\n\",\n    \"time and gives us a good reward.\\n\",\n    \"\\n\",\n    \"Okay, how can we find this?\\n\",\n    \"\\n\",\n    \"Say we pulled arm 1 once and received a +1 reward, and we pulled arm 2 once\\n\",\n    \"and received a 0 reward. Since arm 1 gives a +1 reward, we cannot come to the\\n\",\n    \"conclusion that arm 1 is the best arm immediately after pulling it only once. We need\\n\",\n    \"to pull both of the arms many times and compute the average reward we obtain from\\n\",\n    \"each of the arms, and then we can select the arm that gives the maximum average\\n\",\n    \"reward as the best arm.\\n\",\n    \"\\n\",\n    \"Let's denote an arm by $a$ and define the average reward obtained by pulling arm $a$ as:\\n\",\n    \"\\n\",\n    \"$$ Q(a) = \\\\frac{\\\\text{Sum of rewards obtained from the arm}}{\\\\text{Number of times the arm was pulled}}$$\\n\",\n    \"\\n\",\n    \"Where $Q(a)$ denotes the average reward of arm $a$.\\n\",\n    \"The optimal arm $a^*$ is the one that gives us the maximum average reward, that is:\\n\",\n    \"\\n\",\n    \"$$ a^* = \\\\text{arg} \\\\max_a Q(a) $$\\n\",\n    \"\\n\",\n    \"Okay, we have learned that the arm that gives the maximum average reward is the\\n\",\n    \"optimal arm. But how can we find this?\\n\",\n    \"\\n\",\n    \"We play the game for several rounds and we can pull only one arm in each round.\\n\",\n    \"Say in the first round we pull arm 1 and observe the reward, and in the second round\\n\",\n    \"we pull arm 2 and observe the reward. Similarly, in every round, we keep pulling\\n\",\n    \"arm 1 or arm 2 and observe the reward. 
After completing several rounds of the\\n\",\n    \"game, we compute the average reward of each of the arms, and then we select the\\n\",\n    \"arm that has the maximum average reward as the best arm.\\n\",\n    \"\\n\",\n    \"But this is not a good approach to find the best arm. Say we have 20 arms; if we keep\\n\",\n    \"pulling a different arm in each round, then in most of the rounds we will lose the\\n\",\n    \"game and get a 0 reward. Along with finding the best arm, our goal should be to\\n\",\n    \"minimize the cost of identifying the best arm, and this is usually referred to as regret.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, we need to find the best arm while minimizing regret. That is, we need to find\\n\",\n    \"the best arm, but we don't want to end up selecting the arms that make us lose the\\n\",\n    \"game in most of the rounds.\\n\",\n    \"\\n\",\n    \"So, should we explore a different arm in each round, or should we select only the\\n\",\n    \"arm that got us a good reward in the previous rounds? This leads to a situation\\n\",\n    \"called the exploration-exploitation dilemma, which we learned about in Chapter\\n\",\n    \"4, Monte Carlo Methods. So, to resolve this, we use the epsilon-greedy method and\\n\",\n    \"select the arm that got us a good reward in the previous rounds with probability\\n\",\n    \"1-epsilon and select a random arm with probability epsilon. After completing\\n\",\n    \"several rounds, we select the best arm as the one that has the maximum average\\n\",\n    \"reward.\\n\",\n    \"\\n\",\n    \"Similar to the epsilon-greedy method, there are several different exploration\\n\",\n    \"strategies that help us to overcome the exploration-exploitation dilemma. 
In the\\n\",\n    \"upcoming section, we will learn more about several different exploration strategies\\n\",\n    \"in detail and how they help us to find the optimal arm, but first let's look at\\n\",\n    \"creating a bandit.\\n\",\n    \"\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.10. Implementing Thompson Sampling-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing Thompson sampling\\n\",\n    \"\\n\",\n    \"Now, let's learn how to implement the Thompson sampling method to find the best arm.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import gym_bandits\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the bandit environment\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's take the same two-armed bandit we saw in the previous section: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"BanditTwoArmedHighLowFixed-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check the probability distribution of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.8, 0.2]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.p_dist)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can observe that with arm 1 we win the game with 80% probability and with arm 2 we\\n\",\n    \"win the game with 20% probability. Here, the best arm is arm 1, as with arm 1 we win the\\n\",\n    \"game with 80% probability. 
Now, let's see how to find this best arm using the Thompson sampling method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Initialize the variables\\n\",\n    \"\\n\",\n    \"First, let's initialize the variables:\\n\",\n    \"\\n\",\n    \"Initialize the `count` for storing the number of times an arm is pulled:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"count = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `sum_rewards` for storing the sum of rewards of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sum_rewards = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `Q` for storing the average reward of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define `num_rounds`, the number of rounds (iterations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"num_rounds = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `alpha` value with 1 for both the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"alpha = np.ones(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `beta` value with 1 for 
both the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"beta = np.ones(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the Thompson Sampling function \\n\",\n    \"\\n\",\n    \"Now, let's define the `thompson_sampling` function.\\n\",\n    \"\\n\",\n    \"As shown below, we randomly sample a value from the beta distribution of each of the arms\\n\",\n    \"and return the arm which has the maximum sampled value: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def thompson_sampling(alpha, beta):\\n\",\n    \"    \\n\",\n    \"    #sample from Beta(alpha, beta) of each arm; alpha and beta already hold the prior of 1\\n\",\n    \"    samples = [np.random.beta(alpha[i], beta[i]) for i in range(2)]\\n\",\n    \"\\n\",\n    \"    return np.argmax(samples)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Start pulling the arm\\n\",\n    \"\\n\",\n    \"Now, let's play the game and try to find the best arm using the Thompson sampling\\n\",\n    \"method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(num_rounds):\\n\",\n    \"    \\n\",\n    \"    #select the arm based on the Thompson sampling method\\n\",\n    \"    arm = thompson_sampling(alpha, beta)\\n\",\n    \"\\n\",\n    \"    #pull the arm and store the reward and next state information\\n\",\n    \"    next_state, reward, done, info = env.step(arm) \\n\",\n    \"\\n\",\n    \"    #increment the count of the arm by 1\\n\",\n    \"    count[arm] += 1\\n\",\n    \"    \\n\",\n    \"    #update the sum of rewards of the arm\\n\",\n    \"    sum_rewards[arm] += reward\\n\",\n    \"\\n\",\n    \"    #update the average reward of the arm\\n\",\n    \"    Q[arm] = 
sum_rewards[arm]/count[arm]\\n\",\n    \"\\n\",\n    \"    #if we win the game, that is, if the reward is equal to 1, then we update the value of alpha as \\n\",\n    \"    #alpha = alpha + 1 else we update the value of beta as beta = beta + 1\\n\",\n    \"    if reward == 1:\\n\",\n    \"        alpha[arm] = alpha[arm] + 1\\n\",\n    \"    else:\\n\",\n    \"        beta[arm] = beta[arm] + 1\\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all the rounds, we look at the average reward obtained from each of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.77659574 0.33333333]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can select the optimal arm as the one which has the maximum average reward. Since arm 1 has a higher average reward than arm 2, our optimal arm will be\\n\",\n    \"arm 1. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"The optimal arm is arm 1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('The optimal arm is arm {}'.format(np.argmax(Q)+1))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/.ipynb_checkpoints/6.12. Finding the Best Advertisement Banner using Bandits-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Finding the best advertisement banner using bandits\\n\",\n    \"\\n\",\n    \"In this section, let's understand how to find the best advertisement banner using\\n\",\n    \"bandits. Suppose we are running a website and we have five different banners for a single\\n\",\n    \"advertisement that is shown on our website, and say we want to figure out which\\n\",\n    \"advertisement banner is most liked by the users.\\n\",\n    \"\\n\",\n    \"We can frame this problem as a multi-armed bandit problem. The five advertisement\\n\",\n    \"banners represent the five arms of the bandit and we assign +1 reward if the user clicks the\\n\",\n    \"advertisement and 0 reward if the user does not click the advertisement. So, to find out\\n\",\n    \"which advertisement banner is most clicked by the user, that is, which advertisement\\n\",\n    \"banner can give us the maximum reward, we can use various exploration strategies. In this\\n\",\n    \"section, let's just use an epsilon-greedy method to find the best advertisement\\n\",\n    \"banner.\\n\",\n    \"\\n\",\n    \"First, let us import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import seaborn as sns\\n\",\n    \"\\n\",\n    \"%matplotlib inline\\n\",\n    \"plt.style.use('ggplot')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a dataset\\n\",\n    \"\\n\",\n    \"Now, let's create a dataset. 
We generate a dataset with five columns denoting the five\\n\",\n    \"advertisement banners and we generate 100000 rows, where the value in each row will be\\n\",\n    \"either 0 or 1, indicating whether the advertisement banner has been clicked (1) or not\\n\",\n    \"clicked (0) by the user:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = pd.DataFrame()\\n\",\n    \"for i in range(5):\\n\",\n    \"    df['Banner_type_'+str(i)] = np.random.randint(0,2,100000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's look at the first few rows of our dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Banner_type_0</th>\\n\",\n       \"      <th>Banner_type_1</th>\\n\",\n       \"      <th>Banner_type_2</th>\\n\",\n       \"      <th>Banner_type_3</th>\\n\",\n       \"      <th>Banner_type_4</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"     
 <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Banner_type_0  Banner_type_1  Banner_type_2  Banner_type_3  Banner_type_4\\n\",\n       \"0              0              1              0              1              1\\n\",\n       \"1              1              1              1              0              0\\n\",\n       \"2              0              1              1              0              0\\n\",\n       \"3              1              1              1              0              0\\n\",\n       \"4              0              1              0              0              0\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As 
we can observe, we have the 5 advertisement banners (0 to 4) and\\n\",\n    \"the rows consist of the value 0 or 1, indicating whether the banner has been clicked (1) or not clicked (0). \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Initialize the variables\\n\",\n    \"\\n\",\n    \"Now, let's initialize some of the important variables:\\n\",\n    \"\\n\",\n    \"Set the number of iterations:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_iterations = 100000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of banners:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_banner = 5\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize count for storing the number of times the banner was selected:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"count = np.zeros(num_banner)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize sum_rewards for storing the sum of rewards obtained from each banner: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sum_rewards = np.zeros(num_banner)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize Q for storing the mean reward of each banner:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = np.zeros(num_banner)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   
\"metadata\": {},\n   \"source\": [\n    \"Define the list for storing the selected banners:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"banner_selected = []\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Define the epsilon-greedy method\\n\",\n    \"\\n\",\n    \"Now, let's define the epsilon-greedy method. We generate a random value from a uniform\\n\",\n    \"distribution. If the random value is less than epsilon, then we select the random banner else\\n\",\n    \"we select the best banner which has a maximum average reward:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def epsilon_greedy_policy(epsilon):\\n\",\n    \"    \\n\",\n    \"    if np.random.uniform(0,1) < epsilon:\\n\",\n    \"        return  np.random.choice(num_banner)\\n\",\n    \"    else:\\n\",\n    \"        return np.argmax(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Run the bandit test\\n\",\n    \"\\n\",\n    \"Now, we run the epsilon-greedy policy to understand which is the best advertisement\\n\",\n    \"banner:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each iteration\\n\",\n    \"for i in range(num_iterations):\\n\",\n    \"    \\n\",\n    \"    #select the banner using the epsilon-greedy policy\\n\",\n    \"    banner = epsilon_greedy_policy(0.5)\\n\",\n    \"    \\n\",\n    \"    #get the reward of the banner\\n\",\n    \"    reward = df.values[i, banner]\\n\",\n    \"    \\n\",\n    \"    #increment the counter\\n\",\n    \"    count[banner] += 1\\n\",\n    \"    \\n\",\n    \"    #store the sum of rewards\\n\",\n    \"    sum_rewards[banner]+=reward\\n\",\n    \"    
\\n\",\n    \"    #compute the average reward\\n\",\n    \"    Q[banner] = sum_rewards[banner]/count[banner]\\n\",\n    \"    \\n\",\n    \"    #store the banner to the banner selected list\\n\",\n    \"    banner_selected.append(banner)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all the rounds, we can select the best banner as the one which has the maximum\\n\",\n    \"average reward:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"The optimal banner is banner 4\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print( 'The optimal banner is banner {}'.format(np.argmax(Q)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also plot and see which banner is selected most of the times:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"image/png\": 
\"iVBORw0KGgoAAAANSUhEUgAAAZQAAAEJCAYAAACzPdE9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3df1DTd57H8WfCDxWjmATFlWIrKnPiykBNt2BbRMzVndLZc9TaaWu72rq2g6vTuuP4o3uyO3t69CiFcotnf1i6ve72x1nr7c3tXO9YRpkr6zYWwva0rVDrdTlBJIliEIuQ3B+2aV1BUb8mAq/HX8kn32++7893El58vp9vvl9TMBgMIiIico3MkS5ARESGBgWKiIgYQoEiIiKGUKCIiIghFCgiImIIBYqIiBgiOtIFRNqxY8ciXYKIyKAyadKkPts1QhEREUOEZYTS3d1NYWEhPT099Pb2kpWVxdKlS6moqODQoUPExcUBsHr1am655RaCwSCVlZXU19czYsQICgoKSElJAWDv3r3s3r0bgEWLFpGbmwvAkSNHqKiooLu7m8zMTFasWIHJZApH90REhDAFSkxMDIWFhYwcOZKenh62bNlCRkYGAA8//DBZWVkXLF9fX09rayvl5eU0Njby8ssvs23bNvx+P7t27aKoqAiAjRs34nA4sFgsvPTSSzz++ONMnz6dv//7v8ftdpOZmRmO7omICGE65GUymRg5ciQAvb299Pb2XnL0cODAAXJycjCZTKSmptLZ2YnP58PtdpOeno7FYsFisZCeno7b7cbn89HV1UVqaiomk4mcnBxcLlc4uiYiIl8J26R8IBBgw4YNtLa2smDBAqZPn85//ud/8sYbb7Br1y6++93v8tBDDxETE4PX6yUhISG0rt1ux+v14vV6sdvtoXabzdZn+9fL96WqqoqqqioAioqKLtiOiIhcvbAFitlspri4mM7OTp599lm++OILHnzwQcaNG0dPTw8vvPAC//qv/8qSJUvo63qV/Y1oTCZTn8v3x+l04nQ6Q8/b29uvvDMiIsPYDXOW1+jRo0lLS8PtdmO1WjGZTMTExDBv3jyampqA8yOMb/+h93g8WK1WbDYbHo8n1O71erFardjt9gvaPR4PNpstfJ0SEZHwBEpHRwednZ3A+TO+PvroI5KSkvD5fAAEg0FcLhfJyckAOBwOampqCAaDHD58mLi4OKxWKxkZGTQ0NOD3+/H7/TQ0NJCRkYHVamXUqFEcPnyYYDBITU0NDocjHF0TEZGvhOWQl8/no6KigkAgQDAYJDs7m9mzZ/Pzn/+cjo4OAG6++WZWrVoFQGZmJnV1daxdu5bY2FgKCgoAsFgsLF68mE2bNgGwZMkSLBYLACtXrmT79u10d3eTkZGhM7xERMLMNNxvsKVfyovItTB/XBLpEq6LwIyf9PvaDTOHIiIiQ5MCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARERFDKFBERMQQChQRETGEAkVERAyhQBEREUMoUERExBAKFBERMYQCRUREDKFAERERQ4TtnvIig917v22JdAnXxYIffCfSJcgQoRGKiIgYQoEiIiKGUKCIiIghFCgiImIIBYqIiBgiLGd5dXd3U1hYSE9PD729vWRlZbF06VLa2tooKyvD7/czZcoU1qxZQ3R0NOfOneOXv/wlR44cYcyYMTz55JNMmDABgHfffZfq6mrMZjMrVqwgIyMDALfbTWVlJYFAgPnz57Nw4cJwdE1ERL4SlhFKTEwMhYWFFBcX8w//8A+43W4OHz7M66+/Tn5+PuXl5YwePZrq6moAqqurGT16NP/4j/9Ifn4+v/71rwFobm6mtraW5557jqeffpqdO3cSCAQIBALs3LmTzZs3U1payvvvv09zc3M4uiYiIl8JS6CYTCZGjhwJQG9vL729vZhMJg4ePEhWVhYAubm5uFwuAA4cOEBubi4AWVlZ/M///A
/BYBCXy8WcOXOIiYlhwoQJTJw4kaamJpqampg4cSKJiYlER0czZ86c0HuJiEh4hO2HjYFAgA0bNtDa2sqCBQtITEwkLi6OqKgoAGw2G16vFwCv14vdbgcgKiqKuLg4Tp8+jdfrZfr06aH3/PY6Xy//9ePGxsY+66iqqqKqqgqAoqIiEhISjO+sDFFD84eN+g5cG2+kC7hOruZzEbZAMZvNFBcX09nZybPPPsv//d//9btsMBi8qM1kMvXZfqnl++J0OnE6naHn7e3tlytdZEjTd+DaDNUzmy71uZg0aVKf7WHfF6NHjyYtLY3GxkbOnDlDb28vcH5UYrPZgPMjDI/HA5w/RHbmzBksFssF7d9e5y/bPR4PVqs1jL0SEZGwBEpHRwednZ3A+TO+PvroI5KSkpg5cyb79+8HYO/evTgcDgBmz57N3r17Adi/fz8zZ87EZDLhcDiora3l3LlztLW10dLSwrRp05g6dSotLS20tbXR09NDbW1t6L1ERCQ8wnLIy+fzUVFRQSAQIBgMkp2dzezZs7npppsoKyvjzTffZMqUKeTl5QGQl5fHL3/5S9asWYPFYuHJJ58EIDk5mezsbNatW4fZbOaxxx7DbD6fiY8++ihbt24lEAgwb948kpOTw9E1ERH5iinY38TEMHHs2LFIlyCDhK42LH0xf1wS6RKui8CMn/T72g0zhyIiIkOTAkVERAyhQBEREUMoUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARERFDKFBERMQQChQRETGEAkVERAyhQBEREUMoUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExRHQ4NtLe3k5FRQUnT57EZDLhdDq55557ePvtt/n973/P2LFjAXjggQe49dZbAXj33Xeprq7GbDazYsUKMjIyAHC73VRWVhIIBJg/fz4LFy4EoK2tjbKyMvx+P1OmTGHNmjVER4eleyIiQpgCJSoqiocffpiUlBS6urrYuHEj6enpAOTn5/ODH/zgguWbm5upra3lueeew+fz8Ytf/ILnn38egJ07d/LTn/4Uu93Opk2bcDgc3HTTTbz++uvk5+dzxx138OKLL1JdXc3dd98dju6JiAhhOuRltVpJSUkBYNSoUSQlJeH1evtd3uVyMWfOHGJiYpgwYQITJ06kqamJpqYmJk6cSGJiItHR0cyZMweXy0UwGOTgwYNkZWUBkJubi8vlCkfXRETkK2E/JtTW1sbnn3/OtGnT+OSTT3jvvfeoqakhJSWFRx55BIvFgtfrZfr06aF1bDZbKIDsdnuo3W6309jYyOnTp4mLiyMqKuqi5f9SVVUVVVVVABQVFZGQkHC9uipDTkukC7gu9B24Nv3/azy4Xc3nIqyBcvbsWUpKSli+fDlxcXHcfffdLFmyBIC33nqL1157jYKCAoLBYJ/r99VuMpmuqAan04nT6Qw9b29vv6L1RYYafQeuzVA9s+lSn4tJkyb12R62fdHT00NJSQl33XUXt99+OwDjxo3DbDZjNpuZP38+n332GXB+5OHxeELrer1ebDbbRe0ejwer1cqYMWM4c+YMvb29FywvIiLhE5ZACQaD7Nixg6SkJO69995Qu8/nCz3+4IMPSE5OBsDhcFBbW8u5c+doa2ujpaWFadOmMXXqVFpaWmhra6Onp4fa2locDgcmk4mZM2eyf/9+APbu3YvD4QhH10RE5CthOeT16aefUlNTw+TJk1m/fj1w/hTh999/n6NHj2IymRg/fjyrVq0CIDk5mezsbNatW4fZbOaxxx7DbD6ffY8++ihbt24lEAgwb968UAg99NBDlJWV8eabbzJlyhTy8vLC0TUREfmKKdjfhMUwcezYsUiXIIPEe78dmpPyC37wnUiXMKiZPy6JdAnXRWDGT/p9LeJzKCIiMrQpUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARER
FDKFBERMQQChQRETGEAkVERAyhQBEREUMoUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARERFDKFBERMQQ0QNd8A9/+APZ2dkXte/fv5+srKxLrtve3k5FRQUnT57EZDLhdDq555578Pv9lJaWcuLECcaPH89TTz2FxWIhGAxSWVlJfX09I0aMoKCggJSUFAD27t3L7t27AVi0aBG5ubkAHDlyhIqKCrq7u8nMzGTFihWYTKaBdk9ERK7RgEcoO3bs6LP9hRdeuOy6UVFRPPzww5SWlrJ161bee+89mpub2bNnD7NmzaK8vJxZs2axZ88eAOrr62ltbaW8vJxVq1bx8ssvA+D3+9m1axfbtm1j27Zt7Nq1C7/fD8BLL73E448/Tnl5Oa2trbjd7oF2TUREDHDZQDl+/DjHjx8nEAjQ1tYWen78+HH+9Kc/ERsbe9mNWK3W0Ahj1KhRJCUl4fV6cblczJ07F4C5c+ficrkAOHDgADk5OZhMJlJTU+ns7MTn8+F2u0lPT8disWCxWEhPT8ftduPz+ejq6iI1NRWTyUROTk7ovUREJDwue8hr7dq1ocdr1qy54LVx48Zx3333XdEG29ra+Pzzz5k2bRqnTp3CarUC50Ono6MDAK/XS0JCQmgdu92O1+vF6/Vit9tD7Tabrc/2r5fvS1VVFVVVVQAUFRVdsB2RS2uJdAHXhb4D16bvvzSD39V8Li4bKG+99RYAhYWF/PznP7/yqr7l7NmzlJSUsHz5cuLi4vpdLhgMXtTW33yIyWTqc/n+OJ1OnE5n6Hl7e/uA1xUZivQduDZD9cymS30uJk2a1Gf7gPfFtYZJT08PJSUl3HXXXdx+++0AxMfH4/P5APD5fIwdOxY4P8L4dmc8Hg9WqxWbzYbH4wm1e71erFYrdrv9gnaPx4PNZrumekVE5MoM+CyvtrY23njjDY4ePcrZs2cveO2f/umfLrluMBhkx44dJCUlce+994baHQ4H+/btY+HChezbt4/bbrst1P4f//Ef3HHHHTQ2NhIXF4fVaiUjI4M33ngjNBHf0NDAgw8+iMViYdSoURw+fJjp06dTU1PD97///QHvBBERuXYDDpTnn3+exMREHnnkEUaMGHFFG/n000+pqalh8uTJrF+/HoAHHniAhQsXUlpaSnV1NQkJCaxbtw6AzMxM6urqWLt2LbGxsRQUFABgsVhYvHgxmzZtAmDJkiVYLBYAVq5cyfbt2+nu7iYjI4PMzMwrqlFERK6NKTjACYgf/vCHVFZWYjYPrSOGx44di3QJMki899uhOSm/4AffiXQJg5r545JIl3BdBGb8pN/XrnkOZcaMGRw9evSKixIRkeFhwIe8xo8fz9atW/ne977HuHHjLnjt/vvvN7wwEREZXAYcKF9++SWzZ8+mt7f3gjOqRERE4AoC5euJcRERkb4MOFCOHz/e72uJiYmGFCMiIoPXgAPl25dg+Utf/5peRESGrwEHyl+GxsmTJ/mXf/kXZsyYYXhRIiIy+Fz1j0rGjRvH8uXL+c1vfmNkPSIiMkhd068Ujx07xpdffmlULSIiMogN+JDXli1bLrji75dffsmf//xnlixZcl0KExGRwWXAgZKXl3fB85EjR3LzzTfzne/osg0iInIFgfL1vdtFRET6MuBA6enpYffu3dTU1ODz+bBareTk5LBo0SKiowf8NiIiMkQNOAlef/11PvvsM370ox8xfvx4Tpw4wTvvvMOZM2dYvnz5dSxRREQGgwEHyv79+ykuLmbMmDHA+csXT5kyhfXr1ytQRERk4KcNX8l920VEZPgZ8AglOzubZ555hiVLlpCQkEB7ezvvvPMOWVlZ17M+EREZJAYcKMuWLeOdd95h586d+Hw+bDYbd9xxB4sXL76e9YmIyCBx2UD55JNPOHDgAMuWLeP++++/4GZar7/+OkeOHCE1NfW6FikiIje+y86hvPvuu6SlpfX52ne/+11279
5teFEiIjL4XDZQjh49SkZGRp+vzZo1i88//9zwokREZPC57CGvrq4uenp6iI2Nvei13t5eurq6LruR7du3U1dXR3x8PCUlJQC8/fbb/P73v2fs2LEAPPDAA9x6663A+VFRdXU1ZrOZFStWhALN7XZTWVlJIBBg/vz5LFy4EIC2tjbKysrw+/1MmTKFNWvW6MeWIiJhdtkRSlJSEg0NDX2+1tDQQFJS0mU3kpuby+bNmy9qz8/Pp7i4mOLi4lCYNDc3U1tby3PPPcfTTz/Nzp07CQQCBAIBdu7cyebNmyktLeX999+nubkZOD+Xk5+fT3l5OaNHj6a6uvqyNYmIiLEuGyj5+fm8+OKL/PGPfyQQCAAQCAT44x//yEsvvUR+fv5lN5KWlobFYhlQQS6Xizlz5hATE8OECROYOHEiTU1NNDU1MXHiRBITE4mOjmbOnDm4XC6CwSAHDx4Mnb6cm5uLy+Ua0LZERMQ4lz0udOedd3Ly5EkqKio4d+4cY8eOpaOjg9jYWO677z7uvPPOq974e++9R01NDSkpKTzyyCNYLBa8Xi/Tp08PLWOz2fB6vQDY7fZQu91up7GxkdOnTxMXF0dUVNRFy/elqqqKqqoqAIqKikhISLjq+mW4aYl0AdeFvgPXpv+/NoPb1XwuBjTRcO+995KXl8fhw4fx+/1YLBZSU1OJi4u74g1+7e677w7dS+Wtt97itddeo6CgoN9f5PfV/u37swyU0+nE6XSGnre3t1/xe4gMJfoOXJtrukvhDexSn4tJkyb12T7gmeu4uLh+z/a6GuPGjQs9nj9/Ps888wxwfuTh8XhCr3m9Xmw2G8AF7R6PB6vVypgxYzhz5gy9vb1ERUVdsLyIiIRPxMLV5/OFHn/wwQckJycD4HA4qK2t5dy5c7S1tdHS0sK0adOYOnUqLS0ttLW10dPTQ21tLQ6HA5PJxMyZM9m/fz8Ae/fuxeFwRKRPIiLDWVjOrS0rK+PQoUOcPn2aJ554gqVLl3Lw4EGOHj2KyWRi/PjxrFq1CoDk5GSys7NZt24dZrOZxx57DLP5fO49+uijbN26lUAgwLx580Ih9NBDD1FWVsabb77JlClTLrq7pIiIXH+m4DC/jPCxY8ciXYIMEu/9dmhOyi/4gW7jfS3MH5dEuoTrIjDjJ/2+1t8cylCdTxIRkTBToIiIiCEUKCIiYggFioiIGEKBIiIihlCgiIiIIRQoIiJiCAWKiIgYQoEiIiKGUKCIiIghdJ/cPrSsXxnpEq6L7xS/HOkSRGQI0whFREQMoUARERFDKFBERMQQmkORS1r+qz9EuoTr4tUfZke6BJEhRyMUERExhAJFREQMoUARERFDKFBERMQQChQRETGEAkVERAwRltOGt2/fTl1dHfHx8ZSUlADg9/spLS3lxIkTjB8/nqeeegqLxUIwGKSyspL6+npGjBhBQUEBKSkpAOzdu5fdu3cDsGjRInJzcwE4cuQIFRUVdHd3k5mZyYoVKzCZTOHomoiIfCUsI5Tc3Fw2b958QduePXuYNWsW5eXlzJo1iz179gBQX19Pa2sr5eXlrFq1ipdfPn/9Kb/fz65du9i2bRvbtm1j165d+P1+AF566SUef/xxysvLaW1txe12h6NbIiLyLWEZoaSlpdHW1nZBm8vl4mc/+xkAc+fO5Wc/+xnLli3jwIED5OTkYDKZSE1NpbOzE5/Px8GDB0lPT8disQCQnp6O2+1m5syZdHV1kZqaCkBOTg4ul4vMzMxwdE1kWCovL490CdfF2rVrI13CoBaxX8qfOnUKq9UKgNVqpaOjAwCv10tCQkJoObvdjtfrxev1YrfbQ+02m63P9q+X709VVRVVVVUAFBUVXbCtr7VcW9duWH31dbi6un0xND8Z+lx842r2Rf9/bQa3q9kXN9ylV4LB4EVt/c2HmEymPpe/FKfTidPpDD1vb2+/sgIHseHU18vRvviG9sU3rmZfDNUzmy61LyZNmtRne8T2RXx8PD
6fDwCfz8fYsWOB8yOMb3fE4/FgtVqx2Wx4PJ5Qu9frxWq1YrfbL2j3eDzYbLYw9UJERL4WsUBxOBzs27cPgH379nHbbbeF2mtqaggGgxw+fJi4uDisVisZGRk0NDTg9/vx+/00NDSQkZGB1Wpl1KhRHD58mGAwSE1NDQ6HI1LdEhEZtsJyyKusrIxDhw5x+vRpnnjiCZYuXcrChQspLS2lurqahIQE1q1bB0BmZiZ1dXWsXbuW2NhYCgoKALBYLCxevJhNmzYBsGTJktAE/cqVK9m+fTvd3d1kZGRoQl5EJALCEihPPvlkn+1btmy5qM1kMrFyZd+34M3LyyMvL++i9qlTp4Z+3yIiIpExVOeTREQkzBQoIiJiCAWKiIgYQoEiIiKGUKCIiIghFCgiImIIBYqIiBhCgSIiIoZQoIiIiCEUKCIiYggFioiIGEKBIiIihlCgiIiIIRQoIiJiCAWKiIgYQoEiIiKGUKCIiIghFCgiImIIBYqIiBhCgSIiIoZQoIiIiCGiI13A6tWrGTlyJGazmaioKIqKivD7/ZSWlnLixAnGjx/PU089hcViIRgMUllZSX19PSNGjKCgoICUlBQA9u7dy+7duwFYtGgRubm5EeyViMjwE/FAASgsLGTs2LGh53v27GHWrFksXLiQPXv2sGfPHpYtW0Z9fT2tra2Ul5fT2NjIyy+/zLZt2/D7/ezatYuioiIANm7ciMPhwGKxRKpLIiLDzg15yMvlcjF37lwA5s6di8vlAuDAgQPk5ORgMplITU2ls7MTn8+H2+0mPT0di8WCxWIhPT0dt9sdyS6IiAw7N8QIZevWrQD89V//NU6nk1OnTmG1WgGwWq10dHQA4PV6SUhICK1nt9vxer14vV7sdnuo3Waz4fV6w9gDERGJeKD84he/wGazcerUKf7u7/6OSZMm9btsMBi8qM1kMvW5bH/tVVVVVFVVAVBUVHRBQH2tZSCFD0J99XW4urp9MTQ/GfpcfONq9sVQ/df1avZFxAPFZrMBEB8fz2233UZTUxPx8fH4fD6sVis+ny80v2K322lvbw+t6/F4sFqt2Gw2Dh06FGr3er2kpaX1uT2n04nT6Qw9//b7DXXDqa+Xo33xDe2Lb1zNvrgh5w0McKl90d8//hHdF2fPnqWrqyv0+E9/+hOTJ0/G4XCwb98+APbt28dtt90GgMPhoKamhmAwyOHDh4mLi8NqtZKRkUFDQwN+vx+/309DQwMZGRkR65eIyHAU0RHKqVOnePbZZwHo7e3lzjvvJCMjg6lTp1JaWkp1dTUJCQmsW7cOgMzMTOrq6li7di2xsbEUFBQAYLFYWLx4MZs2bQJgyZIlOsNLRCTMIhooiYmJFBcXX9Q+ZswYtmzZclG7yWRi5cqVfb5XXl4eeXl5htcoIiIDM1QP/4mISJgpUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARERFDKFBERMQQChQRETGEAkVERAyhQBEREUMoUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARERFDKFBERMQQ0ZEuwEhut5vKykoCgQDz589n4cKFkS5JRGTYGDIjlEAgwM6dO9m8eTOlpaW8//77NDc3R7osEZFhY8gESlNTExMnTiQxMZHo6GjmzJmDy+WKdFkiIsOGKRgMBiNdhBH279+P2+3miSeeAKCmpobGxkYee+yxC5arqqqiqqoKgKKiorDXKSIyVA2ZEUpfuWgymS5qczqdFBUV3TBhsnHjxkiXcMPQvviG9sU3tC++caPviyETKHa7HY/HE3ru8XiwWq0RrEhEZHgZMoEydepUWlpaaGtro6enh9raWhwOR6TLEhEZNobMacNRUVE8+uijbN26lUAgwLx580hOTo50WZfldDojXcINQ/viG9oX39C++MaNvi+GzKS8iIhE1pA55CUiIpGlQBEREUMMmTmUwUiXijlv+/
bt1NXVER8fT0lJSaTLiaj29nYqKio4efIkJpMJp9PJPffcE+myIqK7u5vCwkJ6enro7e0lKyuLpUuXRrqsiAkEAmzcuBGbzXbDnj6sQImQry8V89Of/hS73c6mTZtwOBzcdNNNkS4t7HJzc/n+979PRUVFpEuJuKioKB5++GFSUlLo6upi48aNpKenD8vPRUxMDIWFhYwcOZKenh62bNlCRkYGqampkS4tIn73u9+RlJREV1dXpEvplw55RYguFfONtLQ0LBZLpMu4IVitVlJSUgAYNWoUSUlJeL3eCFcVGSaTiZEjRwLQ29tLb29vnz9WHg48Hg91dXXMnz8/0qVckkYoEeL1erHb7aHndrudxsbGCFYkN5q2tjY+//xzpk2bFulSIiYQCLBhwwZaW1tZsGAB06dPj3RJEfHqq6+ybNmyG3p0AhqhRMxALxUjw9PZs2cpKSlh+fLlxMXFRbqciDGbzRQXF7Njxw4+++wzvvjii0iXFHYffvgh8fHxoZHrjUwjlAjRpWKkPz09PZSUlHDXXXdx++23R7qcG8Lo0aNJS0vD7XYzefLkSJcTVp9++ikHDhygvr6e7u5uurq6KC8vZ+3atZEu7SIKlAj59qVibDYbtbW1N+QHRMIrGAyyY8cOkpKSuPfeeyNdTkR1dHQQFRXF6NGj6e7u5qOPPuJv/uZvIl1W2D344IM8+OCDABw8eJB/+7d/u2H/VihQImSwXirmeigrK+PQoUOcPn2aJ554gqVLl5KXlxfpsiLi008/paamhsmTJ7N+/XoAHnjgAW699dYIVxZ+Pp+PiooKAoEAwWCQ7OxsZs+eHemy5BJ06RURETGEJuVFRMQQChQRETGEAkVERAyhQBEREUMoUERExBAKFBERMYR+hyJigNWrV3Py5EnMZjPR0dGkpqbyox/9iISEhEiXJhI2GqGIGGTDhg388z//My+88ALx8fG88sorkS7pAr29vZEuQYY4jVBEDBYbG0tWVha/+tWvAKirq+PNN9/k+PHjxMXFMW/evNCNotra2vjxj39MQUEBb731Ft3d3eTn57No0SIA3n77bZqbm4mNjeWDDz4gISGB1atXM3XqVOD8VatfeeUVPv74Y1GSVAYAAAKOSURBVEaOHEl+fn7ohlxvv/02f/7zn4mJieHDDz/kkUceueEvfy6Dm0YoIgb78ssvqa2tDV1qfcSIEfz4xz+msrKSjRs38l//9V988MEHF6zzySef8Pzzz/O3f/u37Nq1i+bm5tBrH374IXPmzOHVV1/F4XCERj6BQIBnnnmGW265hRdeeIEtW7bwu9/9DrfbHVr3wIEDZGVlUVlZyV133RWG3stwphGKiEGKi4uJiori7NmzxMfH8/TTTwMwc+bM0DI333wzd9xxB4cOHeJ73/teqP2+++4jNjaWW265hZtvvpn//d//Dd2l8a/+6q9C1/LKycnh3//93wH47LPP6OjoYMmSJQAkJiYyf/58amtrycjIACA1NTW0ndjY2Ou8B2S4U6CIGGT9+vWkp6cTCARwuVwUFhZSWlrKiRMn+M1vfsMXX3xBT08PPT09ZGVlXbDuuHHjQo9HjBjB2bNnQ8/j4+NDj2NjYzl37hy9vb2cOHECn8/H8uXLQ68HAgFmzJgRev7tm7iJXG8KFBGDmc1mbr/9dl588UU++eQTfv3rX7NgwQI2bdpEbGwsr776Kh0dHde8nYSEBCZMmEB5ebkBVYtcO82hiBgsGAzicrno7OwkKSmJrq4uLBYLsbGxNDU18d///d+GbGfatGmMGjWKPXv20N3dTSAQ4IsvvqCpqcmQ9xe5UhqhiBjkmWeewWw2YzKZGD9+PKtXryY5OZmVK1fy2muv8corr5CWlkZ2djadnZ3XvD2z2cyGDRt47bXXWL16NT09PUyaNIn777/fgN6IXDndD0VERAyhQ14iImIIBYqIiBhCgSIiIoZQoIiIiCEUKCIiYggFioiIGEKBIiIihlCgiIiIIf
4fY7oCVy/J+GAAAAAASUVORK5CYII=\\n\",\n      \"text/plain\": [\n       \"<Figure size 432x288 with 1 Axes>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"ax = sns.countplot(banner_selected)\\n\",\n    \"ax.set(xlabel='Banner', ylabel='Count')\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Thus, we learned how to find the best advertisement banner by framing our problem as a multi-armed bandit problem.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/6.01 .The MAB Problem.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# The MAB problem\\n\",\n    \"\\n\",\n    \"The MAB problem is one of the classic problems in reinforcement learning. A MAB\\n\",\n    \"is a slot machine where we pull the arm (lever) and get a payout (reward) based on\\n\",\n    \"some probability distribution. A single slot machine is called a one-armed bandit and\\n\",\n    \"when there are multiple slot machines it is called a MAB or k-armed bandit, where k\\n\",\n    \"denotes the number of slot machines.\\n\",\n    \"\\n\",\n    \"The following figure shows a 3-armed bandit:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/7.png)\\n\",\n    \"\\n\",\n    \"Slot machines are one of the most popular games in the casino, where we pull the\\n\",\n    \"arm and get a reward. If we get 0 reward then we lose the game, and if we get +1\\n\",\n    \"reward then we win the game. There can be several slot machines, and each slot\\n\",\n    \"machine is referred to as an arm. For instance, slot machine 1 is referred to as arm\\n\",\n    \"1, slot machine 2 is referred to as arm 2, and so on. Thus, whenever we say arm n,\\n\",\n    \"it actually means that we are referring to slot machine n.\\n\",\n    \"\\n\",\n    \"Each arm has its own probability distribution indicating the probability of winning\\n\",\n    \"and losing the game. For example, let's suppose we have two arms. Let the\\n\",\n    \"probability of winning if we pull arm 1 (slot machine 1) be 0.7 and the probability\\n\",\n    \"of winning if we pull arm 2 (slot machine 2) be 0.5.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Then, if we pull arm 1, 70% of the time we win the game and get the +1 reward, and\\n\",\n    \"if we pull arm 2, then 50% of the time we win the game and get the +1 reward.\\n\",\n    \"Thus, we can say that pulling arm 1 is desirable as it makes us win the game 70% of\\n\",\n    \"the time. 
However, this probability distribution of the arm (slot machine) will not\\n\",\n    \"be given to us. We need to find out which arm helps us to win the game most of the\\n\",\n    \"time and gives us a good reward.\\n\",\n    \"\\n\",\n    \"Okay, how can we find this?\\n\",\n    \"\\n\",\n    \"Say we pulled arm 1 once and received a +1 reward, and we pulled arm 2 once\\n\",\n    \"and received a 0 reward. Since arm 1 gives a +1 reward, we cannot come to the\\n\",\n    \"conclusion that arm 1 is the best arm immediately after pulling it only once. We need\\n\",\n    \"to pull both of the arms many times and compute the average reward we obtain from\\n\",\n    \"each of the arms, and then we can select the arm that gives the maximum average\\n\",\n    \"reward as the best arm.\\n\",\n    \"\\n\",\n    \"Let's denote the arm by $a$ and define the average reward obtained by pulling arm $a$ as:\\n\",\n    \"\\n\",\n    \"$$ Q(a) = \\\\frac{\\\\text{Sum of rewards obtained from the arm}}{\\\\text{Number of times the arm was pulled}}$$\\n\",\n    \"\\n\",\n    \"where $Q(a)$ denotes the average reward of arm $a$.\\n\",\n    \"The optimal arm $a^*$ is the one that gives us the maximum average reward, that is:\\n\",\n    \"\\n\",\n    \"$$ a^* = \\\\arg \\\\max_a Q(a) $$\\n\",\n    \"\\n\",\n    \"Okay, we have learned that the arm that gives the maximum average reward is the\\n\",\n    \"optimal arm. But how can we find this?\\n\",\n    \"\\n\",\n    \"We play the game for several rounds and we can pull only one arm in each round.\\n\",\n    \"Say in the first round we pull arm 1 and observe the reward, and in the second round\\n\",\n    \"we pull arm 2 and observe the reward. Similarly, in every round, we keep pulling\\n\",\n    \"arm 1 or arm 2 and observe the reward. 
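The two definitions above translate directly into code. Here is a minimal sketch; the reward records are made up purely for illustration:

```python
# Hypothetical reward records from pulling each arm a few times
rewards = {1: [1, 0, 1, 1], 2: [0, 1, 0, 0]}

# Q(a) = sum of rewards obtained from the arm / number of times the arm was pulled
Q = {arm: sum(r) / len(r) for arm, r in rewards.items()}

# a* = arg max_a Q(a), the arm with the maximum average reward
best_arm = max(Q, key=Q.get)

print(Q)         # {1: 0.75, 2: 0.25}
print(best_arm)  # 1
```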
After completing several rounds of the\\n\",\n    \"game, we compute the average reward of each of the arms, and then we select the\\n\",\n    \"arm that has the maximum average reward as the best arm.\\n\",\n    \"\\n\",\n    \"But this is not a good approach to find the best arm. Say we have 20 arms; if we keep\\n\",\n    \"pulling a different arm in each round, then in most of the rounds we will lose the\\n\",\n    \"game and get a 0 reward. Along with finding the best arm, our goal should be to\\n\",\n    \"minimize the cost of identifying the best arm, and this is usually referred to as regret.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, we need to find the best arm while minimizing regret. That is, we need to find\\n\",\n    \"the best arm, but we don't want to end up selecting the arms that make us lose the\\n\",\n    \"game in most of the rounds.\\n\",\n    \"\\n\",\n    \"So, should we explore a different arm in each round, or should we select only the\\n\",\n    \"arm that got us a good reward in the previous rounds? This leads to a situation\\n\",\n    \"called the exploration-exploitation dilemma, which we learned about in Chapter\\n\",\n    \"4, Monte Carlo Methods. So, to resolve this, we use the epsilon-greedy method and\\n\",\n    \"select the arm that got us a good reward in the previous rounds with probability\\n\",\n    \"1-epsilon and select the random arm with probability epsilon. After completing\\n\",\n    \"several rounds, we select the best arm as the one that has the maximum average\\n\",\n    \"reward.\\n\",\n    \"\\n\",\n    \"Similar to the epsilon-greedy method, there are several different exploration\\n\",\n    \"strategies that help us to overcome the exploration-exploitation dilemma. 
In the\\n\",\n    \"upcoming section, we will learn about these exploration strategies in detail\\n\",\n    \"and how they help us to find the optimal arm, but first let's look at\\n\",\n    \"creating a bandit.\\n\",\n    \"\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/6.03. Epsilon-Greedy.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Epsilon-greedy\\n\",\n    \"\\n\",\n    \"We already learned about the epsilon-greedy algorithm in the previous chapters. With the epsilon-greedy, we select the best arm with probability 1-epsilon and we select the random arm with probability epsilon. Let's take a simple example and learn how we find the best arm exactly with the epsilon-greedy method in more detail. \\n\",\n    \"\\n\",\n    \"Say, we have two arms - arm 1 and arm 2. Suppose, with arm 1 we win the game 80% of the time and with arm 2 we win the game with 20% of the time. So, we can say that arm 1 is the best arm as it makes us win the game 80% of the time. Now, let's learn how to find this with the epsilon-greedy method. \\n\",\n    \"\\n\",\n    \"First, we initialize the `count` - number of times the arm is pulled, `sum_rewards` - the sum of rewards obtained from pulling the arm, `Q`- average reward obtained by pulling the arm as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/1.PNG)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Round 1:\\n\",\n    \"\\n\",\n    \"Say, in round 1 of the game, we select the random arm with probability epsilon, suppose we randomly pull the arm 1 and observe the reward. Let the reward obtained by pulling the arm 1 be 1. So, we update our table with `count` of arm 1 to 1, `sum_rewards` of arm 1 to 1 and thus the average reward `Q` of the arm 1 after round 1 will be 1 as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/2.PNG)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Round 2:\\n\",\n    \"\\n\",\n    \"Say, in round 2, we select the best arm with probability 1-epsilon. The best arm is the one which has a maximum average reward. 
So, we check our table to see which arm has the maximum average reward; since arm 1 has the maximum average reward, we pull arm 1 and observe the reward. Let the reward obtained from pulling arm 1 be 1. So, we update our table with the `count` of arm 1 set to 2 and the `sum_rewards` of arm 1 set to 2; thus the average reward `Q` of arm 1 after round 2 will be 1, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/3.PNG)\\n\",\n    \"## Round 3:\\n\",\n    \"\\n\",\n    \"Say in round 3 we select a random arm with probability epsilon; suppose we randomly pull arm 2 and observe the reward. Let the reward obtained by pulling arm 2 be 0. So, we update our table with the `count` of arm 2 set to 1 and the `sum_rewards` of arm 2 set to 0; thus the average reward `Q` of arm 2 after round 3 will be 0, as shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/4.PNG)\\n\",\n    \"\\n\",\n    \"## Round 4:\\n\",\n    \"\\n\",\n    \"Say in round 4 we select the best arm with probability 1-epsilon. So, we pull arm 1 since it has the maximum average reward. Let the reward obtained by pulling arm 1 be 0 this time. Now, we update our table with the `count` of arm 1 set to 3 and the `sum_rewards` of arm 1 set to 2; thus the average reward `Q` of arm 1 after round 4 will be 0.66, as shown below:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/5.PNG)\\n\",\n    \"We repeat this process for several rounds; that is, in each round of the game, we pull the best arm with probability 1-epsilon and a random arm with probability epsilon. The updated table after some 100 rounds of the game is shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/6.PNG)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"From the above table, we can conclude that arm 1 is the best arm since it has the maximum average reward. 
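The table updates walked through above can be reproduced in a few lines of Python. This is only a sketch of the bookkeeping; the per-round arms and rewards are the ones assumed in the walkthrough, with arms 0-indexed in code (arm 1 is index 0):

```python
count = [0, 0]        # number of times each arm was pulled
sum_rewards = [0, 0]  # sum of rewards obtained from each arm
Q = [0.0, 0.0]        # average reward of each arm

# (arm, reward) pairs from rounds 1-4 of the walkthrough
rounds = [(0, 1), (0, 1), (1, 0), (0, 0)]

for arm, reward in rounds:
    count[arm] += 1
    sum_rewards[arm] += reward
    Q[arm] = sum_rewards[arm] / count[arm]

print(Q)  # [0.666..., 0.0] -- arm 1 has the maximum average reward
```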
\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/6.04. Implementing epsilon-greedy .ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing epsilon-greedy \\n\",\n    \"\\n\",\n    \"Now, let's learn how to implement the epsilon-greedy method to find the best arm.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import gym_bandits\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the bandit environment\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For better understanding, let's create the bandit with only two arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"BanditTwoArmedHighLowFixed-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check the probability distribution of the arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.8, 0.2]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.p_dist)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can observe that with arm 1 we win the game with 80% probability and with arm 2 we\\n\",\n    \"win the game with 20% probability. Here, the best arm is arm 1, as with arm 1 we win the\\n\",\n    \"game 80% probability. Now, let's see how to find this best arm using the epsilon-greedy\\n\",\n    \"method. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Initialize the variables\\n\",\n    \"\\n\",\n    \"First, let's initialize the variables:\\n\",\n    \"\\n\",\n    \"Initialize the `count` for storing the number of times an arm is pulled:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"count = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `sum_rewards` for storing the sum of rewards of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sum_rewards = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `Q` for storing the average reward of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define `num_rounds` - number of rounds (iterations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"num_rounds = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the epsilon-greedy method\\n\",\n    \"\\n\",\n    \"First, we generate a random number from a uniform distribution, if the random number is\\n\",\n    \"less than epsilon then pull the random arm else we pull the best arm which has maximum\\n\",\n    \"average reward as shown below: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": 
[\n    \"def epsilon_greedy(epsilon):\\n\",\n    \"    \\n\",\n    \"    if np.random.uniform(0,1) < epsilon:\\n\",\n    \"        return env.action_space.sample()\\n\",\n    \"    else:\\n\",\n    \"        return np.argmax(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Start pulling the arm\\n\",\n    \"\\n\",\n    \"Now, let's play the game and try to find the best arm using the epsilon-greedy method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(num_rounds):\\n\",\n    \"    \\n\",\n    \"    #select the arm based on the epsilon-greedy method\\n\",\n    \"    arm = epsilon_greedy(0.5)\\n\",\n    \"\\n\",\n    \"    #pull the arm and store the reward and next state information\\n\",\n    \"    next_state, reward, done, info = env.step(arm) \\n\",\n    \"\\n\",\n    \"    #increment the count of the arm by 1\\n\",\n    \"    count[arm] += 1\\n\",\n    \"    \\n\",\n    \"    #update the sum of rewards of the arm\\n\",\n    \"    sum_rewards[arm]+=reward\\n\",\n    \"\\n\",\n    \"    #update the average reward of the arm\\n\",\n    \"    Q[arm] = sum_rewards[arm]/count[arm]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all the rounds, we look at the average reward obtained from each of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.77631579 0.20833333]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can select the optimal arm as the one which has a maximum average reward. 
Since arm 1 has a higher average reward than arm 2, our optimal arm will be\\n\",\n    \"arm 1. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"The optimal arm is arm 1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('The optimal arm is arm {}'.format(np.argmax(Q)+1))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/6.06. Implementing Softmax Exploration.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing Softmax Exploration\\n\",\n    \"\\n\",\n    \"Now, let's learn how to implement the softmax exploration to find the best arm.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import gym_bandits\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the bandit environment\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's take the same two-armed bandit we saw in the epsilon-greedy section: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"BanditTwoArmedHighLowFixed-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check the probability distribution of the arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.8, 0.2]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.p_dist)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can observe that with arm 1 we win the game with 80% probability and with arm 2 we\\n\",\n    \"win the game with 20% probability. Here, the best arm is arm 1, as with arm 1 we win the\\n\",\n    \"game 80% probability. Now, let's see how to find this best arm using the softmax exploration method\\n\",\n    \"method. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Initialize the variables\\n\",\n    \"\\n\",\n    \"First, let's initialize the variables:\\n\",\n    \"\\n\",\n    \"Initialize the `count` for storing the number of times an arm is pulled:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"count = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `sum_rewards` for storing the sum of rewards of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sum_rewards = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `Q` for storing the average reward of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define `num_rounds` - number of rounds (iterations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"num_rounds = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the softmax exploration function\\n\",\n    \"\\n\",\n    \"Now, let's define the softmax function with temperature `T` as:\\n\",\n    \"\\n\",\n    \"$$P_t(a) = \\\\frac{\\\\text{exp}(Q_t(a)/T)} {\\\\sum_{i=1}^n \\\\text{exp}(Q_t(i)/T)} $$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def softmax(T):\\n\",\n    \"    
\\n\",\n    \"    #compute the probability of each arm based on the above equation\\n\",\n    \"    denom = sum([np.exp(i/T) for i in Q]) \\n\",\n    \"    probs = [np.exp(i/T)/denom for i in Q]\\n\",\n    \"    \\n\",\n    \"    #select the arm based on the computed probability distribution of arms\\n\",\n    \"    arm = np.random.choice(env.action_space.n, p=probs)\\n\",\n    \"    \\n\",\n    \"    return arm\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Start pulling the arm\\n\",\n    \"\\n\",\n    \"Now, let's play the game and try to find the best arm using the softmax exploration method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's begin by setting the temperature `T` to a high number, say 50:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"T = 50\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(num_rounds):\\n\",\n    \"    \\n\",\n    \"    #select the arm based on the softmax exploration method\\n\",\n    \"    arm = softmax(T)\\n\",\n    \"\\n\",\n    \"    #pull the arm and store the reward and next state information\\n\",\n    \"    next_state, reward, done, info = env.step(arm) \\n\",\n    \"\\n\",\n    \"    #increment the count of the arm by 1\\n\",\n    \"    count[arm] += 1\\n\",\n    \"    \\n\",\n    \"    #update the sum of rewards of the arm\\n\",\n    \"    sum_rewards[arm]+=reward\\n\",\n    \"\\n\",\n    \"    #update the average reward of the arm\\n\",\n    \"    Q[arm] = sum_rewards[arm]/count[arm]\\n\",\n    \"    \\n\",\n    \"    #reduce the temperature\\n\",\n    \"    T = T*0.99\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all 
the rounds, we look at the average reward obtained from each of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.84090909 0.17857143]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can select the optimal arm as the one that has the maximum average reward. Since arm 1 has a higher average reward than arm 2, the optimal arm is\\n\",\n    \"arm 1. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"The optimal arm is arm 1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('The optimal arm is arm {}'.format(np.argmax(Q)+1))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/6.08. Implementing UCB.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing UCB\\n\",\n    \"\\n\",\n    \"Now, let's learn how to implement the UCB algorithm to find the best arm.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import gym_bandits\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the bandit environment\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's take the same two-armed bandit we saw in the epsilon-greedy section: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"BanditTwoArmedHighLowFixed-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check the probability distribution of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.8, 0.2]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.p_dist)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can observe that with arm 1 we win the game with 80% probability and with arm 2 we\\n\",\n    \"win the game with 20% probability. Here, the best arm is arm 1, as with arm 1 we win the\\n\",\n    \"game with 80% probability. 
Now, let's see how to find this best arm using the UCB method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Initialize the variables\\n\",\n    \"\\n\",\n    \"First, let's initialize the variables:\\n\",\n    \"\\n\",\n    \"Initialize the `count` for storing the number of times an arm is pulled:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"count = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `sum_rewards` for storing the sum of rewards of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sum_rewards = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize `Q` for storing the average reward of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define `num_rounds`, the number of rounds (iterations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"num_rounds = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the UCB function\\n\",\n    \"\\n\",\n    \"Now, we define the `UCB` function, which returns the arm that has the\\n\",\n    \"highest upper confidence bound (UCB): \\n\",\n    \"\\n\",\n    \"$$ \\\\text{UCB}(a) = Q(a) + \\\\sqrt{\\\\frac{2 \\\\log(t)}{N(a)}}  --- (1) $$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   
\"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def UCB(i):\\n\",\n    \"    \\n\",\n    \"    #initialize the numpy array for storing the UCB of all the arms\\n\",\n    \"    ucb = np.zeros(2)\\n\",\n    \"    \\n\",\n    \"    #before computing the UCB, we explore all the arms at least once, so for the first 2 rounds,\\n\",\n    \"    #we directly select the arm corresponding to the round number\\n\",\n    \"    if i < 2:\\n\",\n    \"        return i\\n\",\n    \"    \\n\",\n    \"    #from round 2 onwards, we compute the UCB of all the arms as specified in\\n\",\n    \"    #equation (1) and return the arm which has the highest UCB:\\n\",\n    \"    else:\\n\",\n    \"        for arm in range(2):\\n\",\n    \"            ucb[arm] = Q[arm] + np.sqrt((2*np.log(sum(count))) / count[arm])\\n\",\n    \"        return (np.argmax(ucb))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Start pulling the arm\\n\",\n    \"\\n\",\n    \"Now, let's play the game and try to find the best arm using the UCB method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(num_rounds):\\n\",\n    \"    \\n\",\n    \"    #select the arm based on the UCB method\\n\",\n    \"    arm = UCB(i)\\n\",\n    \"\\n\",\n    \"    #pull the arm and store the reward and next state information\\n\",\n    \"    next_state, reward, done, info = env.step(arm) \\n\",\n    \"\\n\",\n    \"    #increment the count of the arm by 1\\n\",\n    \"    count[arm] += 1\\n\",\n    \"    \\n\",\n    \"    #update the sum of rewards of the arm\\n\",\n    \"    sum_rewards[arm]+=reward\\n\",\n    \"\\n\",\n    \"    #update the average reward of the arm\\n\",\n    \"    Q[arm] = sum_rewards[arm]/count[arm]\\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": 
{},\n   \"source\": [\n    \"After all the rounds, we look at the average reward obtained from each of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.72289157 0.33333333]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can select the optimal arm as the one that has the maximum average reward. Since arm 1 has a higher average reward than arm 2, the optimal arm is\\n\",\n    \"arm 1. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"The optimal arm is arm 1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('The optimal arm is arm {}'.format(np.argmax(Q)+1))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/6.10. Implementing Thompson Sampling.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing Thompson sampling\\n\",\n    \"\\n\",\n    \"Now, let's learn how to implement the Thompson sampling method to find the best arm.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import gym_bandits\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the bandit environment\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's take the same two-armed bandit we saw in the previous section: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"BanditTwoArmedHighLowFixed-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check the probability distribution of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.8, 0.2]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(env.p_dist)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can observe that with arm 1 we win the game with 80% probability and with arm 2 we\\n\",\n    \"win the game with 20% probability. Here, the best arm is arm 1, as with arm 1 we win the\\n\",\n    \"game with 80% probability. 
Now, let's see how to find this best arm using the Thompson sampling method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Initialize the variables\\n\",\n    \"\\n\",\n    \"First, let's initialize the variables:\\n\",\n    \"\\n\",\n    \"Initialize the `count` for storing the number of times an arm is pulled:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"count = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `sum_rewards` for storing the sum of rewards of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sum_rewards = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize `Q` for storing the average reward of each arm:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = np.zeros(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define `num_rounds`, the number of rounds (iterations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"num_rounds = 100\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `alpha` value with 1 for both the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"alpha = np.ones(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the `beta` value with 1 for 
both the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"beta = np.ones(2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the Thompson Sampling function \\n\",\n    \"\\n\",\n    \"Now, let's define the `thompson_sampling` function.\\n\",\n    \"\\n\",\n    \"As shown below, we randomly sample a value from the beta distribution of each of the arms\\n\",\n    \"and return the arm which has the maximum sampled value: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def thompson_sampling(alpha,beta):\\n\",\n    \"    \\n\",\n    \"    samples = [np.random.beta(alpha[i]+1,beta[i]+1) for i in range(2)]\\n\",\n    \"\\n\",\n    \"    return np.argmax(samples)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Start pulling the arm\\n\",\n    \"\\n\",\n    \"Now, let's play the game and try to find the best arm using the Thompson sampling\\n\",\n    \"method.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(num_rounds):\\n\",\n    \"    \\n\",\n    \"    #select the arm based on the Thompson sampling method\\n\",\n    \"    arm = thompson_sampling(alpha,beta)\\n\",\n    \"\\n\",\n    \"    #pull the arm and store the reward and next state information\\n\",\n    \"    next_state, reward, done, info = env.step(arm) \\n\",\n    \"\\n\",\n    \"    #increment the count of the arm by 1\\n\",\n    \"    count[arm] += 1\\n\",\n    \"    \\n\",\n    \"    #update the sum of rewards of the arm\\n\",\n    \"    sum_rewards[arm]+=reward\\n\",\n    \"\\n\",\n    \"    #update the average reward of the arm\\n\",\n    \"    Q[arm] = 
sum_rewards[arm]/count[arm]\\n\",\n    \"\\n\",\n    \"    #if we win the game, that is, if the reward is equal to 1, then we update the value of alpha as \\n\",\n    \"    #alpha = alpha + 1 else we update the value of beta as beta = beta + 1\\n\",\n    \"    if reward==1:\\n\",\n    \"        alpha[arm] = alpha[arm] + 1\\n\",\n    \"    else:\\n\",\n    \"        beta[arm] = beta[arm] + 1\\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all the rounds, we look at the average reward obtained from each of the arms:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[0.77659574 0.33333333]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can select the optimal arm as the one that has the maximum average reward. Since arm 1 has a higher average reward than arm 2, the optimal arm is\\n\",\n    \"arm 1. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"The optimal arm is arm 1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('The optimal arm is arm {}'.format(np.argmax(Q)+1))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/6.12. Finding the Best Advertisement Banner using Bandits.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Finding the best advertisement banner using bandits\\n\",\n    \"\\n\",\n    \"In this section, let's understand how to find the best advertisement banner using\\n\",\n    \"bandits. Suppose we are running a website and we have five different banners for a single\\n\",\n    \"advertisement that is shown on our website, and say we want to figure out which\\n\",\n    \"advertisement banner users like the most.\\n\",\n    \"\\n\",\n    \"We can frame this problem as a multi-armed bandit problem. The five advertisement\\n\",\n    \"banners represent the five arms of the bandit, and we assign +1 reward if the user clicks the\\n\",\n    \"advertisement and 0 reward if the user does not click the advertisement. So, to find out\\n\",\n    \"which advertisement banner is most clicked by the user, that is, which advertisement\\n\",\n    \"banner can give us the maximum reward, we can use various exploration strategies. In this\\n\",\n    \"section, let's just use the epsilon-greedy method to find the best advertisement\\n\",\n    \"banner.\\n\",\n    \"\\n\",\n    \"First, let us import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import seaborn as sns\\n\",\n    \"\\n\",\n    \"%matplotlib inline\\n\",\n    \"plt.style.use('ggplot')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a dataset\\n\",\n    \"\\n\",\n    \"Now, let's create a dataset. 
We generate a dataset with five columns denoting the five\\n\",\n    \"advertisement banners and 100000 rows, where the values in the rows will be\\n\",\n    \"either 0 or 1, indicating whether the advertisement banner has been clicked (1) or not\\n\",\n    \"clicked (0) by the user:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = pd.DataFrame()\\n\",\n    \"for i in range(5):\\n\",\n    \"    df['Banner_type_'+str(i)] = np.random.randint(0,2,100000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's look at the first few rows of our dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Banner_type_0</th>\\n\",\n       \"      <th>Banner_type_1</th>\\n\",\n       \"      <th>Banner_type_2</th>\\n\",\n       \"      <th>Banner_type_3</th>\\n\",\n       \"      <th>Banner_type_4</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"     
 <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Banner_type_0  Banner_type_1  Banner_type_2  Banner_type_3  Banner_type_4\\n\",\n       \"0              0              1              0              1              1\\n\",\n       \"1              1              1              1              0              0\\n\",\n       \"2              0              1              1              0              0\\n\",\n       \"3              1              1              1              0              0\\n\",\n       \"4              0              1              0              0              0\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As 
we can observe, we have the 5 advertisement banners (0 to 4) and\\n\",\n    \"the rows consist of values 0 or 1 indicating whether the banner has been clicked (1) or not clicked (0). \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Initialize the variables\\n\",\n    \"\\n\",\n    \"Now, let's initialize some of the important variables:\\n\",\n    \"\\n\",\n    \"Set the number of iterations:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_iterations = 100000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of banners:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_banner = 5\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize count for storing the number of times each banner was selected:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"count = np.zeros(num_banner)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize sum_rewards for storing the sum of rewards obtained from each banner: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sum_rewards = np.zeros(num_banner)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize Q for storing the mean reward of each banner:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"Q = np.zeros(num_banner)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   
\"metadata\": {},\n   \"source\": [\n    \"Define the list for storing the selected banners:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"banner_selected = []\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Define the epsilon-greedy method\\n\",\n    \"\\n\",\n    \"Now, let's define the epsilon-greedy method. We generate a random value from a uniform\\n\",\n    \"distribution. If the random value is less than epsilon, then we select a random banner; else,\\n\",\n    \"we select the best banner, that is, the one with the maximum average reward:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def epsilon_greedy_policy(epsilon):\\n\",\n    \"    \\n\",\n    \"    if np.random.uniform(0,1) < epsilon:\\n\",\n    \"        return  np.random.choice(num_banner)\\n\",\n    \"    else:\\n\",\n    \"        return np.argmax(Q)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Run the bandit test\\n\",\n    \"\\n\",\n    \"Now, we run the epsilon-greedy policy to understand which is the best advertisement\\n\",\n    \"banner:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each iteration\\n\",\n    \"for i in range(num_iterations):\\n\",\n    \"    \\n\",\n    \"    #select the banner using the epsilon-greedy policy\\n\",\n    \"    banner = epsilon_greedy_policy(0.5)\\n\",\n    \"    \\n\",\n    \"    #get the reward of the banner\\n\",\n    \"    reward = df.values[i, banner]\\n\",\n    \"    \\n\",\n    \"    #increment the counter\\n\",\n    \"    count[banner] += 1\\n\",\n    \"    \\n\",\n    \"    #store the sum of rewards\\n\",\n    \"    sum_rewards[banner]+=reward\\n\",\n    \"    \\n\",\n    \"    #compute the average reward\\n\",\n    \"    Q[banner] = sum_rewards[banner]/count[banner]\\n\",\n    \"    \\n\",\n    \"    #store the banner to the banner selected list\\n\",\n    \"    banner_selected.append(banner)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After all the rounds, we can select the best banner as the one which has the maximum\\n\",\n    \"average reward:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"The optimal banner is banner 4\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print( 'The optimal banner is banner {}'.format(np.argmax(Q)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also plot and see which banner is selected most of the time:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"image/png\": 
\"iVBORw0KGgoAAAANSUhEUgAAAZQAAAEJCAYAAACzPdE9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3df1DTd57H8WfCDxWjmATFlWIrKnPiykBNt2BbRMzVndLZc9TaaWu72rq2g6vTuuP4o3uyO3t69CiFcotnf1i6ve72x1nr7c3tXO9YRpkr6zYWwva0rVDrdTlBJIliEIuQ3B+2aV1BUb8mAq/HX8kn32++7893El58vp9vvl9TMBgMIiIico3MkS5ARESGBgWKiIgYQoEiIiKGUKCIiIghFCgiImIIBYqIiBgiOtIFRNqxY8ciXYKIyKAyadKkPts1QhEREUOEZYTS3d1NYWEhPT099Pb2kpWVxdKlS6moqODQoUPExcUBsHr1am655RaCwSCVlZXU19czYsQICgoKSElJAWDv3r3s3r0bgEWLFpGbmwvAkSNHqKiooLu7m8zMTFasWIHJZApH90REhDAFSkxMDIWFhYwcOZKenh62bNlCRkYGAA8//DBZWVkXLF9fX09rayvl5eU0Njby8ssvs23bNvx+P7t27aKoqAiAjRs34nA4sFgsvPTSSzz++ONMnz6dv//7v8ftdpOZmRmO7omICGE65GUymRg5ciQAvb299Pb2XnL0cODAAXJycjCZTKSmptLZ2YnP58PtdpOeno7FYsFisZCeno7b7cbn89HV1UVqaiomk4mcnBxcLlc4uiYiIl8J26R8IBBgw4YNtLa2smDBAqZPn85//ud/8sYbb7Br1y6++93v8tBDDxETE4PX6yUhISG0rt1ux+v14vV6sdvtoXabzdZn+9fL96WqqoqqqioAioqKLtiOiIhcvbAFitlspri4mM7OTp599lm++OILHnzwQcaNG0dPTw8vvPAC//qv/8qSJUvo63qV/Y1oTCZTn8v3x+l04nQ6Q8/b29uvvDMiIsPYDXOW1+jRo0lLS8PtdmO1WjGZTMTExDBv3jyampqA8yOMb/+h93g8WK1WbDYbHo8n1O71erFardjt9gvaPR4PNpstfJ0SEZHwBEpHRwednZ3A+TO+PvroI5KSkvD5fAAEg0FcLhfJyckAOBwOampqCAaDHD58mLi4OKxWKxkZGTQ0NOD3+/H7/TQ0NJCRkYHVamXUqFEcPnyYYDBITU0NDocjHF0TEZGvhOWQl8/no6KigkAgQDAYJDs7m9mzZ/Pzn/+cjo4OAG6++WZWrVoFQGZmJnV1daxdu5bY2FgKCgoAsFgsLF68mE2bNgGwZMkSLBYLACtXrmT79u10d3eTkZGhM7xERMLMNNxvsKVfyovItTB/XBLpEq6LwIyf9PvaDTOHIiIiQ5MCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARERFDKFBERMQQChQRETGEAkVERAyhQBEREUMoUERExBAKFBERMYQCRUREDKFAERERQ4TtnvIig917v22JdAnXxYIffCfSJcgQoRGKiIgYQoEiIiKGUKCIiIghFCgiImIIBYqIiBgiLGd5dXd3U1hYSE9PD729vWRlZbF06VLa2tooKyvD7/czZcoU1qxZQ3R0NOfOneOXv/wlR44cYcyYMTz55JNMmDABgHfffZfq6mrMZjMrVqwgIyMDALfbTWVlJYFAgPnz57Nw4cJwdE1ERL4SlhFKTEwMhYWFFBcX8w//8A+43W4OHz7M66+/Tn5+PuXl5YwePZrq6moAqqurGT16NP/4j/9Ifn4+v/71rwFobm6mtraW5557jqeffpqdO3cSCAQIBALs3LmTzZs3U1payvvvv09zc3M4uiYiIl8JS6CYTCZGjhwJQG9vL729vZhMJg4ePEhWVhYAubm5uFwuAA4cOEBubi4AWVlZ/M///A
/BYBCXy8WcOXOIiYlhwoQJTJw4kaamJpqampg4cSKJiYlER0czZ86c0HuJiEh4hO2HjYFAgA0bNtDa2sqCBQtITEwkLi6OqKgoAGw2G16vFwCv14vdbgcgKiqKuLg4Tp8+jdfrZfr06aH3/PY6Xy//9ePGxsY+66iqqqKqqgqAoqIiEhISjO+sDFFD84eN+g5cG2+kC7hOruZzEbZAMZvNFBcX09nZybPPPsv//d//9btsMBi8qM1kMvXZfqnl++J0OnE6naHn7e3tlytdZEjTd+DaDNUzmy71uZg0aVKf7WHfF6NHjyYtLY3GxkbOnDlDb28vcH5UYrPZgPMjDI/HA5w/RHbmzBksFssF7d9e5y/bPR4PVqs1jL0SEZGwBEpHRwednZ3A+TO+PvroI5KSkpg5cyb79+8HYO/evTgcDgBmz57N3r17Adi/fz8zZ87EZDLhcDiora3l3LlztLW10dLSwrRp05g6dSotLS20tbXR09NDbW1t6L1ERCQ8wnLIy+fzUVFRQSAQIBgMkp2dzezZs7npppsoKyvjzTffZMqUKeTl5QGQl5fHL3/5S9asWYPFYuHJJ58EIDk5mezsbNatW4fZbOaxxx7DbD6fiY8++ihbt24lEAgwb948kpOTw9E1ERH5iinY38TEMHHs2LFIlyCDhK42LH0xf1wS6RKui8CMn/T72g0zhyIiIkOTAkVERAyhQBEREUMoUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARERFDKFBERMQQChQRETGEAkVERAyhQBEREUMoUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExRHQ4NtLe3k5FRQUnT57EZDLhdDq55557ePvtt/n973/P2LFjAXjggQe49dZbAXj33Xeprq7GbDazYsUKMjIyAHC73VRWVhIIBJg/fz4LFy4EoK2tjbKyMvx+P1OmTGHNmjVER4eleyIiQpgCJSoqiocffpiUlBS6urrYuHEj6enpAOTn5/ODH/zgguWbm5upra3lueeew+fz8Ytf/ILnn38egJ07d/LTn/4Uu93Opk2bcDgc3HTTTbz++uvk5+dzxx138OKLL1JdXc3dd98dju6JiAhhOuRltVpJSUkBYNSoUSQlJeH1evtd3uVyMWfOHGJiYpgwYQITJ06kqamJpqYmJk6cSGJiItHR0cyZMweXy0UwGOTgwYNkZWUBkJubi8vlCkfXRETkK2E/JtTW1sbnn3/OtGnT+OSTT3jvvfeoqakhJSWFRx55BIvFgtfrZfr06aF1bDZbKIDsdnuo3W6309jYyOnTp4mLiyMqKuqi5f9SVVUVVVVVABQVFZGQkHC9uipDTkukC7gu9B24Nv3/azy4Xc3nIqyBcvbsWUpKSli+fDlxcXHcfffdLFmyBIC33nqL1157jYKCAoLBYJ/r99VuMpmuqAan04nT6Qw9b29vv6L1RYYafQeuzVA9s+lSn4tJkyb12R62fdHT00NJSQl33XUXt99+OwDjxo3DbDZjNpuZP38+n332GXB+5OHxeELrer1ebDbbRe0ejwer1cqYMWM4c+YMvb29FywvIiLhE5ZACQaD7Nixg6SkJO69995Qu8/nCz3+4IMPSE5OBsDhcFBbW8u5c+doa2ujpaWFadOmMXXqVFpaWmhra6Onp4fa2locDgcmk4mZM2eyf/9+APbu3YvD4QhH10RE5CthOeT16aefUlNTw+TJk1m/fj1w/hTh999/n6NHj2IymRg/fjyrVq0CIDk5mezsbNatW4fZbOaxxx7DbD6ffY8++ihbt24lEAgwb968UAg99NBDlJWV8eabbzJlyhTy8vLC0TUREfmKKdjfhMUwcezYsUiXIIPEe78dmpPyC37wnUiXMKiZPy6JdAnXRWDGT/p9LeJzKCIiMrQpUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARER
FDKFBERMQQChQRETGEAkVERAyhQBEREUMoUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARERFDKFBERMQQ0QNd8A9/+APZ2dkXte/fv5+srKxLrtve3k5FRQUnT57EZDLhdDq555578Pv9lJaWcuLECcaPH89TTz2FxWIhGAxSWVlJfX09I0aMoKCggJSUFAD27t3L7t27AVi0aBG5ubkAHDlyhIqKCrq7u8nMzGTFihWYTKaBdk9ERK7RgEcoO3bs6LP9hRdeuOy6UVFRPPzww5SWlrJ161bee+89mpub2bNnD7NmzaK8vJxZs2axZ88eAOrr62ltbaW8vJxVq1bx8ssvA+D3+9m1axfbtm1j27Zt7Nq1C7/fD8BLL73E448/Tnl5Oa2trbjd7oF2TUREDHDZQDl+/DjHjx8nEAjQ1tYWen78+HH+9Kc/ERsbe9mNWK3W0Ahj1KhRJCUl4fV6cblczJ07F4C5c+ficrkAOHDgADk5OZhMJlJTU+ns7MTn8+F2u0lPT8disWCxWEhPT8ftduPz+ejq6iI1NRWTyUROTk7ovUREJDwue8hr7dq1ocdr1qy54LVx48Zx3333XdEG29ra+Pzzz5k2bRqnTp3CarUC50Ono6MDAK/XS0JCQmgdu92O1+vF6/Vit9tD7Tabrc/2r5fvS1VVFVVVVQAUFRVdsB2RS2uJdAHXhb4D16bvvzSD39V8Li4bKG+99RYAhYWF/PznP7/yqr7l7NmzlJSUsHz5cuLi4vpdLhgMXtTW33yIyWTqc/n+OJ1OnE5n6Hl7e/uA1xUZivQduDZD9cymS30uJk2a1Gf7gPfFtYZJT08PJSUl3HXXXdx+++0AxMfH4/P5APD5fIwdOxY4P8L4dmc8Hg9WqxWbzYbH4wm1e71erFYrdrv9gnaPx4PNZrumekVE5MoM+CyvtrY23njjDY4ePcrZs2cveO2f/umfLrluMBhkx44dJCUlce+994baHQ4H+/btY+HChezbt4/bbrst1P4f//Ef3HHHHTQ2NhIXF4fVaiUjI4M33ngjNBHf0NDAgw8+iMViYdSoURw+fJjp06dTU1PD97///QHvBBERuXYDDpTnn3+exMREHnnkEUaMGHFFG/n000+pqalh8uTJrF+/HoAHHniAhQsXUlpaSnV1NQkJCaxbtw6AzMxM6urqWLt2LbGxsRQUFABgsVhYvHgxmzZtAmDJkiVYLBYAVq5cyfbt2+nu7iYjI4PMzMwrqlFERK6NKTjACYgf/vCHVFZWYjYPrSOGx44di3QJMki899uhOSm/4AffiXQJg5r545JIl3BdBGb8pN/XrnkOZcaMGRw9evSKixIRkeFhwIe8xo8fz9atW/ne977HuHHjLnjt/vvvN7wwEREZXAYcKF9++SWzZ8+mt7f3gjOqRERE4AoC5euJcRERkb4MOFCOHz/e72uJiYmGFCMiIoPXgAPl25dg+Utf/5peRESGrwEHyl+GxsmTJ/mXf/kXZsyYYXhRIiIy+Fz1j0rGjRvH8uXL+c1vfmNkPSIiMkhd068Ujx07xpdffmlULSIiMogN+JDXli1bLrji75dffsmf//xnlixZcl0KExGRwWXAgZKXl3fB85EjR3LzzTfzne/osg0iInIFgfL1vdtFRET6MuBA6enpYffu3dTU1ODz+bBareTk5LBo0SKiowf8NiIiMkQNOAlef/11PvvsM370ox8xfvx4Tpw4wTvvvMOZM2dYvnz5dSxRREQGgwEHyv79+ykuLmbMmDHA+csXT5kyhfXr1ytQRERk4KcNX8l920VEZPgZ8AglOzubZ555hiVLlpCQkEB7ezvvvPMOWVlZ17M+EREZJAYcKMuWLeOdd95h586d+Hw+bDYbd9xxB4sXL76e9YmIyCBx2UD55JNPOHDgAMuWLeP++++/4GZar7/+OkeOHCE1NfW6FikiIje+y86hvPvuu6SlpfX52ne/+11279
5teFEiIjL4XDZQjh49SkZGRp+vzZo1i88//9zwokREZPC57CGvrq4uenp6iI2Nvei13t5eurq6LruR7du3U1dXR3x8PCUlJQC8/fbb/P73v2fs2LEAPPDAA9x6663A+VFRdXU1ZrOZFStWhALN7XZTWVlJIBBg/vz5LFy4EIC2tjbKysrw+/1MmTKFNWvW6MeWIiJhdtkRSlJSEg0NDX2+1tDQQFJS0mU3kpuby+bNmy9qz8/Pp7i4mOLi4lCYNDc3U1tby3PPPcfTTz/Nzp07CQQCBAIBdu7cyebNmyktLeX999+nubkZOD+Xk5+fT3l5OaNHj6a6uvqyNYmIiLEuGyj5+fm8+OKL/PGPfyQQCAAQCAT44x//yEsvvUR+fv5lN5KWlobFYhlQQS6Xizlz5hATE8OECROYOHEiTU1NNDU1MXHiRBITE4mOjmbOnDm4XC6CwSAHDx4Mnb6cm5uLy+Ua0LZERMQ4lz0udOedd3Ly5EkqKio4d+4cY8eOpaOjg9jYWO677z7uvPPOq974e++9R01NDSkpKTzyyCNYLBa8Xi/Tp08PLWOz2fB6vQDY7fZQu91up7GxkdOnTxMXF0dUVNRFy/elqqqKqqoqAIqKikhISLjq+mW4aYl0AdeFvgPXpv+/NoPb1XwuBjTRcO+995KXl8fhw4fx+/1YLBZSU1OJi4u74g1+7e677w7dS+Wtt97itddeo6CgoN9f5PfV/u37swyU0+nE6XSGnre3t1/xe4gMJfoOXJtrukvhDexSn4tJkyb12T7gmeu4uLh+z/a6GuPGjQs9nj9/Ps888wxwfuTh8XhCr3m9Xmw2G8AF7R6PB6vVypgxYzhz5gy9vb1ERUVdsLyIiIRPxMLV5/OFHn/wwQckJycD4HA4qK2t5dy5c7S1tdHS0sK0adOYOnUqLS0ttLW10dPTQ21tLQ6HA5PJxMyZM9m/fz8Ae/fuxeFwRKRPIiLDWVjOrS0rK+PQoUOcPn2aJ554gqVLl3Lw4EGOHj2KyWRi/PjxrFq1CoDk5GSys7NZt24dZrOZxx57DLP5fO49+uijbN26lUAgwLx580Ih9NBDD1FWVsabb77JlClTLrq7pIiIXH+m4DC/jPCxY8ciXYIMEu/9dmhOyi/4gW7jfS3MH5dEuoTrIjDjJ/2+1t8cylCdTxIRkTBToIiIiCEUKCIiYggFioiIGEKBIiIihlCgiIiIIRQoIiJiCAWKiIgYQoEiIiKGUKCIiIghdJ/cPrSsXxnpEq6L7xS/HOkSRGQI0whFREQMoUARERFDKFBERMQQmkORS1r+qz9EuoTr4tUfZke6BJEhRyMUERExhAJFREQMoUARERFDKFBERMQQChQRETGEAkVERAwRltOGt2/fTl1dHfHx8ZSUlADg9/spLS3lxIkTjB8/nqeeegqLxUIwGKSyspL6+npGjBhBQUEBKSkpAOzdu5fdu3cDsGjRInJzcwE4cuQIFRUVdHd3k5mZyYoVKzCZTOHomoiIfCUsI5Tc3Fw2b958QduePXuYNWsW5eXlzJo1iz179gBQX19Pa2sr5eXlrFq1ipdfPn/9Kb/fz65du9i2bRvbtm1j165d+P1+AF566SUef/xxysvLaW1txe12h6NbIiLyLWEZoaSlpdHW1nZBm8vl4mc/+xkAc+fO5Wc/+xnLli3jwIED5OTkYDKZSE1NpbOzE5/Px8GDB0lPT8disQCQnp6O2+1m5syZdHV1kZqaCkBOTg4ul4vMzMxwdE1kWCovL490CdfF2rVrI13CoBaxX8qfOnUKq9UKgNVqpaOjAwCv10tCQkJoObvdjtfrxev1YrfbQ+02m63P9q+X709VVRVVVVUAFBUVXbCtr7VcW9duWH31dbi6un0xND8Z+lx842r2Rf9/bQa3q9kXN9ylV4LB4EVt/c2HmEymPpe/FKfTidPpDD1vb2+/sgIHseHU18vRvviG9sU3rmZfDNUzmy61LyZNmtRne8T2RXx8PD
6fDwCfz8fYsWOB8yOMb3fE4/FgtVqx2Wx4PJ5Qu9frxWq1YrfbL2j3eDzYbLYw9UJERL4WsUBxOBzs27cPgH379nHbbbeF2mtqaggGgxw+fJi4uDisVisZGRk0NDTg9/vx+/00NDSQkZGB1Wpl1KhRHD58mGAwSE1NDQ6HI1LdEhEZtsJyyKusrIxDhw5x+vRpnnjiCZYuXcrChQspLS2lurqahIQE1q1bB0BmZiZ1dXWsXbuW2NhYCgoKALBYLCxevJhNmzYBsGTJktAE/cqVK9m+fTvd3d1kZGRoQl5EJALCEihPPvlkn+1btmy5qM1kMrFyZd+34M3LyyMvL++i9qlTp4Z+3yIiIpExVOeTREQkzBQoIiJiCAWKiIgYQoEiIiKGUKCIiIghFCgiImIIBYqIiBhCgSIiIoZQoIiIiCEUKCIiYggFioiIGEKBIiIihlCgiIiIIRQoIiJiCAWKiIgYQoEiIiKGUKCIiIghFCgiImIIBYqIiBhCgSIiIoZQoIiIiCGiI13A6tWrGTlyJGazmaioKIqKivD7/ZSWlnLixAnGjx/PU089hcViIRgMUllZSX19PSNGjKCgoICUlBQA9u7dy+7duwFYtGgRubm5EeyViMjwE/FAASgsLGTs2LGh53v27GHWrFksXLiQPXv2sGfPHpYtW0Z9fT2tra2Ul5fT2NjIyy+/zLZt2/D7/ezatYuioiIANm7ciMPhwGKxRKpLIiLDzg15yMvlcjF37lwA5s6di8vlAuDAgQPk5ORgMplITU2ls7MTn8+H2+0mPT0di8WCxWIhPT0dt9sdyS6IiAw7N8QIZevWrQD89V//NU6nk1OnTmG1WgGwWq10dHQA4PV6SUhICK1nt9vxer14vV7sdnuo3Waz4fV6w9gDERGJeKD84he/wGazcerUKf7u7/6OSZMm9btsMBi8qM1kMvW5bH/tVVVVVFVVAVBUVHRBQH2tZSCFD0J99XW4urp9MTQ/GfpcfONq9sVQ/df1avZFxAPFZrMBEB8fz2233UZTUxPx8fH4fD6sVis+ny80v2K322lvbw+t6/F4sFqt2Gw2Dh06FGr3er2kpaX1uT2n04nT6Qw9//b7DXXDqa+Xo33xDe2Lb1zNvrgh5w0McKl90d8//hHdF2fPnqWrqyv0+E9/+hOTJ0/G4XCwb98+APbt28dtt90GgMPhoKamhmAwyOHDh4mLi8NqtZKRkUFDQwN+vx+/309DQwMZGRkR65eIyHAU0RHKqVOnePbZZwHo7e3lzjvvJCMjg6lTp1JaWkp1dTUJCQmsW7cOgMzMTOrq6li7di2xsbEUFBQAYLFYWLx4MZs2bQJgyZIlOsNLRCTMIhooiYmJFBcXX9Q+ZswYtmzZclG7yWRi5cqVfb5XXl4eeXl5htcoIiIDM1QP/4mISJgpUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARERFDKFBERMQQChQRETGEAkVERAyhQBEREUMoUERExBAKFBERMYQCRUREDKFAERERQyhQRETEEAoUERExhAJFREQMoUARERFDKFBERMQQ0ZEuwEhut5vKykoCgQDz589n4cKFkS5JRGTYGDIjlEAgwM6dO9m8eTOlpaW8//77NDc3R7osEZFhY8gESlNTExMnTiQxMZHo6GjmzJmDy+WKdFkiIsOGKRgMBiNdhBH279+P2+3miSeeAKCmpobGxkYee+yxC5arqqqiqqoKgKKiorDXKSIyVA2ZEUpfuWgymS5qczqdFBUV3TBhsnHjxkiXcMPQvviG9sU3tC++caPviyETKHa7HY/HE3ru8XiwWq0RrEhEZHgZMoEydepUWlpaaGtro6enh9raWhwOR6TLEhEZNobMacNRUVE8+uijbN26lUAgwLx580hOTo50WZfldDojXcINQ/viG9oX39C++MaNvi+GzKS8iIhE1pA55CUiIpGlQBEREUMMmTmUwUiXijlv+/
bt1NXVER8fT0lJSaTLiaj29nYqKio4efIkJpMJp9PJPffcE+myIqK7u5vCwkJ6enro7e0lKyuLpUuXRrqsiAkEAmzcuBGbzXbDnj6sQImQry8V89Of/hS73c6mTZtwOBzcdNNNkS4t7HJzc/n+979PRUVFpEuJuKioKB5++GFSUlLo6upi48aNpKenD8vPRUxMDIWFhYwcOZKenh62bNlCRkYGqampkS4tIn73u9+RlJREV1dXpEvplw55RYguFfONtLQ0LBZLpMu4IVitVlJSUgAYNWoUSUlJeL3eCFcVGSaTiZEjRwLQ29tLb29vnz9WHg48Hg91dXXMnz8/0qVckkYoEeL1erHb7aHndrudxsbGCFYkN5q2tjY+//xzpk2bFulSIiYQCLBhwwZaW1tZsGAB06dPj3RJEfHqq6+ybNmyG3p0AhqhRMxALxUjw9PZs2cpKSlh+fLlxMXFRbqciDGbzRQXF7Njxw4+++wzvvjii0iXFHYffvgh8fHxoZHrjUwjlAjRpWKkPz09PZSUlHDXXXdx++23R7qcG8Lo0aNJS0vD7XYzefLkSJcTVp9++ikHDhygvr6e7u5uurq6KC8vZ+3atZEu7SIKlAj59qVibDYbtbW1N+QHRMIrGAyyY8cOkpKSuPfeeyNdTkR1dHQQFRXF6NGj6e7u5qOPPuJv/uZvIl1W2D344IM8+OCDABw8eJB/+7d/u2H/VihQImSwXirmeigrK+PQoUOcPn2aJ554gqVLl5KXlxfpsiLi008/paamhsmTJ7N+/XoAHnjgAW699dYIVxZ+Pp+PiooKAoEAwWCQ7OxsZs+eHemy5BJ06RURETGEJuVFRMQQChQRETGEAkVERAyhQBEREUMoUERExBAKFBERMYR+hyJigNWrV3Py5EnMZjPR0dGkpqbyox/9iISEhEiXJhI2GqGIGGTDhg388z//My+88ALx8fG88sorkS7pAr29vZEuQYY4jVBEDBYbG0tWVha/+tWvAKirq+PNN9/k+PHjxMXFMW/evNCNotra2vjxj39MQUEBb731Ft3d3eTn57No0SIA3n77bZqbm4mNjeWDDz4gISGB1atXM3XqVOD8VatfeeUVPv74Y1GSVAYAAAKOSURBVEaOHEl+fn7ohlxvv/02f/7zn4mJieHDDz/kkUceueEvfy6Dm0YoIgb78ssvqa2tDV1qfcSIEfz4xz+msrKSjRs38l//9V988MEHF6zzySef8Pzzz/O3f/u37Nq1i+bm5tBrH374IXPmzOHVV1/F4XCERj6BQIBnnnmGW265hRdeeIEtW7bwu9/9DrfbHVr3wIEDZGVlUVlZyV133RWG3stwphGKiEGKi4uJiori7NmzxMfH8/TTTwMwc+bM0DI333wzd9xxB4cOHeJ73/teqP2+++4jNjaWW265hZtvvpn//d//Dd2l8a/+6q9C1/LKycnh3//93wH47LPP6OjoYMmSJQAkJiYyf/58amtrycjIACA1NTW0ndjY2Ou8B2S4U6CIGGT9+vWkp6cTCARwuVwUFhZSWlrKiRMn+M1vfsMXX3xBT08PPT09ZGVlXbDuuHHjQo9HjBjB2bNnQ8/j4+NDj2NjYzl37hy9vb2cOHECn8/H8uXLQ68HAgFmzJgRev7tm7iJXG8KFBGDmc1mbr/9dl588UU++eQTfv3rX7NgwQI2bdpEbGwsr776Kh0dHde8nYSEBCZMmEB5ebkBVYtcO82hiBgsGAzicrno7OwkKSmJrq4uLBYLsbGxNDU18d///d+GbGfatGmMGjWKPXv20N3dTSAQ4IsvvqCpqcmQ9xe5UhqhiBjkmWeewWw2YzKZGD9+PKtXryY5OZmVK1fy2muv8corr5CWlkZ2djadnZ3XvD2z2cyGDRt47bXXWL16NT09PUyaNIn777/fgN6IXDndD0VERAyhQ14iImIIBYqIiBhCgSIiIoZQoIiIiCEUKCIiYggFioiIGEKBIiIihlCgiIiIIf
4fY7oCVy/J+GAAAAAASUVORK5CYII=\n\",\n      \"text/plain\": [\n       \"<Figure size 432x288 with 1 Axes>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"ax = sns.countplot(banner_selected)\\n\",\n    \"ax.set(xlabel='Banner', ylabel='Count')\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Thus, we learned how to find the best advertisement banner by framing our problem as a multi-armed bandit problem.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "06. Case Study: The MAB Problem/README.md",
    "content": "# 6. Case Study: The MAB Problem\n* 6.1. The MAB Problem\n* 6.2. Creating Bandit in the Gym\n* 6.3. Epsilon-Greedy\n* 6.4. Implementing Epsilon-Greedy\n* 6.5. Softmax Exploration\n* 6.6. Implementing Softmax Exploration\n* 6.7. Upper Confidence Bound\n* 6.8. Implementing UCB\n* 6.9. Thompson Sampling\n* 6.10. Implementing Thompson Sampling\n* 6.11. Applications of MAB\n* 6.12. Finding the Best Advertisement Banner using Bandits\n* 6.13. Contextual Bandits\n"
  },
  {
    "path": "07. Deep learning foundations/.ipynb_checkpoints/7.05 Building Neural Network from scratch-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Building Neural Network from Scratch\\n\",\n    \"\\n\",\n    \"Putting together all the concepts we have learned so far, we will see how to build a neural network\\n\",\n    \"from scratch. We will learn how a neural network learns to perform the XOR gate\\n\",\n    \"operation. The XOR gate returns 1 only when exactly one of its inputs is 1; otherwise, it returns 0, as shown in\\n\",\n    \"the following figure:\\n\",\n    \"\\n\",\n    \"![image](images/1.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"To perform the XOR gate operation, we build a simple two-layer neural network, as shown\\n\",\n    \"in the following figure. As you can observe, we have an input layer with two nodes, a\\n\",\n    \"hidden layer with five nodes, and an output layer consisting of one node:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![image](images/2.png)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, import the libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"%matplotlib inline\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Prepare the data as shown in the XOR table above:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"x = np.array([ [0, 1], [1, 0], [1, 1],[0, 0] ])\\n\",\n    \"y = np.array([ [1], [1], [0], [0]])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of nodes in each layer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   
\"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"num_input = 2\\n\",\n    \"num_hidden = 5\\n\",\n    \"num_output = 1\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the weights and biases randomly. First, we initialize the input-to-hidden layer weights:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"Wxh = np.random.randn(num_input,num_hidden)\\n\",\n    \"bh = np.zeros((1,num_hidden))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, initialize the hidden-to-output layer weights:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"Why = np.random.randn(num_hidden,num_output)\\n\",\n    \"by = np.zeros((1,num_output))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the sigmoid activation function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def sigmoid(z):\\n\",\n    \"    return 1 / (1+np.exp(-z))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the derivative of the sigmoid function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def sigmoid_derivative(z):\\n\",\n    \"    return np.exp(-z)/((1+np.exp(-z))**2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the forward propagation:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def forward_prop(x,Wxh,Why):\\n\",\n    \"    z1 = np.dot(x,Wxh) + bh\\n\",\n    \"    a1 = sigmoid(z1)\\n\",\n    \"    z2 = np.dot(a1,Why) + by\\n\",\n    \"    y_hat = sigmoid(z2)\\n\",\n    \"    \\n\",\n    \"    return z1,a1,z2,y_hat\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the backward propagation:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def backward_prop(y_hat, z1, a1, z2):\\n\",\n    \"    delta2 = np.multiply(-(y-y_hat),sigmoid_derivative(z2))\\n\",\n    \"    dJ_dWhy = np.dot(a1.T, delta2)\\n\",\n    \"    delta1 = np.dot(delta2,Why.T)*sigmoid_derivative(z1)\\n\",\n    \"    dJ_dWxh = np.dot(x.T, delta1) \\n\",\n    \"\\n\",\n    \"    return dJ_dWxh, dJ_dWhy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the cost function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def cost_function(y, y_hat):\\n\",\n    \"    J = 0.5*sum((y-y_hat)**2)\\n\",\n    \"    \\n\",\n    \"    return J\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the learning rate and the number of training iterations:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"alpha = 0.01\\n\",\n    \"num_iterations = 5000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now let's start training the network:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 12,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"cost = []\\n\",\n    \"for i in range(num_iterations):\\n\",\n    \"    \\n\",\n    \"    #perform forward propagation and predict output\\n\",\n    \"    z1,a1,z2,y_hat = forward_prop(x,Wxh,Why)\\n\",\n    \"    \\n\",\n    \"    #perform backward propagation and calculate gradients\\n\",\n    \"    dJ_dWxh, dJ_dWhy = backward_prop(y_hat, z1, a1, z2)\\n\",\n    \"        \\n\",\n    \"    #update the weights\\n\",\n    \"    Wxh = Wxh - alpha * dJ_dWxh\\n\",\n    \"    Why = Why - alpha * dJ_dWhy\\n\",\n    \"    \\n\",\n    \"    #compute cost\\n\",\n    \"    c = cost_function(y, y_hat)\\n\",\n    \"    \\n\",\n    \"    #store the cost\\n\",\n    \"    cost.append(c)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Plot the cost function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Text(0,0.5,'Cost')\"\n      ]\n     },\n     \"execution_count\": 13,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    },\n    {\n     \"data\": {\n      \"image/png\": 
\"iVBORw0KGgoAAAANSUhEUgAAAYsAAAEWCAYAAACXGLsWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3XuYXXV97/H3Z+6TmczkMpNJSLgkEEAQpDIGENB4waKnQq0exHrDVmNPS61aPQ85WmzRnnr09IitaX0QqaAV8Hi8BI1cFEaUAiZQbgkEknBJAiH3y0ySuX7PH2vNZGczt8xkzZ6Z/Xk9z372Wr912d/vEPZ3/9bltxQRmJmZDaak0AGYmdn452JhZmZDcrEwM7MhuViYmdmQXCzMzGxILhZmZjYkFwuzSUBSq6QFhY7DJi8XC5s0JP2xpFXpF+dLkn4h6YJR7vM5SW8dZPliST3pZ/a+bhvNZw4jphZJH81ti4jaiNiQ5edacSsrdABmR4OkTwNXAX8G3AF0ABcDlwK/zfjjX4yIeRl/hllBuWdhE56keuAa4C8i4kcR0RYRnRFxW0R8Nl2nUtK1kl5MX9dKqkyXNUj6maTdknZK+o2kEknfBY4Dbkt7DP/9COP6jqQv5cwvlrQpZ/45SZ+R9JikPZJulVSVs/xSSY9I2itpvaSLJf09cCHwjTSmb6TrhqSTev8ekm6StE3S85I+L6kkXXaFpN9K+t+Sdkl6VtLbR/aXt2LiYmGTwXlAFfDjQdb5HHAucBbwGmAR8Pl02V8Dm4BGoAn4H0BExAeBF4B3pod5vpJB7JeR9IDmA2cCVwBIWgTcBHwWmAa8AXguIj4H/Aa4Mo3pyn72+c9APbAAeCPwIeAjOcvPAdYCDcBXgG9L0lHPzCYVFwubDGYC2yOia5B13g9cExFbI2Ib8HfAB9NlncAc4Pi0R/KbOLJB045JeyW9r8uOYNt/iogXI2IncBtJMQP4U+CGiLgrInoiYnNEPDXUziSVApcDSyNiX0Q8B/wjh3IFeD4ivhUR3cCNJLk3HUHMVoRcLGwy2AE0SBrsHNwxwPM588+nbQBfBdYBd0raIOmqI/z8FyNiWs7rB0ew7Zac6f1AbTp9LLD+COOApLdQzitzndvfZ0bE/nSyFrNBuFjYZHA/0A784SDrvAgcnzN/XNpG+gv8ryNiAXAJ8GlJb0nXG82wzG3AlJz52Uew7UbgxAGWDRbTdpKeUn6um4/gs81ewcXCJryI2ANcDSyT9IeSpkgql/R2Sb3nGW4GPi+pUVJDuv73ACT9gaST0uP2e4BuoCfd7mWSY/8j8QjwDkkzJM0GPnkE234b+Iikt6Qn2+dKOnWomNJDSz8A/l7SVEnHA58mzdVspFwsbFKIiH8k+VL8PLCN5Jf5lcBP0lW+BKwCHgMeBx5O2wAWAr8EWkl6Kf8SEfeky/6BpMjslvSZIwzru8CjwHPAncCtR5DP70hOSn+NpID9mkO9ha8D70mvZvqnfjb/S5JezQaSy4a/D9xwhLGbHUZ++JGZmQ3FPQszMxuSi4WZmQ0p02KR3nG6VtK6gS5HlHSZpDWSVkv6ft6yOkmbeu9SNTOzwshsbKj05qBlwEUkd8eulLQ8ItbkrLMQWAqcHxG7JM3K280XgXuzitHMzIYny4EEFwHrekfClHQLyaBua3LW+RiwLCJ2AUTE1t4Fks4muav0dqB5qA9raGiIE044YcTBtrW1UVNTM+LtJ6Jiy7nY8gXnXCxGk/NDDz20PSIah1ovy2Ixl+TyxV6bSMakyXUygKT7gFLgbyPi9nTQs38EPgAMODx0rhNOOIFVq1aNONiWlhYWL1484u0nomLLudjyBedcLEaTs6Tnh16r8EOUl5Fc474YmAfcK+kMkiKxIiI2DTa+maQlwBKApqYmWlpaRhxIa2vrqLafiIot52LLF5xzsRiLnLMsFptJxrfpNY9XDjmwCXgwIjqBZyU9TV
I8zgMulPTnJGPWVEhqjYjDTpJHxHXAdQDNzc0xml8T/jUy+RVbvuCci8VY5Jzl1VArgYWS5kuqIBkJc3neOj8h6VWQDsFwMrAhIt4fEcdFxAnAZ4Cb8guFmZmNncyKRTpc9JUkTy17EvhBRKyWdI2kS9LV7gB2SFoD3AN8NiJ2ZBWTmZmNTKbnLCJiBbAir+3qnOkgGc/n04Ps4zvAd7KJ0MzMhsN3cJuZ2ZBcLMzMbEhFXyz2Huzka3c9zYbd3YUOxcxs3Cr6YhE98PVfPcPTu3qGXtnMrEgVfbGoqy6jvFTs7fBzPczMBlL0xUISM2sqXSzMzAZR9MUCoGFqhYuFmdkgXCyAmTWV7Gt3sTAzG4iLBdBQW8ke9yzMzAbkYgE01CaHoZIbys3MLJ+LBTCztoKuHtjX3lXoUMzMxiUXC5LDUAA7WjsKHImZ2fjkYgHMTIvF9tb2AkdiZjY+uViQnLMA2OFiYWbWLxcLDh2G2u7DUGZm/XKxAGbUJD0LH4YyM+ufiwVQXlpCTblPcJuZDcTFIlVfIfcszMwG4GKRmloh9yzMzAbgYpGqr3TPwsxsIC4Wqak+DGVmNqBMi4WkiyWtlbRO0lUDrHOZpDWSVkv6ftp2lqT707bHJL03yzgB6irE3oNddHT5iXlmZvnKstqxpFJgGXARsAlYKWl5RKzJWWchsBQ4PyJ2SZqVLtoPfCginpF0DPCQpDsiYndW8dZXCoAdbe3Mqa/O6mPMzCakLHsWi4B1EbEhIjqAW4BL89b5GLAsInYBRMTW9P3piHgmnX4R2Ao0ZhgrUyvSYuGT3GZmr5BZzwKYC2zMmd8EnJO3zskAku4DSoG/jYjbc1eQtAioANbnf4CkJcASgKamJlpaWkYcbHn3QUDc/R8r2d6Y5Z9l/GhtbR3V32yiKbZ8wTkXi7HIudDfimXAQmAxMA+4V9IZvYebJM0Bvgt8OCJecTIhIq4DrgNobm6OxYsXjziQrSvuBg5wzIJTWXz2vBHvZyJpaWlhNH+ziabY8gXnXCzGIucsD0NtBo7NmZ+XtuXaBCyPiM6IeBZ4mqR4IKkO+DnwuYh4IMM4geQEN3jIDzOz/mRZLFYCCyXNl1QBXA4sz1vnJyS9CiQ1kByW2pCu/2Pgpoj4YYYx9qksharyEo88a2bWj8yKRUR0AVcCdwBPAj+IiNWSrpF0SbraHcAOSWuAe4DPRsQO4DLgDcAVkh5JX2dlFSuAJBpqK32C28ysH5mes4iIFcCKvLarc6YD+HT6yl3ne8D3soytPzNrK9nmnoWZ2Sv4Du4cjbUV7lmYmfXDxSLHzJpKn+A2M+uHi0WOmbUV7GzroKcnCh2Kmdm44mKRo6G2kq6eYM+BzkKHYmY2rrhY5JhZmzxedUebD0WZmeVyscjRWFsJwHaf5DYzO4yLRY6ZabHYts89CzOzXC4WOWZNdbEwM+uPi0WOaVPKKS+Vb8wzM8vjYpFDEo21lWzd62JhZpbLxSJP49RKtu47WOgwzMzGFReLPI1Tq3zOwswsj4tFnll1lS4WZmZ5XCzyNNZWsqOtg87uVzyYz8ysaLlY5JlVl1w+69FnzcwOcbHIM2tqFYBPcpuZ5XCxyNOY3pjny2fNzA5xscjTdxe3b8wzM+vjYpGnodY9CzOzfC4WeSrKSpg+pZxtrT5nYWbWy8WiH7OmVrlnYWaWI9NiIeliSWslrZN01QDrXCZpjaTVkr6f0/5hSc+krw9nGWe+WXWVbPWNeWZmfcqy2rGkUmAZcBGwCVgpaXlErMlZZyGwFDg/InZJmpW2zwC+ADQDATyUbrsrq3hzNdZWsmFb21h8lJnZhJBlz2IRsC4iNkREB3ALcGneOh8DlvUWgYjYmrb/PnBXROxMl90FXJxhrIdpTIf8iIix+kgzs3Ets54FMBfYmDO/CTgnb52TASTdB5QCfxsRtw+w7dz8D5C0BFgC0NTUREtLy4iDbW1t7dt+78uddHT38PO7Wqit0Ij3Od
7l5lwMii1fcM7FYixyzrJYDPfzFwKLgXnAvZLOGO7GEXEdcB1Ac3NzLF68eMSBtLS00Lv93kdf5Oan/pNTXtPMwqapI97neJebczEotnzBOReLscg5y8NQm4Fjc+bnpW25NgHLI6IzIp4FniYpHsPZNjO9N+b5JLeZWSLLYrESWChpvqQK4HJged46PyHpVSCpgeSw1AbgDuBtkqZLmg68LW0bE31Dfnh8KDMzIMPDUBHRJelKki/5UuCGiFgt6RpgVUQs51BRWAN0A5+NiB0Akr5IUnAAromInVnFmq9vyA/3LMzMgIzPWUTECmBFXtvVOdMBfDp95W97A3BDlvENpLayjOryUt+YZ2aW8h3c/ZCUPovbxcLMDFwsBtRUV8nLe33OwswMXCwG1FRX5WJhZpZysRjA7Loqtuw96Lu4zcxwsRjQ7PoqDnb2sPdAV6FDMTMrOBeLAcyuT57F/dLeAwWOxMys8FwsBjC7LikWW/b4vIWZmYvFAHp7Fj7JbWbmYjGgWVPTw1DuWZiZuVgMpKKshIZa32thZgYuFoOaXV/pcxZmZrhYDGp2XZUPQ5mZ4WIxqNn1vovbzAxcLAY1u66KXfs7OdjZXehQzMwKysViEE11vnzWzAxcLAY1p74a8I15ZmYuFoOYXZ88MW+LexZmVuRcLAYx2z0LMzPAxWJQtZVl1FaWuWdhZkXPxWIITXW+Mc/MLNNiIeliSWslrZN0VT/Lr5C0TdIj6eujOcu+Imm1pCcl/ZMkZRnrQObUV7tnYWZFL7NiIakUWAa8HTgNeJ+k0/pZ9daIOCt9XZ9u+3rgfOBM4NXA64A3ZhXrYGbXV/HSbhcLMytuWfYsFgHrImJDRHQAtwCXDnPbAKqACqASKAdeziTKIcydVs3L+w7S0dVTiI83MxsXyjLc91xgY878JuCcftZ7t6Q3AE8Dn4qIjRFxv6R7gJcAAd+IiCfzN5S0BFgC0NTUREtLy4iDbW1t7Xf7fS93EgE/vbOFximT6xTPQDlPVsWWLzjnYjEWOWdZLIbjNuDmiGiX9HHgRuDNkk4CXgXMS9e7S9KFEfGb3I0j4jrgOoDm5uZYvHjxiANpaWmhv+3L123nhiceZN4pr+G8E2eOeP/j0UA5T1bFli8452IxFjln+VN5M3Bszvy8tK1PROyIiPZ09nrg7HT6XcADEdEaEa3AL4DzMox1QHOnJfdabN7tZ3GbWfHKslisBBZKmi+pArgcWJ67gqQ5ObOXAL2Hml4A3iipTFI5ycntVxyGGgtzpiXjQ73oYmFmRSyzw1AR0SXpSuAOoBS4ISJWS7oGWBURy4FPSLoE6AJ2Alekm/8QeDPwOMnJ7tsj4rasYh1MZVkpjVMr2bzLxcLMilem5ywiYgWwIq/t6pzppcDSfrbrBj6eZWxHYu60ah+GMrOiNrku78nI3OkuFmZW3FwshmFe2rOIiEKHYmZWEC4Ww3DMtGo6unrY3tpR6FDMzArCxWIYfPmsmRU7F4thmDs9KRa+fNbMitWwioWk7w6nbbI6prdn4ctnzaxIDbdncXruTDqi7NkDrDvp1FeXM7WyzIehzKxoDVosJC2VtA84U9Le9LUP2Ar8dEwiHCfmTq9mk3sWZlakBi0WEfEPETEV+GpE1KWvqRExM72hrmj4xjwzK2bDPQz1M0k1AJI+IOn/SDo+w7jGnXnTq9m0c7/vtTCzojTcYvGvwH5JrwH+GlgP3JRZVOPQcTNr2Nfexa79nYUOxcxszA23WHRF8pP6UpIHES0DpmYX1vhz/IwpALywc3+BIzEzG3vDLRb7JC0FPgj8XFIJyaNOi8bxM5Ni8fyOtgJHYmY29oZbLN4LtAN/EhFbSB5k9NXMohqHju3tWexwz8LMis+wikVaIP4dqJf0B8DBiCiqcxZV5aU01VXyvA9DmVkRGu4d3JcBvwP+K3AZ8KCk92QZ2Hh0/Iwa9yzMrCgN9+FHnwNeFxFbASQ1Ar8keaJd0Th2xhTuW7
e90GGYmY254Z6zKOktFKkdR7DtpHH8zCls2XuQg53dhQ7FzGxMDbdncbukO4Cb0/n3kve41GLQe0XUxp37WdhUVFcOm1mRG7RYSDoJaIqIz0r6I+CCdNH9JCe8i8pxM3ovn3WxMLPiMlTP4lpgKUBE/Aj4EYCkM9Jl78w0unGmr1j4iigzKzJDnXdoiojH8xvTthOG2rmkiyWtlbRO0lX9LL9C0jZJj6Svj+YsO07SnZKelLRG0pCfl7UZNRXUVpax0cXCzIrMUD2LaYMsqx5sw/SZF8uAi4BNwEpJyyNiTd6qt0bElf3s4ibg7yPiLkm1QM8QsWZOEsfNmMJzvovbzIrMUD2LVZI+lt+Y9gAeGmLbRcC6iNgQER3ALSRjSw1J0mlAWUTcBRARrRExLn7Oz2+s4dntLhZmVlyG6ll8EvixpPdzqDg0AxXAu4bYdi6wMWd+E3BOP+u9W9IbgKeBT0XERuBkYLekHwHzSe7puCoiDrtmVdISYAlAU1MTLS0tQ4Q0sNbW1mFtX9rWwQs7Ornr7nsoL9GIP288GG7Ok0Wx5QvOuViMRc6DFouIeBl4vaQ3Aa9Om38eEXcfpc+/Dbg5ItolfRy4EXhzGteFwO8BLwC3AlcA386L7zrgOoDm5uZYvHjxiANpaWlhONvvmbaZ5esf4fjTmzl5gl8RNdycJ4tiyxecc7EYi5yHOzbUPRHxz+lruIViM3Bszvy8tC13vzsioj2dvZ5Dz/XeBDySHsLqAn4CvHaYn5upBQ21AGzY1lrgSMzMxk6Wd2GvBBZKmi+pArgcWJ67gqQ5ObOXAE/mbDstHVYEkt5G/onxgljQWAPA+m0+b2FmxWO4d3AfsYjoknQlcAdQCtwQEaslXQOsiojlwCckXQJ0ATtJDjUREd2SPgP8SpJIzpd8K6tYj0RNZRlz6qtYv9U9CzMrHpkVC4CIWEHesCARcXXO9FLSm/762fYu4Mws4xupBY01rPcVUWZWRIpuMMCj4cTGWjZsbSV50qyZ2eTnYjECCxpq2NfexbbW9qFXNjObBFwsRuDEWckVUeu3+lCUmRUHF4sROLExLRa+fNbMioSLxQjMrqtiSkUp63xFlJkVCReLESgpESc3TeWpLXsLHYqZ2ZhwsRihU2dPZe2Wfb4iysyKgovFCJ06eyq79neybZ+viDKzyc/FYoROmV0HwJNb9hU4EjOz7LlYjNCps5MRZ9f6vIWZFQEXixGaXlNBU10lT7lnYWZFwMViFE6ZXcdaFwszKwIuFqNw6uypPLO1la7ugj8e3MwsUy4Wo3BK01Q6unp4boeH/TCzyc3FYhReNSe5Imr1iz7JbWaTm4vFKCxsqqWyrITHN+0pdChmZplysRiF8tISTjumjsc2u1iY2eTmYjFKZ86tZ/XmPXT3eNgPM5u8XCxG6Yx502jr6ObZ7R6B1swmLxeLUTpzXj0Aj/m8hZlNYi4Wo3RiYy3V5aUuFmY2qWVaLCRdLGmtpHWSrupn+RWStkl6JH19NG95naRNkr6RZZyjUVoiXj23jsd9ktvMJrHMioWkUmAZ8HbgNOB9kk7rZ9VbI+Ks9HV93rIvAvdmFePRcsbcaax+cQ+dvpPbzCapLHsWi4B1EbEhIjqAW4BLh7uxpLOBJuDOjOI7as4+fjoHO3t8c56ZTVplGe57LrAxZ34TcE4/671b0huAp4FPRcRGSSXAPwIfAN460AdIWgIsAWhqaqKlpWXEwba2to54+46DSY/ill+uZPf88hHHMNZGk/NEVGz5gnMuFmORc5bFYjhuA26OiHZJHwduBN4M/DmwIiI2SRpw44i4DrgOoLm5ORYvXjziQFpaWhjN9l977B52lU1l8eLmEe9jrI0254mm2PIF51wsxiLnLIvFZuDYnPl5aVufiNiRM3s98JV0+jzgQkl/DtQCFZJaI+IVJ8nHi+YTpvPrtduICAYrcGZmE1GW5yxWAgslzZdUAVwOLM9dQdKcnNlLgCcBIuL9EX
FcRJwAfAa4aTwXCoDXnTCDHW0dPLvdI9Ca2eSTWc8iIrokXQncAZQCN0TEaknXAKsiYjnwCUmXAF3ATuCKrOLJ2utOmA7Ayud2sqCxtsDRmJkdXZmes4iIFcCKvLarc6aXAkuH2Md3gO9kEN5RdWJjLdOnlPPgszt57+uOK3Q4ZmZHle/gPkok8foTG7hv3XYiPKigmU0uLhZH0YULG3h5bzvPbPWggmY2ubhYHEUXLGwA4DfPbC9wJGZmR5eLxVE0b/oUFjTU8JtnthU6FDOzo8rF4ii7YGEDD27YSXtXd6FDMTM7alwsjrILFzZyoLObVc/tKnQoZmZHjYvFUXb+STOpLCvhrjUvFzoUM7OjxsXiKJtSUcaFCxu5c/UWX0JrZpOGi0UG3nZ6Ey/uOcgTmz1kuZlNDi4WGXjrq5ooEdyxekuhQzEzOypcLDIwo6aCRfNncLsPRZnZJOFikZH/cuYxrNvaypqXfCjKzCY+F4uMvPPMOZSXih89vHnolc3MxjkXi4xMm1LBW05t4qePbKaru6fQ4ZiZjYqLRYb+6LVz2d7a4bGizGzCc7HI0OJTZjGjpoLv/+6FQodiZjYqLhYZqigr4X2LjuVXT77Mxp37Cx2OmdmIuVhk7APnHo8kvvvA84UOxcxsxFwsMjanvpqLXz2bW373Avs7ugodjpnZiLhYjIE/OX8+ew928e8P+NyFmU1MmRYLSRdLWitpnaSr+ll+haRtkh5JXx9N28+SdL+k1ZIek/TeLOPM2tnHT+fChQ1889fr3bswswkps2IhqRRYBrwdOA14n6TT+ln11og4K31dn7btBz4UEacDFwPXSpqWVaxj4ZNvXciOtg6+53MXZjYBZdmzWASsi4gNEdEB3AJcOpwNI+LpiHgmnX4R2Ao0ZhbpGDj7+BlcuLCBf2lZz+79HYUOx8zsiCirge4kvQe4OCJ6Dy19EDgnIq7MWecK4B+AbcDTwKciYmPefhYBNwKnR0RP3rIlwBKApqams2+55ZYRx9va2kptbe2Itx+Ojft6+MJ/HGDxsWV86LTKTD9rOMYi5/Gk2PIF51wsRpPzm970pocionnIFSMikxfwHuD6nPkPAt/IW2cmUJlOfxy4O2/5HGAtcO5Qn3f22WfHaNxzzz2j2n64vvDTJ2L+VT+LJzbvHpPPG8xY5TxeFFu+Ec65WIwmZ2BVDOM7PcvDUJuBY3Pm56VtuYVqR0S0p7PXA2f3LpNUB/wc+FxEPJBhnGPqU289mRk1FXz2/z5GR5fHjDKziSHLYrESWChpvqQK4HJgee4KkubkzF4CPJm2VwA/Bm6KiB9mGOOYq59Szj/80ZmseWkv1/7y6UKHY2Y2LJkVi4joAq4E7iApAj+IiNWSrpF0SbraJ9LLYx8FPgFckbZfBrwBuCLnstqzsop1rF10WhOXNc/jm79ez289yKCZTQBlWe48IlYAK/Lars6ZXgos7We77wHfyzK2Qrv6nafzyMbd/MX3H2b5ledz/MyaQodkZjYg38FdILWVZXzrQ81I8Kc3rmJnmy+nNbPxy8WigI6fWcM3P3A2G3fu5wPXP8ie/Z2FDsnMrF8uFgV27oKZXPehZtZtbeWPr3+Al/ceLHRIZmav4GIxDrzx5Ea+9eFmntvexruW3ceTL+0tdEhmZodxsRgn3nhyIz/4s/PoCfjDZffx3Qee770x0cys4FwsxpHTj6ln+V+ez7kLZvI3P3mCj3xnJc/vaCt0WGZmLhbjzaypVfzbFa/jC+88jZXP7uSir93LV+94yie/zaygXCzGoZIS8ZHz53P3ZxbzjlfPZtk96zn/f93NV25/iq0+AW5mBeBiMY411VVx7eW/xy/+6kLeeEoj//rr9Zz35btZctMq7nlqK53dHlvKzMZGpndw29Hxqjl1LPvj1/Ls9jZu+d0L/PChTdy55mXqq8t5y6tm8funz+b8kxqorfR/TjPLhr9dJpD5DTUsfcer+Ou3nULL2q3cvnoLv3
pyKz96eDOlJeKMufWcd+JMzl0wk9fMq2falIpCh2xmk4SLxQRUUVbC206fzdtOn01ndw8rn93Jf6zfwf0bdvCtezfwry3rAThuxhTOmFfPGXPredWcOhbOqmVOfRWSCpyBmU00LhYTXHlpCa8/qYHXn9QAQFt7F//5wm4e37yHxzfv5tGNu/n5Yy/1rV9TUcqJs2o5aVYtJa0d7J/5EsdOn8KxM6qpry53ITGzfrlYTDI1lWVcsLCBCxY29LXtbOvg6Zf3sW5ra9/rP9btYMveTn749MN9602tLGPejCkcO72aY2dMYd70aubUV9FUV8Xs+ioaayspK/U1EWbFyMWiCMyoqeDcBcm5jFy/+OU9HHfaa9m48wCbdu1n4879bNx1gA3b27j3mW0c7Dz8aqsSQUNt5WEFpKmuitl1VTRMrWRmTQWNUyuZUVNBuYuK2aTiYlHEqsvE6cfUc/ox9a9YFhHsaOtgy56DvLz3IFv2HuTlPQd5aU8y/dyONh7YsIO9B7v63fe0KeXMrKmgobYyfVUw87DpCuqrK5g+pZz66nL3WMzGORcL65ekvi/6V899ZTHptb+ji61729nR1s62fR1sb21nR2v63tbO9n0dPPnSXra3tg9YWACmVpUxfUpaPNL36VMqqK8uT6ZrKpg2pYJp1eVMm1JOXVU5U6vKXGTMxoiLhY3KlIoyTmgo44SGoZ/0197Vzc62Drbv62BHWzt7DnSyq62DXfs7k+n96fT+Dp7b3sau/R3sG6TAAFSXl1JXXcbUtHj0FpGpVeXUVZVRV907X8bzW7uY8uzOvvnayjKmVJRRUeaCYzYUFwsbM5Vlpcypr2ZOffWwt+nq7mHPgU52H+hk9/4OdrUl0/sOdrLvYBf7Dnay90AX+9qT9937O9i4cz97D3ay92AXHV2Hn3e59uH7X/EZFaUlTKkspaaijJrKUqak78l8GVMqSvsKS/7yKZXJsuryUqrKS6muKO2bLi3xlWU2ebhY2LhWVlrCzNpKZtZWjmj7g53dfUWl5b4HWXj6mew72MXeA53s7+imrb2Lto5u9nd00drexf72bto6umhr72JHawdtHUlba3sX7V1HNrxKRWkJVeUlVFekhST2RFh1AAALlUlEQVQtIlXlJVSnhaWq/NCy6nRZbtGpLCulsqyEirKSvvdkOmmvzJmvKCtxgbLMuFjYpNb7Zdw4tZIXppVy4cLGEe+rq7uH/Z1pgWk/VGDa2rs52NnNgc7k/WBnNwc6eg6f7+zmQEfy3t7Zw/bWjrz1k2U9o3yESVmJcgpKCT2dHdQ//GsqSkuoLC9J30uT97wCVFZSQnmZqCg9NF1eUkJ5qSgrTbYtKxXlpUlbeWkJZel0Rc50srykL5ayElFeVtK3r9IS+X6eCSjTYiHpYuDrQClwfUR8OW/5FcBXgc1p0zci4vp02YeBz6ftX4qIG7OM1WwoZaUl1JWWUFdVnsn+I4LO7nhFkeno6ul7tfe9uvvmc987upNi1NHdQ3tnDy+8+CLTZ9Yetu3eA53p+t19bZ3dPXR1Bx3dyXTWz92qKD1UhHoLSFlJSfqezJeWiLJSUVpS0td2+HvaXnp4+9Yt7fxq9xOH1ivNWz9/P6X9tZf0xVBaklzwUapkvkSH2kty2gZq793PobZkZOn8/ZWIcV1EMysWkkqBZcBFwCZgpaTlEbEmb9VbI+LKvG1nAF8AmoEAHkq33ZVVvGaFJomKsuTXeH310SlILS07Wbz47CPerrsn6Ow+VEQ6u3vo7Ak608LSmbZ19fTQ0RV09eS1p4Wnb9vDlvXQ0R109bb3BD09QVdP0N33nmzbfVh7sv6BznS++1B77nr7D3bz2K6X6Oruydt+/D95skT09bxKc4rIKwvO4YWnsfQgixdnG1uWPYtFwLqI2AAg6RbgUiC/WPTn94G7ImJnuu1dwMXAzRnFamY5ki+n5BDeRNPS0sLifr45I4Ke4PDi0n14McotLp3dPfT0QE8E3Z
EUtO6e3mle0RYRdPfT3pPz3hNJIe6JvOW520WyvLf9lev2fh5902rtyPzvmmWxmAtszJnfBJzTz3rvlvQG4GngUxGxcYBt5+ZvKGkJsASgqamJlpaWEQfb2to6qu0nomLLudjyBeecNZF8iY7qi7SEUT9ZqLW1I/OcC32C+zbg5ohol/Rx4EbgzcPdOCKuA64DaG5ujv5+TQzXQL9GJrNiy7nY8gXnXCzGIucs70baDBybMz+PQyeyAYiIHRHRns5eD5w93G3NzGzsZFksVgILJc2XVAFcDizPXUHSnJzZS4An0+k7gLdJmi5pOvC2tM3MzAogs8NQEdEl6UqSL/lS4IaIWC3pGmBVRCwHPiHpEqAL2AlckW67U9IXSQoOwDW9J7vNzGzsZXrOIiJWACvy2q7OmV4KLB1g2xuAG7KMz8zMhscjqJmZ2ZBcLMzMbEguFmZmNiRF1oPAjBFJ24DnR7GLBmD7UQpnoii2nIstX3DOxWI0OR8fEUOOsDlpisVoSVoVEc2FjmMsFVvOxZYvOOdiMRY5+zCUmZkNycXCzMyG5GJxyHWFDqAAii3nYssXnHOxyDxnn7MwM7MhuWdhZmZDcrEwM7MhFX2xkHSxpLWS1km6qtDxjIakGyRtlfRETtsMSXdJeiZ9n562S9I/pXk/Jum1Odt8OF3/mfRZ6OOWpGMl3SNpjaTVkv4qbZ+0eUuqkvQ7SY+mOf9d2j5f0oNpbremoz0jqTKdX5cuPyFnX0vT9rWSfr8wGQ2PpFJJ/ynpZ+n8ZM/3OUmPS3pE0qq0rXD/riN9HGAxvkhGw10PLAAqgEeB0wod1yjyeQPwWuCJnLavAFel01cB/yudfgfwC5KHfZ0LPJi2zwA2pO/T0+nphc5tkJznAK9Np6eSPHHxtMmcdxp7bTpdDjyY5vID4PK0/ZvAf0un/xz4Zjp9Oclz70n/To8ClcD89P+F0kLnN0jenwa+D/wsnZ/s+T4HNOS1FezfdbH3LPqeEx4RHUDvc8InpIi4l2So91yXkjyBkPT9D3Pab4rEA8C09Pkifc8/j4hdQO/zz8eliHgpIh5Op/eRPBNlLpM47zT21nS2PH0FyVMmf5i25+fc+7f4IfAWSUrbb4mI9oh4FlhH8v/EuCNpHvBfSB6SRhr/pM13EAX7d13sxWJYz/qe4Joi4qV0egvQlE4PlPuE/Zukhxt+j+SX9qTOOz0k8wiwleQLYD2wOyK60lVy4+/LLV2+B5jJxMr5WuC/Az3p/Ewmd76Q/AC4U9JDkpakbQX7d13oZ3DbGIqIkDQpr5WWVAv8P+CTEbE3+SGZmIx5R0Q3cJakacCPgVMLHFJmJP0BsDUiHpK0uNDxjKELImKzpFnAXZKeyl041v+ui71nUQzP+n457Y72PsZ2a9o+UO4T7m8iqZykUPx7RPwobZ70eQNExG7gHuA8kkMPvT8Ac+Pvyy1dXg/sYOLkfD5wiaTnSA4Vvxn4OpM3XwAiYnP6vpXkB8EiCvjvutiLxZDPCZ8ElgO9V0B8GPhpTvuH0qsozgX2pN3bCfX88/RY9LeBJyPi/+QsmrR5S2pMexRIqgYuIjlXcw/wnnS1/Jx7/xbvAe6O5OzncuDy9Oqh+cBC4Hdjk8XwRcTSiJgXESeQ/D96d0S8n0maL4CkGklTe6dJ/j0+QSH/XRf6jH+hXyRXETxNcsz3c4WOZ5S53Ay8BHSSHJv8U5Jjtb8CngF+CcxI1xWwLM37caA5Zz9/QnLybx3wkULnNUTOF5Ac230MeCR9vWMy5w2cCfxnmvMTwNVp+wKSL791wP8FKtP2qnR+Xbp8Qc6+Ppf+LdYCby90bsPIfTGHroaatPmmuT2avlb3fjcV8t+1h/swM7MhFfthKDMzGwYXCzMzG5KLhZmZDcnFwszMhuRiYWZmQ3KxsAlF0sx0FM5HJG2RtDlnvmKY+/g3SacMsc5fSHr/UYr5t5LOklSiozyysaQ/kTQ7Z37I3MxGwpfO2oQl6W
+B1oj433ntIvm33dPvhmNM0m+BK0nuidgeEdOOcPvSSIb3GHDfEfHI6CM1G5h7FjYpSDpJyTMt/p3kJqY5kq6TtErJMx+uzlm395d+maTdkr6s5NkQ96fj8CDpS5I+mbP+l5U8Q2KtpNen7TWS/l/6uT9MP+usQcL8MjA17QXdlO7jw+l+H5H0L2nvozeuayU9BiyS9HeSVkp6QtI30zt13wucBdza27PqzS3d9weUPA/hCUn/M20bLOfL03UflXTPUf5PZBOci4VNJqcCX4uI0yIZV+eqiGgGXgNcJOm0frapB34dEa8B7ie527U/iohFwGeB3sLzl8CWiDgN+CLJiLeDuQrYFxFnRcSHJL0aeBfw+og4i2Rgz8tz4ro3Is6MiPuBr0fE64Az0mUXR8StJHesvzfdZ0dfsMmQ3l8C3pTGdb6SAfkGy/kLwFvS9ncNkYsVGRcLm0zWR8SqnPn3SXoYeBh4FcnDb/IdiIhfpNMPAScMsO8f9bPOBSQD2xERvcMyHIm3Aq8DVikZbvyNwInpsg6SweN6vUXS70iGf3gjcPoQ+z6HZEyk7RHRSfLQoDekywbK+T7gJkkfxd8NlsdDlNtk0tY7IWkh8FfAoojYLel7JGMG5evIme5m4P8n2oexzpEScENE/M1hjclIqQeid9AfaQrwDZInAm6W9CX6z2W4Bsr5YyRF5g+AhyX9XiQPzDHzrwebtOqAfcBeHXpi2NF2H3AZgKQz6L/n0ifSB/Xo0LDavwQuk9SQts+UdFw/m1aTPPRnezoS6btzlu0jeZxsvgeBN6X77D289esh8lkQyVPW/gbYxfh+MJCNMfcsbLJ6GFgDPAU8T/LFfrT9M8lhmzXpZ60heSrbYL4NPCZpVXre4u+AX0oqIRkt+M+AF3M3iIgdkm5M9/8SSSHo9W/A9ZIOkPOI0IjYJOlvgBaSHsxtEfHznELVn68pGbpbwJ0R8cQQuVgR8aWzZiOUfvGWRcTB9LDXncDCOPSoT7NJwz0Ls5GrBX6VFg0BH3ehsMnKPQszMxuST3CbmdmQXCzMzGxILhZmZjYkFwszMxuSi4WZmQ3p/wPKnDLWl/tfQQAAAABJRU5ErkJggg==\\n\",\n      \"text/plain\": [\n       \"<Figure size 432x288 with 1 Axes>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"plt.grid()\\n\",\n    \"plt.plot(range(num_iterations),cost)\\n\",\n    \"\\n\",\n    \"plt.title('Cost Function')\\n\",\n    \"plt.xlabel('Training Iterations')\\n\",\n    \"plt.ylabel('Cost')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you can notice, the loss decreases over the training iterations. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, we have learned how to build a neural network from scratch in the next chapter we will one of the popularly used deep learning libraries called TensorFlow. 
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "07. Deep learning foundations/7.05 Building Neural Network from scratch.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Building Neural Network from Scratch\\n\",\n    \"\\n\",\n    \"Putting all the concepts we have learned so far, we will see how to build a neural network\\n\",\n    \"from scratch. We will learn how the neural network learns to perform the XOR gate\\n\",\n    \"operation. The XOR gate returns 1 only when exactly only one of its inputs is 1 else it returns 0 as shown in\\n\",\n    \"the following figure:\\n\",\n    \"\\n\",\n    \"![image](images/1.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"To perform the XOR gate operation, we build a simple two-layer neural network as shown\\n\",\n    \"in the following figure. As you can observe, we have an input layer with two nodes, a\\n\",\n    \"hidden layer with five nodes and an output layers which consist of 1 node:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![image](images/2.png)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, import the libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"%matplotlib inline\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Prepare the data as shown in the above XOR table:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"x = np.array([ [0, 1], [1, 0], [1, 1],[0, 0] ])\\n\",\n    \"y = np.array([ [1], [1], [0], [0]])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of nodes in each layer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   
\"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"num_input = 2\\n\",\n    \"num_hidden = 5\\n\",\n    \"num_output = 1\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize weights and bias randomly. First, we initialize, input to hidden layer weights:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"Wxh = np.random.randn(num_input,num_hidden)\\n\",\n    \"bh = np.zeros((1,num_hidden))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now initialize, hidden to output layer weights:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"Why = np.random.randn (num_hidden,num_output)\\n\",\n    \"by = np.zeros((1,num_output))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the sigmoid activation function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def sigmoid(z):\\n\",\n    \"    return 1 / (1+np.exp(-z))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the derivative of the sigmoid function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def sigmoid_derivative(z):\\n\",\n    \"    return np.exp(-z)/((1+np.exp(-z))**2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the forward propagation:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def forward_prop(x,Wxh,Why):\\n\",\n    \"    z1 = np.dot(x,Wxh) + bh\\n\",\n    \"    a1 = sigmoid(z1)\\n\",\n    \"    z2 = np.dot(a1,Why) + by\\n\",\n    \"    y_hat = sigmoid(z2)\\n\",\n    \"    \\n\",\n    \"    return z1,a1,z2,y_hat\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the backward propagation:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def backword_prop(y_hat, z1, a1, z2):\\n\",\n    \"    delta2 = np.multiply(-(y-y_hat),sigmoid_derivative(z2))\\n\",\n    \"    dJ_dWhy = np.dot(a1.T, delta2)\\n\",\n    \"    delta1 = np.dot(delta2,Why.T)*sigmoid_derivative(z1)\\n\",\n    \"    dJ_dWxh = np.dot(x.T, delta1) \\n\",\n    \"\\n\",\n    \"    return dJ_dWxh, dJ_dWhy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the cost function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def cost_function(y, y_hat):\\n\",\n    \"    J = 0.5*sum((y-y_hat)**2)\\n\",\n    \"    \\n\",\n    \"    return J\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the learning rate and number of training iterations:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"alpha = 0.01\\n\",\n    \"num_iterations = 5000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now let's start training the network:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 12,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"cost = []\\n\",\n    \"for i in range(num_iterations):\\n\",\n    \"    \\n\",\n    \"    #perform forward propagation and predict output\\n\",\n    \"    z1,a1,z2,y_hat = forward_prop(x,Wxh,Why)\\n\",\n    \"    \\n\",\n    \"    #perform backward propagation and calculate gradients\\n\",\n    \"    dJ_dWxh, dJ_dWhy = backword_prop(y_hat, z1, a1, z2)\\n\",\n    \"        \\n\",\n    \"    #update the weights\\n\",\n    \"    Wxh = Wxh -alpha * dJ_dWxh\\n\",\n    \"    Why = Why -alpha * dJ_dWhy\\n\",\n    \"    \\n\",\n    \"    #compute cost\\n\",\n    \"    c = cost_function(y, y_hat)\\n\",\n    \"    \\n\",\n    \"    #store the cost\\n\",\n    \"    cost.append(c)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Plot the cost function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Text(0,0.5,'Cost')\"\n      ]\n     },\n     \"execution_count\": 13,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    },\n    {\n     \"data\": {\n      \"image/png\": 
\"iVBORw0KGgoAAAANSUhEUgAAAYsAAAEWCAYAAACXGLsWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3XuYXXV97/H3Z+6TmczkMpNJSLgkEEAQpDIGENB4waKnQq0exHrDVmNPS61aPQ85WmzRnnr09IitaX0QqaAV8Hi8BI1cFEaUAiZQbgkEknBJAiH3y0ySuX7PH2vNZGczt8xkzZ6Z/Xk9z372Wr912d/vEPZ3/9bltxQRmJmZDaak0AGYmdn452JhZmZDcrEwM7MhuViYmdmQXCzMzGxILhZmZjYkFwuzSUBSq6QFhY7DJi8XC5s0JP2xpFXpF+dLkn4h6YJR7vM5SW8dZPliST3pZ/a+bhvNZw4jphZJH81ti4jaiNiQ5edacSsrdABmR4OkTwNXAX8G3AF0ABcDlwK/zfjjX4yIeRl/hllBuWdhE56keuAa4C8i4kcR0RYRnRFxW0R8Nl2nUtK1kl5MX9dKqkyXNUj6maTdknZK+o2kEknfBY4Dbkt7DP/9COP6jqQv5cwvlrQpZ/45SZ+R9JikPZJulVSVs/xSSY9I2itpvaSLJf09cCHwjTSmb6TrhqSTev8ekm6StE3S85I+L6kkXXaFpN9K+t+Sdkl6VtLbR/aXt2LiYmGTwXlAFfDjQdb5HHAucBbwGmAR8Pl02V8Dm4BGoAn4H0BExAeBF4B3pod5vpJB7JeR9IDmA2cCVwBIWgTcBHwWmAa8AXguIj4H/Aa4Mo3pyn72+c9APbAAeCPwIeAjOcvPAdYCDcBXgG9L0lHPzCYVFwubDGYC2yOia5B13g9cExFbI2Ib8HfAB9NlncAc4Pi0R/KbOLJB045JeyW9r8uOYNt/iogXI2IncBtJMQP4U+CGiLgrInoiYnNEPDXUziSVApcDSyNiX0Q8B/wjh3IFeD4ivhUR3cCNJLk3HUHMVoRcLGwy2AE0SBrsHNwxwPM588+nbQBfBdYBd0raIOmqI/z8FyNiWs7rB0ew7Zac6f1AbTp9LLD+COOApLdQzitzndvfZ0bE/nSyFrNBuFjYZHA/0A784SDrvAgcnzN/XNpG+gv8ryNiAXAJ8GlJb0nXG82wzG3AlJz52Uew7UbgxAGWDRbTdpKeUn6um4/gs81ewcXCJryI2ANcDSyT9IeSpkgql/R2Sb3nGW4GPi+pUVJDuv73ACT9gaST0uP2e4BuoCfd7mWSY/8j8QjwDkkzJM0GPnkE234b+Iikt6Qn2+dKOnWomNJDSz8A/l7SVEnHA58mzdVspFwsbFKIiH8k+VL8PLCN5Jf5lcBP0lW+BKwCHgMeBx5O2wAWAr8EWkl6Kf8SEfeky/6BpMjslvSZIwzru8CjwHPAncCtR5DP70hOSn+NpID9mkO9ha8D70mvZvqnfjb/S5JezQaSy4a/D9xwhLGbHUZ++JGZmQ3FPQszMxuSi4WZmQ0p02KR3nG6VtK6gS5HlHSZpDWSVkv6ft6yOkmbeu9SNTOzwshsbKj05qBlwEUkd8eulLQ8ItbkrLMQWAqcHxG7JM3K280XgXuzitHMzIYny4EEFwHrekfClHQLyaBua3LW+RiwLCJ2AUTE1t4Fks4muav0dqB5qA9raGiIE044YcTBtrW1UVNTM+LtJ6Jiy7nY8gXnXCxGk/NDDz20PSIah1ovy2Ixl+TyxV6bSMakyXUygKT7gFLgbyPi9nTQs38EPgAMODx0rhNOOIFVq1aNONiWlhYWL1484u0nomLLudjyBedcLEaTs6Tnh16r8EOUl5Fc474YmAfcK+kMkiKxIiI2DTa+maQlwBKApqYmWlpaRhxIa2vrqLafiIot52LLF5xzsRiLnLMsFptJxrfpNY9XDjmwCXgwIjqBZyU9TV
I8zgMulPTnJGPWVEhqjYjDTpJHxHXAdQDNzc0xml8T/jUy+RVbvuCci8VY5Jzl1VArgYWS5kuqIBkJc3neOj8h6VWQDsFwMrAhIt4fEcdFxAnAZ4Cb8guFmZmNncyKRTpc9JUkTy17EvhBRKyWdI2kS9LV7gB2SFoD3AN8NiJ2ZBWTmZmNTKbnLCJiBbAir+3qnOkgGc/n04Ps4zvAd7KJ0MzMhsN3cJuZ2ZBcLMzMbEhFXyz2Huzka3c9zYbd3YUOxcxs3Cr6YhE98PVfPcPTu3qGXtnMrEgVfbGoqy6jvFTs7fBzPczMBlL0xUISM2sqXSzMzAZR9MUCoGFqhYuFmdkgXCyAmTWV7Gt3sTAzG4iLBdBQW8ke9yzMzAbkYgE01CaHoZIbys3MLJ+LBTCztoKuHtjX3lXoUMzMxiUXC5LDUAA7WjsKHImZ2fjkYgHMTIvF9tb2AkdiZjY+uViQnLMA2OFiYWbWLxcLDh2G2u7DUGZm/XKxAGbUJD0LH4YyM+ufiwVQXlpCTblPcJuZDcTFIlVfIfcszMwG4GKRmloh9yzMzAbgYpGqr3TPwsxsIC4Wqak+DGVmNqBMi4WkiyWtlbRO0lUDrHOZpDWSVkv6ftp2lqT707bHJL03yzgB6irE3oNddHT5iXlmZvnKstqxpFJgGXARsAlYKWl5RKzJWWchsBQ4PyJ2SZqVLtoPfCginpF0DPCQpDsiYndW8dZXCoAdbe3Mqa/O6mPMzCakLHsWi4B1EbEhIjqAW4BL89b5GLAsInYBRMTW9P3piHgmnX4R2Ao0ZhgrUyvSYuGT3GZmr5BZzwKYC2zMmd8EnJO3zskAku4DSoG/jYjbc1eQtAioANbnf4CkJcASgKamJlpaWkYcbHn3QUDc/R8r2d6Y5Z9l/GhtbR3V32yiKbZ8wTkXi7HIudDfimXAQmAxMA+4V9IZvYebJM0Bvgt8OCJecTIhIq4DrgNobm6OxYsXjziQrSvuBg5wzIJTWXz2vBHvZyJpaWlhNH+ziabY8gXnXCzGIucsD0NtBo7NmZ+XtuXaBCyPiM6IeBZ4mqR4IKkO+DnwuYh4IMM4geQEN3jIDzOz/mRZLFYCCyXNl1QBXA4sz1vnJyS9CiQ1kByW2pCu/2Pgpoj4YYYx9qksharyEo88a2bWj8yKRUR0AVcCdwBPAj+IiNWSrpF0SbraHcAOSWuAe4DPRsQO4DLgDcAVkh5JX2dlFSuAJBpqK32C28ysH5mes4iIFcCKvLarc6YD+HT6yl3ne8D3soytPzNrK9nmnoWZ2Sv4Du4cjbUV7lmYmfXDxSLHzJpKn+A2M+uHi0WOmbUV7GzroKcnCh2Kmdm44mKRo6G2kq6eYM+BzkKHYmY2rrhY5JhZmzxedUebD0WZmeVyscjRWFsJwHaf5DYzO4yLRY6ZabHYts89CzOzXC4WOWZNdbEwM+uPi0WOaVPKKS+Vb8wzM8vjYpFDEo21lWzd62JhZpbLxSJP49RKtu47WOgwzMzGFReLPI1Tq3zOwswsj4tFnll1lS4WZmZ5XCzyNNZWsqOtg87uVzyYz8ysaLlY5JlVl1w+69FnzcwOcbHIM2tqFYBPcpuZ5XCxyNOY3pjny2fNzA5xscjTdxe3b8wzM+vjYpGnodY9CzOzfC4WeSrKSpg+pZxtrT5nYWbWy8WiH7OmVrlnYWaWI9NiIeliSWslrZN01QDrXCZpjaTVkr6f0/5hSc+krw9nGWe+WXWVbPWNeWZmfcqy2rGkUmAZcBGwCVgpaXlErMlZZyGwFDg/InZJmpW2zwC+ADQDATyUbrsrq3hzNdZWsmFb21h8lJnZhJBlz2IRsC4iNkREB3ALcGneOh8DlvUWgYjYmrb/PnBXROxMl90FXJxhrIdpTIf8iIix+kgzs3Ets54FMBfYmDO/CTgnb52TASTdB5QCfxsRtw+w7dz8D5C0BFgC0NTUREtLy4iDbW1t7dt+78uddHT38PO7Wqit0Ij3Od
7l5lwMii1fcM7FYixyzrJYDPfzFwKLgXnAvZLOGO7GEXEdcB1Ac3NzLF68eMSBtLS00Lv93kdf5Oan/pNTXtPMwqapI97neJebczEotnzBOReLscg5y8NQm4Fjc+bnpW25NgHLI6IzIp4FniYpHsPZNjO9N+b5JLeZWSLLYrESWChpvqQK4HJged46PyHpVSCpgeSw1AbgDuBtkqZLmg68LW0bE31Dfnh8KDMzIMPDUBHRJelKki/5UuCGiFgt6RpgVUQs51BRWAN0A5+NiB0Akr5IUnAAromInVnFmq9vyA/3LMzMgIzPWUTECmBFXtvVOdMBfDp95W97A3BDlvENpLayjOryUt+YZ2aW8h3c/ZCUPovbxcLMDFwsBtRUV8nLe33OwswMXCwG1FRX5WJhZpZysRjA7Loqtuw96Lu4zcxwsRjQ7PoqDnb2sPdAV6FDMTMrOBeLAcyuT57F/dLeAwWOxMys8FwsBjC7LikWW/b4vIWZmYvFAHp7Fj7JbWbmYjGgWVPTw1DuWZiZuVgMpKKshIZa32thZgYuFoOaXV/pcxZmZrhYDGp2XZUPQ5mZ4WIxqNn1vovbzAxcLAY1u66KXfs7OdjZXehQzMwKysViEE11vnzWzAxcLAY1p74a8I15ZmYuFoOYXZ88MW+LexZmVuRcLAYx2z0LMzPAxWJQtZVl1FaWuWdhZkXPxWIITXW+Mc/MLNNiIeliSWslrZN0VT/Lr5C0TdIj6eujOcu+Imm1pCcl/ZMkZRnrQObUV7tnYWZFL7NiIakUWAa8HTgNeJ+k0/pZ9daIOCt9XZ9u+3rgfOBM4NXA64A3ZhXrYGbXV/HSbhcLMytuWfYsFgHrImJDRHQAtwCXDnPbAKqACqASKAdeziTKIcydVs3L+w7S0dVTiI83MxsXyjLc91xgY878JuCcftZ7t6Q3AE8Dn4qIjRFxv6R7gJcAAd+IiCfzN5S0BFgC0NTUREtLy4iDbW1t7Xf7fS93EgE/vbOFximT6xTPQDlPVsWWLzjnYjEWOWdZLIbjNuDmiGiX9HHgRuDNkk4CXgXMS9e7S9KFEfGb3I0j4jrgOoDm5uZYvHjxiANpaWmhv+3L123nhiceZN4pr+G8E2eOeP/j0UA5T1bFli8452IxFjln+VN5M3Bszvy8tK1PROyIiPZ09nrg7HT6XcADEdEaEa3AL4DzMox1QHOnJfdabN7tZ3GbWfHKslisBBZKmi+pArgcWJ67gqQ5ObOXAL2Hml4A3iipTFI5ycntVxyGGgtzpiXjQ73oYmFmRSyzw1AR0SXpSuAOoBS4ISJWS7oGWBURy4FPSLoE6AJ2Alekm/8QeDPwOMnJ7tsj4rasYh1MZVkpjVMr2bzLxcLMilem5ywiYgWwIq/t6pzppcDSfrbrBj6eZWxHYu60ah+GMrOiNrku78nI3OkuFmZW3FwshmFe2rOIiEKHYmZWEC4Ww3DMtGo6unrY3tpR6FDMzArCxWIYfPmsmRU7F4thmDs9KRa+fNbMitWwioWk7w6nbbI6prdn4ctnzaxIDbdncXruTDqi7NkDrDvp1FeXM7WyzIehzKxoDVosJC2VtA84U9Le9LUP2Ar8dEwiHCfmTq9mk3sWZlakBi0WEfEPETEV+GpE1KWvqRExM72hrmj4xjwzK2bDPQz1M0k1AJI+IOn/SDo+w7jGnXnTq9m0c7/vtTCzojTcYvGvwH5JrwH+GlgP3JRZVOPQcTNr2Nfexa79nYUOxcxszA23WHRF8pP6UpIHES0DpmYX1vhz/IwpALywc3+BIzEzG3vDLRb7JC0FPgj8XFIJyaNOi8bxM5Ni8fyOtgJHYmY29oZbLN4LtAN/EhFbSB5k9NXMohqHju3tWexwz8LMis+wikVaIP4dqJf0B8DBiCiqcxZV5aU01VXyvA9DmVkRGu4d3JcBvwP+K3AZ8KCk92QZ2Hh0/Iwa9yzMrCgN9+FHnwNeFxFbASQ1Ar8keaJd0Th2xhTuW7
e90GGYmY254Z6zKOktFKkdR7DtpHH8zCls2XuQg53dhQ7FzGxMDbdncbukO4Cb0/n3kve41GLQe0XUxp37WdhUVFcOm1mRG7RYSDoJaIqIz0r6I+CCdNH9JCe8i8pxM3ovn3WxMLPiMlTP4lpgKUBE/Aj4EYCkM9Jl78w0unGmr1j4iigzKzJDnXdoiojH8xvTthOG2rmkiyWtlbRO0lX9LL9C0jZJj6Svj+YsO07SnZKelLRG0pCfl7UZNRXUVpax0cXCzIrMUD2LaYMsqx5sw/SZF8uAi4BNwEpJyyNiTd6qt0bElf3s4ibg7yPiLkm1QM8QsWZOEsfNmMJzvovbzIrMUD2LVZI+lt+Y9gAeGmLbRcC6iNgQER3ALSRjSw1J0mlAWUTcBRARrRExLn7Oz2+s4dntLhZmVlyG6ll8EvixpPdzqDg0AxXAu4bYdi6wMWd+E3BOP+u9W9IbgKeBT0XERuBkYLekHwHzSe7puCoiDrtmVdISYAlAU1MTLS0tQ4Q0sNbW1mFtX9rWwQs7Ornr7nsoL9GIP288GG7Ok0Wx5QvOuViMRc6DFouIeBl4vaQ3Aa9Om38eEXcfpc+/Dbg5ItolfRy4EXhzGteFwO8BLwC3AlcA386L7zrgOoDm5uZYvHjxiANpaWlhONvvmbaZ5esf4fjTmzl5gl8RNdycJ4tiyxecc7EYi5yHOzbUPRHxz+lruIViM3Bszvy8tC13vzsioj2dvZ5Dz/XeBDySHsLqAn4CvHaYn5upBQ21AGzY1lrgSMzMxk6Wd2GvBBZKmi+pArgcWJ67gqQ5ObOXAE/mbDstHVYEkt5G/onxgljQWAPA+m0+b2FmxWO4d3AfsYjoknQlcAdQCtwQEaslXQOsiojlwCckXQJ0ATtJDjUREd2SPgP8SpJIzpd8K6tYj0RNZRlz6qtYv9U9CzMrHpkVC4CIWEHesCARcXXO9FLSm/762fYu4Mws4xupBY01rPcVUWZWRIpuMMCj4cTGWjZsbSV50qyZ2eTnYjECCxpq2NfexbbW9qFXNjObBFwsRuDEWckVUeu3+lCUmRUHF4sROLExLRa+fNbMioSLxQjMrqtiSkUp63xFlJkVCReLESgpESc3TeWpLXsLHYqZ2ZhwsRihU2dPZe2Wfb4iysyKgovFCJ06eyq79neybZ+viDKzyc/FYoROmV0HwJNb9hU4EjOz7LlYjNCps5MRZ9f6vIWZFQEXixGaXlNBU10lT7lnYWZFwMViFE6ZXcdaFwszKwIuFqNw6uypPLO1la7ugj8e3MwsUy4Wo3BK01Q6unp4boeH/TCzyc3FYhReNSe5Imr1iz7JbWaTm4vFKCxsqqWyrITHN+0pdChmZplysRiF8tISTjumjsc2u1iY2eTmYjFKZ86tZ/XmPXT3eNgPM5u8XCxG6Yx502jr6ObZ7R6B1swmLxeLUTpzXj0Aj/m8hZlNYi4Wo3RiYy3V5aUuFmY2qWVaLCRdLGmtpHWSrupn+RWStkl6JH19NG95naRNkr6RZZyjUVoiXj23jsd9ktvMJrHMioWkUmAZ8HbgNOB9kk7rZ9VbI+Ks9HV93rIvAvdmFePRcsbcaax+cQ+dvpPbzCapLHsWi4B1EbEhIjqAW4BLh7uxpLOBJuDOjOI7as4+fjoHO3t8c56ZTVplGe57LrAxZ34TcE4/671b0huAp4FPRcRGSSXAPwIfAN460AdIWgIsAWhqaqKlpWXEwba2to54+46DSY/ill+uZPf88hHHMNZGk/NEVGz5gnMuFmORc5bFYjhuA26OiHZJHwduBN4M/DmwIiI2SRpw44i4DrgOoLm5ORYvXjziQFpaWhjN9l977B52lU1l8eLmEe9jrI0254mm2PIF51wsxiLnLIvFZuDYnPl5aVufiNiRM3s98JV0+jzgQkl/DtQCFZJaI+IVJ8nHi+YTpvPrtduICAYrcGZmE1GW5yxWAgslzZdUAVwOLM9dQdKcnNlLgCcBIuL9EX
FcRJwAfAa4aTwXCoDXnTCDHW0dPLvdI9Ca2eSTWc8iIrokXQncAZQCN0TEaknXAKsiYjnwCUmXAF3ATuCKrOLJ2utOmA7Ayud2sqCxtsDRmJkdXZmes4iIFcCKvLarc6aXAkuH2Md3gO9kEN5RdWJjLdOnlPPgszt57+uOK3Q4ZmZHle/gPkok8foTG7hv3XYiPKigmU0uLhZH0YULG3h5bzvPbPWggmY2ubhYHEUXLGwA4DfPbC9wJGZmR5eLxVE0b/oUFjTU8JtnthU6FDOzo8rF4ii7YGEDD27YSXtXd6FDMTM7alwsjrILFzZyoLObVc/tKnQoZmZHjYvFUXb+STOpLCvhrjUvFzoUM7OjxsXiKJtSUcaFCxu5c/UWX0JrZpOGi0UG3nZ6Ey/uOcgTmz1kuZlNDi4WGXjrq5ooEdyxekuhQzEzOypcLDIwo6aCRfNncLsPRZnZJOFikZH/cuYxrNvaypqXfCjKzCY+F4uMvPPMOZSXih89vHnolc3MxjkXi4xMm1LBW05t4qePbKaru6fQ4ZiZjYqLRYb+6LVz2d7a4bGizGzCc7HI0OJTZjGjpoLv/+6FQodiZjYqLhYZqigr4X2LjuVXT77Mxp37Cx2OmdmIuVhk7APnHo8kvvvA84UOxcxsxFwsMjanvpqLXz2bW373Avs7ugodjpnZiLhYjIE/OX8+ew928e8P+NyFmU1MmRYLSRdLWitpnaSr+ll+haRtkh5JXx9N28+SdL+k1ZIek/TeLOPM2tnHT+fChQ1889fr3bswswkps2IhqRRYBrwdOA14n6TT+ln11og4K31dn7btBz4UEacDFwPXSpqWVaxj4ZNvXciOtg6+53MXZjYBZdmzWASsi4gNEdEB3AJcOpwNI+LpiHgmnX4R2Ao0ZhbpGDj7+BlcuLCBf2lZz+79HYUOx8zsiCirge4kvQe4OCJ6Dy19EDgnIq7MWecK4B+AbcDTwKciYmPefhYBNwKnR0RP3rIlwBKApqams2+55ZYRx9va2kptbe2Itx+Ojft6+MJ/HGDxsWV86LTKTD9rOMYi5/Gk2PIF51wsRpPzm970pocionnIFSMikxfwHuD6nPkPAt/IW2cmUJlOfxy4O2/5HGAtcO5Qn3f22WfHaNxzzz2j2n64vvDTJ2L+VT+LJzbvHpPPG8xY5TxeFFu+Ec65WIwmZ2BVDOM7PcvDUJuBY3Pm56VtuYVqR0S0p7PXA2f3LpNUB/wc+FxEPJBhnGPqU289mRk1FXz2/z5GR5fHjDKziSHLYrESWChpvqQK4HJgee4KkubkzF4CPJm2VwA/Bm6KiB9mGOOYq59Szj/80ZmseWkv1/7y6UKHY2Y2LJkVi4joAq4E7iApAj+IiNWSrpF0SbraJ9LLYx8FPgFckbZfBrwBuCLnstqzsop1rF10WhOXNc/jm79ez289yKCZTQBlWe48IlYAK/Lars6ZXgos7We77wHfyzK2Qrv6nafzyMbd/MX3H2b5ledz/MyaQodkZjYg38FdILWVZXzrQ81I8Kc3rmJnmy+nNbPxy8WigI6fWcM3P3A2G3fu5wPXP8ie/Z2FDsnMrF8uFgV27oKZXPehZtZtbeWPr3+Al/ceLHRIZmav4GIxDrzx5Ea+9eFmntvexruW3ceTL+0tdEhmZodxsRgn3nhyIz/4s/PoCfjDZffx3Qee770x0cys4FwsxpHTj6ln+V+ez7kLZvI3P3mCj3xnJc/vaCt0WGZmLhbjzaypVfzbFa/jC+88jZXP7uSir93LV+94yie/zaygXCzGoZIS8ZHz53P3ZxbzjlfPZtk96zn/f93NV25/iq0+AW5mBeBiMY411VVx7eW/xy/+6kLeeEoj//rr9Zz35btZctMq7nlqK53dHlvKzMZGpndw29Hxqjl1LPvj1/Ls9jZu+d0L/PChTdy55mXqq8t5y6tm8funz+b8kxqorfR/TjPLhr9dJpD5DTUsfcer+Ou3nULL2q3cvnoLv3
pyKz96eDOlJeKMufWcd+JMzl0wk9fMq2falIpCh2xmk4SLxQRUUVbC206fzdtOn01ndw8rn93Jf6zfwf0bdvCtezfwry3rAThuxhTOmFfPGXPredWcOhbOqmVOfRWSCpyBmU00LhYTXHlpCa8/qYHXn9QAQFt7F//5wm4e37yHxzfv5tGNu/n5Yy/1rV9TUcqJs2o5aVYtJa0d7J/5EsdOn8KxM6qpry53ITGzfrlYTDI1lWVcsLCBCxY29LXtbOvg6Zf3sW5ra9/rP9btYMveTn749MN9602tLGPejCkcO72aY2dMYd70aubUV9FUV8Xs+ioaayspK/U1EWbFyMWiCMyoqeDcBcm5jFy/+OU9HHfaa9m48wCbdu1n4879bNx1gA3b27j3mW0c7Dz8aqsSQUNt5WEFpKmuitl1VTRMrWRmTQWNUyuZUVNBuYuK2aTiYlHEqsvE6cfUc/ox9a9YFhHsaOtgy56DvLz3IFv2HuTlPQd5aU8y/dyONh7YsIO9B7v63fe0KeXMrKmgobYyfVUw87DpCuqrK5g+pZz66nL3WMzGORcL65ekvi/6V899ZTHptb+ji61729nR1s62fR1sb21nR2v63tbO9n0dPPnSXra3tg9YWACmVpUxfUpaPNL36VMqqK8uT6ZrKpg2pYJp1eVMm1JOXVU5U6vKXGTMxoiLhY3KlIoyTmgo44SGoZ/0197Vzc62Drbv62BHWzt7DnSyq62DXfs7k+n96fT+Dp7b3sau/R3sG6TAAFSXl1JXXcbUtHj0FpGpVeXUVZVRV907X8bzW7uY8uzOvvnayjKmVJRRUeaCYzYUFwsbM5Vlpcypr2ZOffWwt+nq7mHPgU52H+hk9/4OdrUl0/sOdrLvYBf7Dnay90AX+9qT9937O9i4cz97D3ay92AXHV2Hn3e59uH7X/EZFaUlTKkspaaijJrKUqak78l8GVMqSvsKS/7yKZXJsuryUqrKS6muKO2bLi3xlWU2ebhY2LhWVlrCzNpKZtZWjmj7g53dfUWl5b4HWXj6mew72MXeA53s7+imrb2Lto5u9nd00drexf72bto6umhr72JHawdtHUlba3sX7V1HNrxKRWkJVeUlVFekhST2RFh1AAALlUlEQVQtIlXlJVSnhaWq/NCy6nRZbtGpLCulsqyEirKSvvdkOmmvzJmvKCtxgbLMuFjYpNb7Zdw4tZIXppVy4cLGEe+rq7uH/Z1pgWk/VGDa2rs52NnNgc7k/WBnNwc6eg6f7+zmQEfy3t7Zw/bWjrz1k2U9o3yESVmJcgpKCT2dHdQ//GsqSkuoLC9J30uT97wCVFZSQnmZqCg9NF1eUkJ5qSgrTbYtKxXlpUlbeWkJZel0Rc50srykL5ayElFeVtK3r9IS+X6eCSjTYiHpYuDrQClwfUR8OW/5FcBXgc1p0zci4vp02YeBz6ftX4qIG7OM1WwoZaUl1JWWUFdVnsn+I4LO7nhFkeno6ul7tfe9uvvmc987upNi1NHdQ3tnDy+8+CLTZ9Yetu3eA53p+t19bZ3dPXR1Bx3dyXTWz92qKD1UhHoLSFlJSfqezJeWiLJSUVpS0td2+HvaXnp4+9Yt7fxq9xOH1ivNWz9/P6X9tZf0xVBaklzwUapkvkSH2kty2gZq793PobZkZOn8/ZWIcV1EMysWkkqBZcBFwCZgpaTlEbEmb9VbI+LKvG1nAF8AmoEAHkq33ZVVvGaFJomKsuTXeH310SlILS07Wbz47CPerrsn6Ow+VEQ6u3vo7Ak608LSmbZ19fTQ0RV09eS1p4Wnb9vDlvXQ0R109bb3BD09QVdP0N33nmzbfVh7sv6BznS++1B77nr7D3bz2K6X6Oruydt+/D95skT09bxKc4rIKwvO4YWnsfQgixdnG1uWPYtFwLqI2AAg6RbgUiC/WPTn94G7ImJnuu1dwMXAzRnFamY5ki+n5BDeRNPS0sLifr45I4Ke4PDi0n14McotLp3dPfT0QE8E3Z
EUtO6e3mle0RYRdPfT3pPz3hNJIe6JvOW520WyvLf9lev2fh5902rtyPzvmmWxmAtszJnfBJzTz3rvlvQG4GngUxGxcYBt5+ZvKGkJsASgqamJlpaWEQfb2to6qu0nomLLudjyBeecNZF8iY7qi7SEUT9ZqLW1I/OcC32C+zbg5ohol/Rx4EbgzcPdOCKuA64DaG5ujv5+TQzXQL9GJrNiy7nY8gXnXCzGIucs70baDBybMz+PQyeyAYiIHRHRns5eD5w93G3NzGzsZFksVgILJc2XVAFcDizPXUHSnJzZS4An0+k7gLdJmi5pOvC2tM3MzAogs8NQEdEl6UqSL/lS4IaIWC3pGmBVRCwHPiHpEqAL2AlckW67U9IXSQoOwDW9J7vNzGzsZXrOIiJWACvy2q7OmV4KLB1g2xuAG7KMz8zMhscjqJmZ2ZBcLMzMbEguFmZmNiRF1oPAjBFJ24DnR7GLBmD7UQpnoii2nIstX3DOxWI0OR8fEUOOsDlpisVoSVoVEc2FjmMsFVvOxZYvOOdiMRY5+zCUmZkNycXCzMyG5GJxyHWFDqAAii3nYssXnHOxyDxnn7MwM7MhuWdhZmZDcrEwM7MhFX2xkHSxpLWS1km6qtDxjIakGyRtlfRETtsMSXdJeiZ9n562S9I/pXk/Jum1Odt8OF3/mfRZ6OOWpGMl3SNpjaTVkv4qbZ+0eUuqkvQ7SY+mOf9d2j5f0oNpbremoz0jqTKdX5cuPyFnX0vT9rWSfr8wGQ2PpFJJ/ynpZ+n8ZM/3OUmPS3pE0qq0rXD/riN9HGAxvkhGw10PLAAqgEeB0wod1yjyeQPwWuCJnLavAFel01cB/yudfgfwC5KHfZ0LPJi2zwA2pO/T0+nphc5tkJznAK9Np6eSPHHxtMmcdxp7bTpdDjyY5vID4PK0/ZvAf0un/xz4Zjp9Oclz70n/To8ClcD89P+F0kLnN0jenwa+D/wsnZ/s+T4HNOS1FezfdbH3LPqeEx4RHUDvc8InpIi4l2So91yXkjyBkPT9D3Pab4rEA8C09Pkifc8/j4hdQO/zz8eliHgpIh5Op/eRPBNlLpM47zT21nS2PH0FyVMmf5i25+fc+7f4IfAWSUrbb4mI9oh4FlhH8v/EuCNpHvBfSB6SRhr/pM13EAX7d13sxWJYz/qe4Joi4qV0egvQlE4PlPuE/Zukhxt+j+SX9qTOOz0k8wiwleQLYD2wOyK60lVy4+/LLV2+B5jJxMr5WuC/Az3p/Ewmd76Q/AC4U9JDkpakbQX7d13oZ3DbGIqIkDQpr5WWVAv8P+CTEbE3+SGZmIx5R0Q3cJakacCPgVMLHFJmJP0BsDUiHpK0uNDxjKELImKzpFnAXZKeyl041v+ui71nUQzP+n457Y72PsZ2a9o+UO4T7m8iqZykUPx7RPwobZ70eQNExG7gHuA8kkMPvT8Ac+Pvyy1dXg/sYOLkfD5wiaTnSA4Vvxn4OpM3XwAiYnP6vpXkB8EiCvjvutiLxZDPCZ8ElgO9V0B8GPhpTvuH0qsozgX2pN3bCfX88/RY9LeBJyPi/+QsmrR5S2pMexRIqgYuIjlXcw/wnnS1/Jx7/xbvAe6O5OzncuDy9Oqh+cBC4Hdjk8XwRcTSiJgXESeQ/D96d0S8n0maL4CkGklTe6dJ/j0+QSH/XRf6jH+hXyRXETxNcsz3c4WOZ5S53Ay8BHSSHJv8U5Jjtb8CngF+CcxI1xWwLM37caA5Zz9/QnLybx3wkULnNUTOF5Ac230MeCR9vWMy5w2cCfxnmvMTwNVp+wKSL791wP8FKtP2qnR+Xbp8Qc6+Ppf+LdYCby90bsPIfTGHroaatPmmuT2avlb3fjcV8t+1h/swM7MhFfthKDMzGwYXCzMzG5KLhZmZDcnFwszMhuRiYWZmQ3KxsAlF0sx0FM5HJG2RtDlnvmKY+/g3SacMsc5fSHr/UYr5t5LOklSiozyysaQ/kTQ7Z37I3MxGwpfO2oQl6W
+B1oj433ntIvm33dPvhmNM0m+BK0nuidgeEdOOcPvSSIb3GHDfEfHI6CM1G5h7FjYpSDpJyTMt/p3kJqY5kq6TtErJMx+uzlm395d+maTdkr6s5NkQ96fj8CDpS5I+mbP+l5U8Q2KtpNen7TWS/l/6uT9MP+usQcL8MjA17QXdlO7jw+l+H5H0L2nvozeuayU9BiyS9HeSVkp6QtI30zt13wucBdza27PqzS3d9weUPA/hCUn/M20bLOfL03UflXTPUf5PZBOci4VNJqcCX4uI0yIZV+eqiGgGXgNcJOm0frapB34dEa8B7ie527U/iohFwGeB3sLzl8CWiDgN+CLJiLeDuQrYFxFnRcSHJL0aeBfw+og4i2Rgz8tz4ro3Is6MiPuBr0fE64Az0mUXR8StJHesvzfdZ0dfsMmQ3l8C3pTGdb6SAfkGy/kLwFvS9ncNkYsVGRcLm0zWR8SqnPn3SXoYeBh4FcnDb/IdiIhfpNMPAScMsO8f9bPOBSQD2xERvcMyHIm3Aq8DVikZbvyNwInpsg6SweN6vUXS70iGf3gjcPoQ+z6HZEyk7RHRSfLQoDekywbK+T7gJkkfxd8NlsdDlNtk0tY7IWkh8FfAoojYLel7JGMG5evIme5m4P8n2oexzpEScENE/M1hjclIqQeid9AfaQrwDZInAm6W9CX6z2W4Bsr5YyRF5g+AhyX9XiQPzDHzrwebtOqAfcBeHXpi2NF2H3AZgKQz6L/n0ifSB/Xo0LDavwQuk9SQts+UdFw/m1aTPPRnezoS6btzlu0jeZxsvgeBN6X77D289esh8lkQyVPW/gbYxfh+MJCNMfcsbLJ6GFgDPAU8T/LFfrT9M8lhmzXpZ60heSrbYL4NPCZpVXre4u+AX0oqIRkt+M+AF3M3iIgdkm5M9/8SSSHo9W/A9ZIOkPOI0IjYJOlvgBaSHsxtEfHznELVn68pGbpbwJ0R8cQQuVgR8aWzZiOUfvGWRcTB9LDXncDCOPSoT7NJwz0Ls5GrBX6VFg0BH3ehsMnKPQszMxuST3CbmdmQXCzMzGxILhZmZjYkFwszMxuSi4WZmQ3p/wPKnDLWl/tfQQAAAABJRU5ErkJggg==\\n\",\n      \"text/plain\": [\n       \"<Figure size 432x288 with 1 Axes>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"plt.grid()\\n\",\n    \"plt.plot(range(num_iterations),cost)\\n\",\n    \"\\n\",\n    \"plt.title('Cost Function')\\n\",\n    \"plt.xlabel('Training Iterations')\\n\",\n    \"plt.ylabel('Cost')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you can notice, the loss decreases over the training iterations. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, we have learned how to build a neural network from scratch in the next chapter we will one of the popularly used deep learning libraries called TensorFlow. 
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "07. Deep learning foundations/README.md",
    "content": "# [Chapter 7. Deep Learning Foundations](#)\n\n* 7.1. Biological and artifical neurons\n* 7.2. ANN and its layers \n* 7.3. Exploring activation functions \n* 7.4. Forward and backward propagation in ANN\n\t* 7.4.1. Forward propgation in ANN \n\t* 7.4.2. Backward propagation in ANN\n* 7.5. Building neural network from scratch \n* 7.6. Recurrent neural networks \n\t* 7.6.1. Difference between feedforward networks and RNN\n\t* 7.6.2. Forward propagation in RNN \n\t* 7.6.3. Backpropagation through time \n* 7.7. LSTM-RNN\n\t* 7.7.1. Understanding the LSTM cells\n* 7.8. Convolutional neural networks\n\t* 7.8.1. Understanding every layers in CNN \n\t* 7.8.2. Architecture of CNN\n* 7.9. Generative adversarial networks \n\t* 7.9.1. Understanding generator \n\t* 7.9.2. Understanding discriminator \n\t* 7.9.3. Architecture of GAN \n\t* 9.9.4. Understanding the loss function "
  },
  {
    "path": "08. A primer on TensorFlow/.ipynb_checkpoints/8.05 Handwritten digits classification using TensorFlow-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Handwritten digits classification using TensorFlow\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Putting all the concepts we have learned so far, we will see how can use tensorflow to\\n\",\n    \"build a neural network to recognize handwritten digits. If you are playing around deep\\n\",\n    \"learning off late then you must have come across MNIST dataset. It is being called the hello\\n\",\n    \"world of deep learning. \\n\",\n    \"\\n\",\n    \"It consists of 55,000 data points of handwritten digits (0 to 9).\\n\",\n    \"In this section, we will see how can we use our neural network to recognize the\\n\",\n    \"handwritten digits and also we will get hang of tensorflow and tensorboard.\\n\",\n    \" \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Import required libraries\\n\",\n    \"\\n\",\n    \"As a first step, let us import all the required libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"import tensorflow as tf\\n\",\n    \"from tensorflow.examples.tutorials.mnist import input_data\\n\",\n    \"tf.logging.set_verbosity(tf.logging.ERROR)\\n\",\n    \"\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"%matplotlib inline\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"1.13.1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print tf.__version__\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Load 
the Dataset\\n\",\n    \"\\n\",\n    \"In the below code, \\\"data/mnist\\\" implies the location where we store the MNIST dataset.\\n\",\n    \"one_hot=True implies we are one-hot encoding the labels (0 to 9):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.\\n\",\n      \"Extracting data/mnist/train-images-idx3-ubyte.gz\\n\",\n      \"Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.\\n\",\n      \"Extracting data/mnist/train-labels-idx1-ubyte.gz\\n\",\n      \"Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.\\n\",\n      \"Extracting data/mnist/t10k-images-idx3-ubyte.gz\\n\",\n      \"Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.\\n\",\n      \"Extracting data/mnist/t10k-labels-idx1-ubyte.gz\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"mnist = input_data.read_data_sets(\\\"data/mnist\\\", one_hot=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check what we got in our data:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"No of images in training set (55000, 784)\\n\",\n      \"No of labels in training set (55000, 10)\\n\",\n      \"No of images in test set (10000, 784)\\n\",\n      \"No of labels in test set (10000, 10)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"No of images in training set {}\\\".format(mnist.train.images.shape))\\n\",\n    \"print(\\\"No of labels in training set {}\\\".format(mnist.train.labels.shape))\\n\",\n    \"\\n\",\n    \"print(\\\"No of images in test set {}\\\".format(mnist.test.images.shape))\\n\",\n    
\"print(\\\"No of labels in test set {}\\\".format(mnist.test.labels.shape))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We have 55,000 images in the training set and each image is of size 784 and we have 10 labels which are actually 0 to 9. Similarly, we have 10000 images in the test set.\\n\",\n    \"\\n\",\n    \"Now we plot one image to see how it looks like:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<matplotlib.image.AxesImage at 0x7f7bfa160bd0>\"\n      ]\n     },\n     \"execution_count\": 5,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    },\n    {\n     \"data\": {\n      \"image/png\": \"iVBORw0KGgoAAAANSUhEUgAAAP8AAAD8CAYAAAC4nHJkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADetJREFUeJzt3X+o1XWex/HXO2fsh4roertJo3snuSxUtI4cLBtZZmlnamLAJqImQQxCIybYoRGmHGGjP+KyrA1Cy5CzyWi4OUsqSsSuJUsmbIMnszJt14o7qPnjasFk/uF65z1/3K/Dre73c07n+z3ne+59Px9wued8398fb7718nvO+Zz7/Zi7C0A8l1XdAIBqEH4gKMIPBEX4gaAIPxAU4QeCIvxAUIQfCIrwA0F9o5MHmzVrlvf19XXykEAog4ODOnPmjDWzbqHwm9kdktZJmiTp39x9ILV+X1+f6vV6kUMCSKjVak2v2/LLfjObJOlfJf1Q0vWS7jez61vdH4DOKvKef6GkD9z9I3e/IGmLpCXltAWg3YqE/1pJR0c9P5Yt+wIzW2lmdTOrDw0NFTgcgDK1/dN+d1/v7jV3r/X09LT7cACaVCT8xyXNGfX8W9kyAONAkfDvk9RvZt82s8mSfiJpZzltAWi3lof63P2imT0i6b80MtS3wd3fK60zAG1VaJzf3V+W9HJJvQDoIL7eCwRF+IGgCD8QFOEHgiL8QFCEHwiK8ANBEX4gKMIPBEX4gaAIPxAU4QeCIvxAUIQfCIrwA0ERfiAowg8ERfiBoAg/EBThB4Ii/EBQhB8IivADQRF+ICjCDwRF+IGgCD8QFOEHgiL8QFCFZuk1s0FJn0kalnTR3WtlNAWg/QqFP/P37n6mhP0A6CBe9gNBFQ2/S9plZm+a2coyGgLQGUVf9i929+NmdrWkV8zsfXffM3qF7B+FlZI0d+7cgocDUJZCV353P579Pi1pu6SFY6yz3t1r7l7r6ekpcjgAJWo5/GY2xcymXXos6QeSDpbVGID2KvKyv1fSdjO7tJ9/d/f/LKUrAG3Xcvjd/SNJf1tiLwA6iKE+ICjCDwRF+IGgCD8QFOEHgiL8QFBl/FUfKvbqq6/m1rLvYeSaMWNGsn7wYPp7W4sWLUrW+/v7k3VUhys/EBThB4Ii/E
BQhB8IivADQRF+ICjCDwQ1Ycb59+zZk6y/8cYbyfratWvLbKejzp492/K2kyZNStYvXLiQrF911VXJ+tSpU3NrixcvTm77/PPPFzo20rjyA0ERfiAowg8ERfiBoAg/EBThB4Ii/EBQ42qcf2BgILe2Zs2a5LbDw8NltzMhFD0v58+fb7m+bdu25LaN7kWwcePGZH3KlCnJenRc+YGgCD8QFOEHgiL8QFCEHwiK8ANBEX4gqIbj/Ga2QdKPJJ129xuzZTMl/U5Sn6RBSfe6+6fta3PEs88+m1trNF59yy23JOvTpk1rqacy3Hbbbcn63Xff3aFOvr5du3Yl6+vWrcutHTlyJLnt1q1bW+rpkk2bNuXWuBdAc1f+30q640vLHpO02937Je3OngMYRxqG3933SPrkS4uXSLr09aqNku4quS8Abdbqe/5edz+RPT4pqbekfgB0SOEP/NzdJXle3cxWmlndzOpDQ0NFDwegJK2G/5SZzZak7PfpvBXdfb2719y91tPT0+LhAJSt1fDvlLQ8e7xc0o5y2gHQKQ3Db2YvSPofSX9jZsfM7EFJA5K+b2ZHJP1D9hzAOGIjb9k7o1areb1eb3n7M2fO5NY+/PDD5Lbz589P1i+//PKWekLap5/mf/2j0fcb3nrrrULH3rx5c25t6dKlhfbdrWq1mur1evpGCBm+4QcERfiBoAg/EBThB4Ii/EBQhB8IalwN9WFiaTRt+qJFiwrtv7c3/09OTp48WWjf3YqhPgANEX4gKMIPBEX4gaAIPxAU4QeCIvxAUIQfCIrwA0ERfiAowg8ERfiBoAg/EBThB4Ii/EBQDafoBorYsSN/Ppe9e/e29diff/55bu3o0aPJbefMmVN2O12HKz8QFOEHgiL8QFCEHwiK8ANBEX4gKMIPBNVwnN/MNkj6kaTT7n5jtuwJSSskDWWrrXb3l9vVJNLOnTuXW9u+fXty2zVr1pTdzhekxtPbPWdE6rzcdNNNyW1TU4tPFM1c+X8r6Y4xlv/K3ednPwQfGGcaht/d90j6pAO9AOigIu/5HzGzd8xsg5nNKK0jAB3Ravh/LWmepPmSTkham7eima00s7qZ1YeGhvJWA9BhLYXf3U+5+7C7/0nSbyQtTKy73t1r7l7r6elptU8AJWsp/GY2e9TTH0s6WE47ADqlmaG+FyR9T9IsMzsm6Z8kfc/M5ktySYOSHmpjjwDaoGH43f3+MRY/14Zewjp06FCyvm/fvmR9YGAgt/b++++31NNEt2rVqqpbqBzf8AOCIvxAUIQfCIrwA0ERfiAowg8Exa27S3D27Nlk/eGHH07WX3zxxWS9nX/6Om/evGT9mmuuKbT/Z555Jrc2efLk5LZLly5N1t9+++2WepKkuXPntrztRMGVHwiK8ANBEX4gKMIPBEX4gaAIPxAU4QeCYpy/SVu2bMmtPfnkk8ltDx8+nKxPmzYtWZ85c2ay/tRTT+XWGk013egW1tOnT0/W26nonZ9Svd9+++2F9j0RcOUHgiL8QFCEHwiK8ANBEX4gKMIPBEX4gaAY52/Sa6+9lltrNI7/wAMPJOurV69O1vv7+5P18er48ePJeqNbmjdyxRVX5NauvvrqQvueCLjyA0ERfiAowg8ERfiBoAg/EBThB4Ii/EBQDcf5zWyOpE2SeiW5pPXuvs7MZkr6naQ+SYOS7nX3T9vXarWefvrp3NqCBQuS265YsaLsdiaEo0ePJusff/xxof3fc889hbaf6Jq58l+U9HN3v17SLZJ+ambXS3pM0m5375e0O3sOYJxoGH53P+Hu+7PHn0k6LOlaSUskbcxW2yjprnY1CaB8X+s9v5n1SfqOpN9L6nX3E1nppEbeFgAYJ5oOv5lNlbRV0s/c/Y+jaz4ymdyYE8qZ2Uozq5tZfWhoqFCzAMrTVPjN7JsaCf5md9+WLT5lZrOz+mxJp8fa1t3Xu3vN3WtFb8gIoDwNw29mJuk5SYfdffRH3jslLc8eL5e0o/z2AL
RLM3/S+11JyyS9a2YHsmWrJQ1I+g8ze1DSHyTd254Wu8OVV16ZW2MorzWpP5NuRqNbmj/66KOF9j/RNQy/u++VZDnl28ptB0Cn8A0/ICjCDwRF+IGgCD8QFOEHgiL8QFDcuhttdfPNN+fW9u/fX2jf9913X7J+3XXXFdr/RMeVHwiK8ANBEX4gKMIPBEX4gaAIPxAU4QeCYpwfbZWavvzixYvJbWfMmJGsr1q1qqWeMIIrPxAU4QeCIvxAUIQfCIrwA0ERfiAowg8ExTg/Cnn99deT9fPnz+fWpk+fntz2pZdeStb5e/1iuPIDQRF+ICjCDwRF+IGgCD8QFOEHgiL8QFANx/nNbI6kTZJ6Jbmk9e6+zsyekLRC0lC26mp3f7ldjaIaw8PDyfrjjz+erE+ePDm3tmLFiuS2t956a7KOYpr5ks9FST939/1mNk3Sm2b2Slb7lbv/S/vaA9AuDcPv7ickncgef2ZmhyVd2+7GALTX13rPb2Z9kr4j6ffZokfM7B0z22BmY95zycxWmlndzOpDQ0NjrQKgAk2H38ymStoq6Wfu/kdJv5Y0T9J8jbwyWDvWdu6+3t1r7l7r6ekpoWUAZWgq/Gb2TY0Ef7O7b5Mkdz/l7sPu/idJv5G0sH1tAihbw/CbmUl6TtJhd3961PLZo1b7saSD5bcHoF2a+bT/u5KWSXrXzA5ky1ZLut/M5mtk+G9Q0kNt6RCVGvm3P99DD6X/sy9YsCC3dsMNN7TUE8rRzKf9eyWN9X8AY/rAOMY3/ICgCD8QFOEHgiL8QFCEHwiK8ANBcetuJF12Wfr6sGzZsg51grJx5QeCIvxAUIQfCIrwA0ERfiAowg8ERfiBoMzdO3cwsyFJfxi1aJakMx1r4Ovp1t66tS+J3lpVZm9/7e5N3S+vo+H/ysHN6u5eq6yBhG7trVv7kuitVVX1xst+ICjCDwRVdfjXV3z8lG7trVv7kuitVZX0Vul7fgDVqfrKD6AilYTfzO4ws/81sw/M7LEqeshjZoNm9q6ZHTCzesW9bDCz02Z2cNSymWb2ipkdyX6POU1aRb09YWbHs3N3wMzurKi3OWb232Z2yMzeM7N/zJZXeu4SfVVy3jr+st/MJkn6P0nfl3RM0j5J97v7oY42ksPMBiXV3L3yMWEz+ztJ5yRtcvcbs2X/LOkTdx/I/uGc4e6/6JLenpB0ruqZm7MJZWaPnlla0l2SHlCF5y7R172q4LxVceVfKOkDd//I3S9I2iJpSQV9dD133yPpky8tXiJpY/Z4o0b+5+m4nN66grufcPf92ePPJF2aWbrSc5foqxJVhP9aSUdHPT+m7pry2yXtMrM3zWxl1c2MoTebNl2STkrqrbKZMTScubmTvjSzdNecu1ZmvC4bH/h91WJ3XyDph5J+mr287Uo+8p6tm4Zrmpq5uVPGmFn6L6o8d63OeF22KsJ/XNKcUc+/lS3rCu5+PPt9WtJ2dd/sw6cuTZKa/T5dcT9/0U0zN481s7S64Nx104zXVYR/n6R+M/u2mU2W9BNJOyvo4yvMbEr2QYzMbIqkH6j7Zh/eKWl59ni5pB0V9vIF3TJzc97M0qr43HXdjNfu3vEfSXdq5BP/DyX9sooecvq6TtLb2c97Vfcm6QWNvAz8f418NvKgpL+StFvSEUmvSprZRb09L+ldSe9oJGizK+ptsUZe0r8j6UD2c2fV5y7RVyXnjW/4AUHxgR8QFOEHgiL8QFCEHwiK8ANBEX4gKMIPBEX4gaD+DP+BRwSusE7dAAAAAElFTkSuQmCC\\n\",\n      \"text/plain\": [\n       \"<Figure size 432x288 with 1 Axes>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"img1 = 
mnist.train.images[0].reshape(28,28)\\n\",\n    \"plt.imshow(img1, cmap='Greys')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Define the number of neurons in each layer\\n\",\n    \"\\n\",\n    \"We build a 4-layer neural network with 3 hidden layers and 1 output layer. Since the size of\\n\",\n    \"the input image is 784, we set num_input to 784, and since we have 10 handwritten\\n\",\n    \"digits (0 to 9), we set 10 neurons in the output layer. We define the number of neurons in\\n\",\n    \"each layer as follows:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#number of neurons in input layer\\n\",\n    \"num_input = 784  \\n\",\n    \"\\n\",\n    \"#number of neurons in hidden layer 1\\n\",\n    \"num_hidden1 = 512  \\n\",\n    \"\\n\",\n    \"#number of neurons in hidden layer 2\\n\",\n    \"num_hidden2 = 256  \\n\",\n    \"\\n\",\n    \"#number of neurons in hidden layer 3\\n\",\n    \"num_hidden_3 = 128  \\n\",\n    \"\\n\",\n    \"#number of neurons in output layer\\n\",\n    \"num_output = 10  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining placeholders\\n\",\n    \"\\n\",\n    \"As we learned, we first need to define the placeholders for input and output. 
Values for\\n\",\n    \"the placeholders will be fed at run time through feed_dict:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('input'):\\n\",\n    \"    X = tf.placeholder(\\\"float\\\", [None, num_input])\\n\",\n    \"\\n\",\n    \"with tf.name_scope('output'):\\n\",\n    \"    Y = tf.placeholder(\\\"float\\\", [None, num_output])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since we have a 4-layer network, we have 4 weights and 4 biases. We initialize our weights\\n\",\n    \"by drawing values from the truncated normal distribution with a standard deviation of\\n\",\n    \"0.1. \\n\",\n    \"\\n\",\n    \"Remember, the dimensions of the weights matrix should be the number of neurons in the\\n\",\n    \"previous layer x the number of neurons in the current layer. For instance, the dimension of\\n\",\n    \"weight matrix w3 should be the number of neurons in hidden layer 2 x the number of\\n\",\n    \"neurons in hidden layer 3.\\n\",\n    \"\\n\",\n    \"We often define all the weights in a dictionary, as given below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('weights'):\\n\",\n    \"    \\n\",\n    \"    weights = {\\n\",\n    \"        'w1': tf.Variable(tf.truncated_normal([num_input, num_hidden1], stddev=0.1),name='weight_1'),\\n\",\n    \"        'w2': tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1),name='weight_2'),\\n\",\n    \"        'w3': tf.Variable(tf.truncated_normal([num_hidden2, num_hidden_3], stddev=0.1),name='weight_3'),\\n\",\n    \"        'out': tf.Variable(tf.truncated_normal([num_hidden_3, num_output], stddev=0.1),name='weight_4'),\\n\",\n    \"    }\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n  
 \"source\": [\n    \"The dimension of bias should be a number of neurons in the current layer. For instance, the\\n\",\n    \"dimension of bias b2 is the number of neurons in the hidden layer 2. We set the bias value\\n\",\n    \"as constant 0.1 in all layers:\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('biases'):\\n\",\n    \"\\n\",\n    \"    biases = {\\n\",\n    \"        'b1': tf.Variable(tf.constant(0.1, shape=[num_hidden1]),name='bias_1'),\\n\",\n    \"        'b2': tf.Variable(tf.constant(0.1, shape=[num_hidden2]),name='bias_2'),\\n\",\n    \"        'b3': tf.Variable(tf.constant(0.1, shape=[num_hidden_3]),name='bias_3'),\\n\",\n    \"        'out': tf.Variable(tf.constant(0.1, shape=[num_output]),name='bias_4')\\n\",\n    \"    }\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Forward Propagation\\n\",\n    \"\\n\",\n    \"Now, we define the forward propagation operation. 
We use relu activations in all layers\\n\",\n    \"and in the last layer we use sigmoid activation as defined below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('Model'):\\n\",\n    \"    \\n\",\n    \"    with tf.name_scope('layer1'):\\n\",\n    \"        layer_1 = tf.nn.relu(tf.add(tf.matmul(X, weights['w1']), biases['b1']) )   \\n\",\n    \"    \\n\",\n    \"    with tf.name_scope('layer2'):\\n\",\n    \"        layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1, weights['w2']), biases['b2']))\\n\",\n    \"        \\n\",\n    \"    with tf.name_scope('layer3'):\\n\",\n    \"        layer_3 = tf.nn.relu(tf.add(tf.matmul(layer_2, weights['w3']), biases['b3']))\\n\",\n    \"        \\n\",\n    \"    with tf.name_scope('output_layer'):\\n\",\n    \"         y_hat = tf.nn.sigmoid(tf.matmul(layer_3, weights['out']) + biases['out'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Compute Loss and Backpropagate\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Next, we define our loss function. We use softmax cross-entropy as our loss\\n\",\n    \"function. Tensorflow\\n\",\n    \"provides tf.nn.softmax_cross_entropy_with_logits() function for computing the\\n\",\n    \"softmax cross entropy loss. It takes the two parameters as inputs logits and labels.\\n\",\n    \"\\n\",\n    \"* logits implies the logits predicted by our network. That is, y_hat\\n\",\n    \"\\n\",\n    \"* labels imply the actual labels. 
That is, true labels y\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"We take the mean of the loss using tf.reduce_mean():\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('Loss'):\\n\",\n    \"    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_hat,labels=Y))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we need to minimize the loss using backpropagation. Don't worry! We don't have to\\n\",\n    \"calculate the derivatives of all the weights manually. Instead, we can use tensorflow's\\n\",\n    \"optimizers. In this section, we use the Adam optimizer. It is a variant of the gradient descent\\n\",\n    \"optimization technique we learned about in the previous chapter. In the next chapter, we will\\n\",\n    \"dive into the details and see how exactly Adam and several other optimizers work. For\\n\",\n    \"now, let's say we use the Adam optimizer as our backpropagation algorithm.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"tf.train.AdamOptimizer() requires the learning rate as input. So we set 1e-4 as the learning rate and minimize the loss with the minimize() function, which computes the gradients and updates the parameters (weights and biases) of our network.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"optimizer = tf.train.AdamOptimizer(1e-4).minimize(loss)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Compute Accuracy\\n\",\n    \"\\n\",\n    \"We calculate the accuracy of our model as follows. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"* y_hat denotes the predicted probability for each class by our model. Since we have 10 classes, we will have 10 probabilities. 
If the probability is high at position 7, then it means that our network predicts the input image as digit 7 with high probability. tf.argmax() returns the index of the largest value. Thus, tf.argmax(y_hat,1) gives the index where the probability is high. So, if the probability is high at index 7, it returns 7.\\n\",\n    \"<br>\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"* Y denotes the actual labels, and it is one-hot encoded. That is, it consists of zeros everywhere except at the position of the actual digit, where it is 1. For instance, if the input image is 7, then Y has 0 at all indices except at index 7, where it has 1. Thus, tf.argmax(Y,1) returns 7 because that is where we have the high value, that is, 1. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, tf.argmax(y_hat,1) gives the predicted digit and tf.argmax(Y,1) gives us the actual digit. \\n\",\n    \"\\n\",\n    \"tf.equal(x, y) takes x and y as inputs and returns the truth value of (x == y) element-wise. Thus, correct_pred = tf.equal(predicted_digit,actual_digit) consists of True where the actual and predicted digits are the same and False where they are not. We convert the boolean values in correct_pred into floats using tensorflow's cast operation, that is, tf.cast(correct_pred, tf.float32). 
After converting them into float values, we take the average using tf.reduce_mean().\\n\",\n    \"\\n\",\n    \"Thus, tf.reduce_mean(tf.cast(correct_pred, tf.float32)) gives us the average of the correct predictions.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('Accuracy'):\\n\",\n    \"    \\n\",\n    \"    predicted_digit = tf.argmax(y_hat, 1)\\n\",\n    \"    actual_digit = tf.argmax(Y, 1)\\n\",\n    \"    \\n\",\n    \"    correct_pred = tf.equal(predicted_digit,actual_digit)\\n\",\n    \"    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Create Summary\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also visualize how the loss and accuracy of our model change over the training\\n\",\n    \"iterations in tensorboard. So, we use tf.summary operations to get the summaries of the variables.\\n\",\n    \"Since the loss and accuracy are scalar variables, we use tf.summary.scalar() to store the summary, as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<tf.Tensor 'Loss_1:0' shape=() dtype=string>\"\n      ]\n     },\n     \"execution_count\": 14,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.summary.scalar(\\\"Accuracy\\\", accuracy)\\n\",\n    \"\\n\",\n    \"tf.summary.scalar(\\\"Loss\\\", loss)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, we merge all the summaries we use in our graph using tf.summary.merge_all(). 
We merge the summaries because running and storing many individual summaries would be inefficient; so, instead of running them multiple times, we merge them and run them once in our session. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"merge_summary = tf.summary.merge_all()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Train the Model\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now it is time to train our model. As we learned, first we need to initialize all the variables:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"init = tf.global_variables_initializer()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size and number of iterations:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 128\\n\",\n    \"num_iterations = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start the tensorflow session and perform training:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Iteration: 0, Loss: 2.30700993538, Accuracy: 0.1328125\\n\",\n      \"Iteration: 100, Loss: 1.76781439781, Accuracy: 0.7890625\\n\",\n      \"Iteration: 200, Loss: 1.6294002533, Accuracy: 0.8671875\\n\",\n      \"Iteration: 300, Loss: 1.56720340252, Accuracy: 0.9453125\\n\",\n      \"Iteration: 400, Loss: 1.55666518211, Accuracy: 0.9140625\\n\",\n      \"Iteration: 500, Loss: 1.54010999203, 
Accuracy: 0.9140625\\n\",\n      \"Iteration: 600, Loss: 1.54285383224, Accuracy: 0.9296875\\n\",\n      \"Iteration: 700, Loss: 1.52447938919, Accuracy: 0.9375\\n\",\n      \"Iteration: 800, Loss: 1.50830471516, Accuracy: 0.953125\\n\",\n      \"Iteration: 900, Loss: 1.55391788483, Accuracy: 0.921875\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"with tf.Session() as sess:\\n\",\n    \"\\n\",\n    \"    #run the initializer\\n\",\n    \"    sess.run(init)\\n\",\n    \"\\n\",\n    \"    #save the event files\\n\",\n    \"    summary_writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())\\n\",\n    \"\\n\",\n    \"    #train for some n number of iterations\\n\",\n    \"    for i in range(num_iterations):\\n\",\n    \"        \\n\",\n    \"        #get batch of data according to batch size\\n\",\n    \"        batch_x, batch_y = mnist.train.next_batch(batch_size)\\n\",\n    \"        \\n\",\n    \"        #train the network\\n\",\n    \"        sess.run(optimizer, feed_dict={\\n\",\n    \"            X: batch_x, Y: batch_y\\n\",\n    \"            })\\n\",\n    \"\\n\",\n    \"        #print loss and accuracy on every 100th iteration\\n\",\n    \"        if i % 100 == 0:\\n\",\n    \"            \\n\",\n    \"            #compute loss, accuracy and summary\\n\",\n    \"            batch_loss, batch_accuracy,summary = sess.run(\\n\",\n    \"                [loss, accuracy, merge_summary], feed_dict={X: batch_x, Y: batch_y}\\n\",\n    \"                )\\n\",\n    \"\\n\",\n    \"            #store all the summaries\\n\",\n    \"            summary_writer.add_summary(summary, i)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"            print('Iteration: {}, Loss: {}, Accuracy: {}'.format(i,batch_loss,batch_accuracy))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you may observe, the loss decreases and the accuracy increases over the training iterations. 
Now that we have learned how to build the neural network using tensorflow, in the next section we will see how we can visualize the computational graph of our model in tensorboard. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "08. A primer on TensorFlow/.ipynb_checkpoints/8.10 MNIST digits classification in TensorFlow 2.0-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# MNIST digit classification in TensorFlow 2.0\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we will see how we can perform MNIST handwritten digit classification using\\n\",\n    \"tensorflow 2.0. It takes hardly a few lines of code compared to tensorflow 1.x. As we learned,\\n\",\n    \"tensorflow 2.0 uses keras as its high-level API; we just need to add the tf.keras prefix to the keras\\n\",\n    \"code.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Import the libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import tensorflow as tf\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Load the dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mnist = tf.keras.datasets.mnist\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a train and test set:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz\\n\",\n      
\"11493376/11490434 [==============================] - 64s 6us/step\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"(x_train,y_train), (x_test, y_test) = mnist.load_data()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Normalize the x values by diving with maximum value of x which is 255 and convert them to float:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"x_train, x_test = tf.cast(x_train/255.0, tf.float32), tf.cast(x_test/255.0, tf.float32)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"convert y values to int:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"y_train, y_test = tf.cast(y_train,tf.int64),tf.cast(y_test,tf.int64)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the sequential model:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"model = tf.keras.models.Sequential()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Add the layers - We use a three-layered network. 
We apply ReLU activation in the first two layers and the softmax function in the final output layer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"model.add(tf.keras.layers.Flatten())\\n\",\n    \"model.add(tf.keras.layers.Dense(256, activation=\\\"relu\\\"))\\n\",\n    \"model.add(tf.keras.layers.Dense(128, activation=\\\"relu\\\"))\\n\",\n    \"model.add(tf.keras.layers.Dense(10, activation=\\\"softmax\\\"))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Compile the model with Stochastic Gradient Descent, that is, 'sgd' (we will learn about this in the next chapter), as the optimizer, sparse_categorical_crossentropy as the loss function, and accuracy as a metric:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the model for 10 epochs with a batch_size of 32:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Train on 60000 samples\\n\",\n      \"Epoch 1/10\\n\",\n      \"60000/60000 [==============================] - 7s 121us/sample - loss: 0.5882 - accuracy: 0.8490\\n\",\n      \"Epoch 2/10\\n\",\n      \"60000/60000 [==============================] - 5s 80us/sample - loss: 0.2747 - accuracy: 0.9221\\n\",\n      \"Epoch 3/10\\n\",\n      \"60000/60000 [==============================] - 5s 78us/sample - loss: 0.2232 - accuracy: 0.9371\\n\",\n      \"Epoch 4/10\\n\",\n      \"60000/60000 [==============================] - 5s 75us/sample - loss: 0.1900 - accuracy: 
0.9464\\n\",\n      \"Epoch 5/10\\n\",\n      \"60000/60000 [==============================] - 4s 72us/sample - loss: 0.1660 - accuracy: 0.9528\\n\",\n      \"Epoch 6/10\\n\",\n      \"60000/60000 [==============================] - 5s 84us/sample - loss: 0.1471 - accuracy: 0.9579\\n\",\n      \"Epoch 7/10\\n\",\n      \"60000/60000 [==============================] - 5s 78us/sample - loss: 0.1323 - accuracy: 0.9624\\n\",\n      \"Epoch 8/10\\n\",\n      \"60000/60000 [==============================] - 4s 75us/sample - loss: 0.1197 - accuracy: 0.9660\\n\",\n      \"Epoch 9/10\\n\",\n      \"60000/60000 [==============================] - 5s 79us/sample - loss: 0.1089 - accuracy: 0.9689\\n\",\n      \"Epoch 10/10\\n\",\n      \"60000/60000 [==============================] - 6s 97us/sample - loss: 0.0999 - accuracy: 0.9719\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<tensorflow.python.keras.callbacks.History at 0x7f4a7c2d5fd0>\"\n      ]\n     },\n     \"execution_count\": 11,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"model.fit(x_train, y_train, batch_size=32, epochs=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Evaluate the model on test sets:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"model.evaluate(x_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"That's it! 
Writing code with the Keras API is that simple.\\n\",\n    \"\\n\",\n    \"From the next chapter onwards, we will use TensorFlow 2.0.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "08. A primer on TensorFlow/8.05 Handwritten digits classification using TensorFlow.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Handwritten digits classification using TensorFlow\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Putting together all the concepts we have learned so far, we will see how we can use tensorflow to\\n\",\n    \"build a neural network to recognize handwritten digits. If you have been playing around with deep\\n\",\n    \"learning of late, then you must have come across the MNIST dataset. It is often called the hello\\n\",\n    \"world of deep learning. \\n\",\n    \"\\n\",\n    \"It consists of 55,000 data points of handwritten digits (0 to 9).\\n\",\n    \"In this section, we will see how we can use our neural network to recognize the\\n\",\n    \"handwritten digits, and we will also get the hang of tensorflow and tensorboard.\\n\",\n    \" \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Import required libraries\\n\",\n    \"\\n\",\n    \"As a first step, let us import all the required libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"import tensorflow as tf\\n\",\n    \"from tensorflow.examples.tutorials.mnist import input_data\\n\",\n    \"tf.logging.set_verbosity(tf.logging.ERROR)\\n\",\n    \"\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"%matplotlib inline\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"1.13.1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Load 
the Dataset\\n\",\n    \"\\n\",\n    \"In the code below, \\\"data/mnist\\\" specifies the location where we store the MNIST dataset, and\\n\",\n    \"one_hot=True means we are one-hot encoding the labels (0 to 9):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.\\n\",\n      \"Extracting data/mnist/train-images-idx3-ubyte.gz\\n\",\n      \"Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.\\n\",\n      \"Extracting data/mnist/train-labels-idx1-ubyte.gz\\n\",\n      \"Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.\\n\",\n      \"Extracting data/mnist/t10k-images-idx3-ubyte.gz\\n\",\n      \"Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.\\n\",\n      \"Extracting data/mnist/t10k-labels-idx1-ubyte.gz\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"mnist = input_data.read_data_sets(\\\"data/mnist\\\", one_hot=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check what we got in our data:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"No of images in training set (55000, 784)\\n\",\n      \"No of labels in training set (55000, 10)\\n\",\n      \"No of images in test set (10000, 784)\\n\",\n      \"No of labels in test set (10000, 10)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"No of images in training set {}\\\".format(mnist.train.images.shape))\\n\",\n    \"print(\\\"No of labels in training set {}\\\".format(mnist.train.labels.shape))\\n\",\n    \"\\n\",\n    \"print(\\\"No of images in test set {}\\\".format(mnist.test.images.shape))\\n\",\n    
\"print(\\\"No of labels in test set {}\\\".format(mnist.test.labels.shape))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We have 55,000 images in the training set and each image is of size 784 and we have 10 labels which are actually 0 to 9. Similarly, we have 10000 images in the test set.\\n\",\n    \"\\n\",\n    \"Now we plot one image to see how it looks like:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<matplotlib.image.AxesImage at 0x7f7bfa160bd0>\"\n      ]\n     },\n     \"execution_count\": 5,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    },\n    {\n     \"data\": {\n      \"image/png\": \"iVBORw0KGgoAAAANSUhEUgAAAP8AAAD8CAYAAAC4nHJkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADetJREFUeJzt3X+o1XWex/HXO2fsh4roertJo3snuSxUtI4cLBtZZmlnamLAJqImQQxCIybYoRGmHGGjP+KyrA1Cy5CzyWi4OUsqSsSuJUsmbIMnszJt14o7qPnjasFk/uF65z1/3K/Dre73c07n+z3ne+59Px9wued8398fb7718nvO+Zz7/Zi7C0A8l1XdAIBqEH4gKMIPBEX4gaAIPxAU4QeCIvxAUIQfCIrwA0F9o5MHmzVrlvf19XXykEAog4ODOnPmjDWzbqHwm9kdktZJmiTp39x9ILV+X1+f6vV6kUMCSKjVak2v2/LLfjObJOlfJf1Q0vWS7jez61vdH4DOKvKef6GkD9z9I3e/IGmLpCXltAWg3YqE/1pJR0c9P5Yt+wIzW2lmdTOrDw0NFTgcgDK1/dN+d1/v7jV3r/X09LT7cACaVCT8xyXNGfX8W9kyAONAkfDvk9RvZt82s8mSfiJpZzltAWi3lof63P2imT0i6b80MtS3wd3fK60zAG1VaJzf3V+W9HJJvQDoIL7eCwRF+IGgCD8QFOEHgiL8QFCEHwiK8ANBEX4gKMIPBEX4gaAIPxAU4QeCIvxAUIQfCIrwA0ERfiAowg8ERfiBoAg/EBThB4Ii/EBQhB8IivADQRF+ICjCDwRF+IGgCD8QFOEHgiL8QFCFZuk1s0FJn0kalnTR3WtlNAWg/QqFP/P37n6mhP0A6CBe9gNBFQ2/S9plZm+a2coyGgLQGUVf9i929+NmdrWkV8zsfXffM3qF7B+FlZI0d+7cgocDUJZCV353P579Pi1pu6SFY6yz3t1r7l7r6ekpcjgAJWo5/GY2xcymXXos6QeSDpbVGID2KvKyv1fSdjO7tJ9/d/f/LKUrAG3Xcvjd/SNJf1tiLwA6iKE+ICjCDwRF+IGgCD8QFOEHgiL8QFBl/FUfKvbqq6/m1rLvYeSaMWNGsn7wYPp7W4sWLUrW+/v7k3VUhys/EBThB4Ii/E
BQhB8IivADQRF+ICjCDwQ1Ycb59+zZk6y/8cYbyfratWvLbKejzp492/K2kyZNStYvXLiQrF911VXJ+tSpU3NrixcvTm77/PPPFzo20rjyA0ERfiAowg8ERfiBoAg/EBThB4Ii/EBQ42qcf2BgILe2Zs2a5LbDw8NltzMhFD0v58+fb7m+bdu25LaN7kWwcePGZH3KlCnJenRc+YGgCD8QFOEHgiL8QFCEHwiK8ANBEX4gqIbj/Ga2QdKPJJ129xuzZTMl/U5Sn6RBSfe6+6fta3PEs88+m1trNF59yy23JOvTpk1rqacy3Hbbbcn63Xff3aFOvr5du3Yl6+vWrcutHTlyJLnt1q1bW+rpkk2bNuXWuBdAc1f+30q640vLHpO02937Je3OngMYRxqG3933SPrkS4uXSLr09aqNku4quS8Abdbqe/5edz+RPT4pqbekfgB0SOEP/NzdJXle3cxWmlndzOpDQ0NFDwegJK2G/5SZzZak7PfpvBXdfb2719y91tPT0+LhAJSt1fDvlLQ8e7xc0o5y2gHQKQ3Db2YvSPofSX9jZsfM7EFJA5K+b2ZHJP1D9hzAOGIjb9k7o1areb1eb3n7M2fO5NY+/PDD5Lbz589P1i+//PKWekLap5/mf/2j0fcb3nrrrULH3rx5c25t6dKlhfbdrWq1mur1evpGCBm+4QcERfiBoAg/EBThB4Ii/EBQhB8IalwN9WFiaTRt+qJFiwrtv7c3/09OTp48WWjf3YqhPgANEX4gKMIPBEX4gaAIPxAU4QeCIvxAUIQfCIrwA0ERfiAowg8ERfiBoAg/EBThB4Ii/EBQDafoBorYsSN/Ppe9e/e29diff/55bu3o0aPJbefMmVN2O12HKz8QFOEHgiL8QFCEHwiK8ANBEX4gKMIPBNVwnN/MNkj6kaTT7n5jtuwJSSskDWWrrXb3l9vVJNLOnTuXW9u+fXty2zVr1pTdzhekxtPbPWdE6rzcdNNNyW1TU4tPFM1c+X8r6Y4xlv/K3ednPwQfGGcaht/d90j6pAO9AOigIu/5HzGzd8xsg5nNKK0jAB3Ravh/LWmepPmSTkham7eima00s7qZ1YeGhvJWA9BhLYXf3U+5+7C7/0nSbyQtTKy73t1r7l7r6elptU8AJWsp/GY2e9TTH0s6WE47ADqlmaG+FyR9T9IsMzsm6Z8kfc/M5ktySYOSHmpjjwDaoGH43f3+MRY/14Zewjp06FCyvm/fvmR9YGAgt/b++++31NNEt2rVqqpbqBzf8AOCIvxAUIQfCIrwA0ERfiAowg8Exa27S3D27Nlk/eGHH07WX3zxxWS9nX/6Om/evGT9mmuuKbT/Z555Jrc2efLk5LZLly5N1t9+++2WepKkuXPntrztRMGVHwiK8ANBEX4gKMIPBEX4gaAIPxAU4QeCYpy/SVu2bMmtPfnkk8ltDx8+nKxPmzYtWZ85c2ay/tRTT+XWGk013egW1tOnT0/W26nonZ9Svd9+++2F9j0RcOUHgiL8QFCEHwiK8ANBEX4gKMIPBEX4gaAY52/Sa6+9lltrNI7/wAMPJOurV69O1vv7+5P18er48ePJeqNbmjdyxRVX5NauvvrqQvueCLjyA0ERfiAowg8ERfiBoAg/EBThB4Ii/EBQDcf5zWyOpE2SeiW5pPXuvs7MZkr6naQ+SYOS7nX3T9vXarWefvrp3NqCBQuS265YsaLsdiaEo0ePJusff/xxof3fc889hbaf6Jq58l+U9HN3v17SLZJ+ambXS3pM0m5375e0O3sOYJxoGH53P+Hu+7PHn0k6LOlaSUskbcxW2yjprnY1CaB8X+s9v5n1SfqOpN9L6nX3E1nppEbeFgAYJ5oOv5lNlbRV0s/c/Y+jaz4ymdyYE8qZ2Uozq5tZfWhoqFCzAMrTVPjN7JsaCf5md9+WLT5lZrOz+mxJp8fa1t3Xu3vN3WtFb8gIoDwNw29mJuk5SYfdffRH3jslLc8eL5e0o/z2AL
RLM3/S+11JyyS9a2YHsmWrJQ1I+g8ze1DSHyTd254Wu8OVV16ZW2MorzWpP5NuRqNbmj/66KOF9j/RNQy/u++VZDnl28ptB0Cn8A0/ICjCDwRF+IGgCD8QFOEHgiL8QFDcuhttdfPNN+fW9u/fX2jf9913X7J+3XXXFdr/RMeVHwiK8ANBEX4gKMIPBEX4gaAIPxAU4QeCYpwfbZWavvzixYvJbWfMmJGsr1q1qqWeMIIrPxAU4QeCIvxAUIQfCIrwA0ERfiAowg8ExTg/Cnn99deT9fPnz+fWpk+fntz2pZdeStb5e/1iuPIDQRF+ICjCDwRF+IGgCD8QFOEHgiL8QFANx/nNbI6kTZJ6Jbmk9e6+zsyekLRC0lC26mp3f7ldjaIaw8PDyfrjjz+erE+ePDm3tmLFiuS2t956a7KOYpr5ks9FST939/1mNk3Sm2b2Slb7lbv/S/vaA9AuDcPv7ickncgef2ZmhyVd2+7GALTX13rPb2Z9kr4j6ffZokfM7B0z22BmY95zycxWmlndzOpDQ0NjrQKgAk2H38ymStoq6Wfu/kdJv5Y0T9J8jbwyWDvWdu6+3t1r7l7r6ekpoWUAZWgq/Gb2TY0Ef7O7b5Mkdz/l7sPu/idJv5G0sH1tAihbw/CbmUl6TtJhd3961PLZo1b7saSD5bcHoF2a+bT/u5KWSXrXzA5ky1ZLut/M5mtk+G9Q0kNt6RCVGvm3P99DD6X/sy9YsCC3dsMNN7TUE8rRzKf9eyWN9X8AY/rAOMY3/ICgCD8QFOEHgiL8QFCEHwiK8ANBcetuJF12Wfr6sGzZsg51grJx5QeCIvxAUIQfCIrwA0ERfiAowg8ERfiBoMzdO3cwsyFJfxi1aJakMx1r4Ovp1t66tS+J3lpVZm9/7e5N3S+vo+H/ysHN6u5eq6yBhG7trVv7kuitVVX1xst+ICjCDwRVdfjXV3z8lG7trVv7kuitVZX0Vul7fgDVqfrKD6AilYTfzO4ws/81sw/M7LEqeshjZoNm9q6ZHTCzesW9bDCz02Z2cNSymWb2ipkdyX6POU1aRb09YWbHs3N3wMzurKi3OWb232Z2yMzeM7N/zJZXeu4SfVVy3jr+st/MJkn6P0nfl3RM0j5J97v7oY42ksPMBiXV3L3yMWEz+ztJ5yRtcvcbs2X/LOkTdx/I/uGc4e6/6JLenpB0ruqZm7MJZWaPnlla0l2SHlCF5y7R172q4LxVceVfKOkDd//I3S9I2iJpSQV9dD133yPpky8tXiJpY/Z4o0b+5+m4nN66grufcPf92ePPJF2aWbrSc5foqxJVhP9aSUdHPT+m7pry2yXtMrM3zWxl1c2MoTebNl2STkrqrbKZMTScubmTvjSzdNecu1ZmvC4bH/h91WJ3XyDph5J+mr287Uo+8p6tm4Zrmpq5uVPGmFn6L6o8d63OeF22KsJ/XNKcUc+/lS3rCu5+PPt9WtJ2dd/sw6cuTZKa/T5dcT9/0U0zN481s7S64Nx104zXVYR/n6R+M/u2mU2W9BNJOyvo4yvMbEr2QYzMbIqkH6j7Zh/eKWl59ni5pB0V9vIF3TJzc97M0qr43HXdjNfu3vEfSXdq5BP/DyX9sooecvq6TtLb2c97Vfcm6QWNvAz8f418NvKgpL+StFvSEUmvSprZRb09L+ldSe9oJGizK+ptsUZe0r8j6UD2c2fV5y7RVyXnjW/4AUHxgR8QFOEHgiL8QFCEHwiK8ANBEX4gKMIPBEX4gaD+DP+BRwSusE7dAAAAAElFTkSuQmCC\\n\",\n      \"text/plain\": [\n       \"<Figure size 432x288 with 1 Axes>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"img1 = 
mnist.train.images[0].reshape(28,28)\\n\",\n    \"plt.imshow(img1, cmap='Greys')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Define the number of neurons in each layer\\n\",\n    \"\\n\",\n    \"We build a 4 layer neural network with 3 hidden layers and 1 output layer. Since the size of\\n\",\n    \"each input image is 784, we set num_input to 784, and since we have 10 handwritten\\n\",\n    \"digits (0 to 9), we set 10 neurons in the output layer. We define the number of neurons in\\n\",\n    \"each layer as follows:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#number of neurons in input layer\\n\",\n    \"num_input = 784  \\n\",\n    \"\\n\",\n    \"#number of neurons in hidden layer 1\\n\",\n    \"num_hidden1 = 512  \\n\",\n    \"\\n\",\n    \"#number of neurons in hidden layer 2\\n\",\n    \"num_hidden2 = 256  \\n\",\n    \"\\n\",\n    \"#number of neurons in hidden layer 3\\n\",\n    \"num_hidden_3 = 128  \\n\",\n    \"\\n\",\n    \"#number of neurons in output layer\\n\",\n    \"num_output = 10  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining placeholders\\n\",\n    \"\\n\",\n    \"As we learned, we first need to define the placeholders for input and output. 
Values for\\n\",\n    \"the placeholders will be fed at run time through feed_dict:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('input'):\\n\",\n    \"    X = tf.placeholder(\\\"float\\\", [None, num_input])\\n\",\n    \"\\n\",\n    \"with tf.name_scope('output'):\\n\",\n    \"    Y = tf.placeholder(\\\"float\\\", [None, num_output])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since we have a 4 layer network, we have 4 weights and 4 biases. We initialize our weights\\n\",\n    \"by drawing values from the truncated normal distribution with a standard deviation of\\n\",\n    \"0.1. \\n\",\n    \"\\n\",\n    \"Remember, the dimension of the weight matrix should be the number of neurons in the\\n\",\n    \"previous layer x the number of neurons in the current layer. For instance, the dimension of\\n\",\n    \"weight matrix w3 should be the number of neurons in hidden layer 2 x the number of\\n\",\n    \"neurons in hidden layer 3.\\n\",\n    \"\\n\",\n    \"We often define all the weights in a dictionary as given below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('weights'):\\n\",\n    \"    \\n\",\n    \"    weights = {\\n\",\n    \"        'w1': tf.Variable(tf.truncated_normal([num_input, num_hidden1], stddev=0.1),name='weight_1'),\\n\",\n    \"        'w2': tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1),name='weight_2'),\\n\",\n    \"        'w3': tf.Variable(tf.truncated_normal([num_hidden2, num_hidden_3], stddev=0.1),name='weight_3'),\\n\",\n    \"        'out': tf.Variable(tf.truncated_normal([num_hidden_3, num_output], stddev=0.1),name='weight_4'),\\n\",\n    \"    }\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n  
 \"source\": [\n    \"The dimension of the bias should be the number of neurons in the current layer. For instance, the\\n\",\n    \"dimension of bias b2 is the number of neurons in hidden layer 2. We set the bias value\\n\",\n    \"as a constant 0.1 in all layers:\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('biases'):\\n\",\n    \"\\n\",\n    \"    biases = {\\n\",\n    \"        'b1': tf.Variable(tf.constant(0.1, shape=[num_hidden1]),name='bias_1'),\\n\",\n    \"        'b2': tf.Variable(tf.constant(0.1, shape=[num_hidden2]),name='bias_2'),\\n\",\n    \"        'b3': tf.Variable(tf.constant(0.1, shape=[num_hidden_3]),name='bias_3'),\\n\",\n    \"        'out': tf.Variable(tf.constant(0.1, shape=[num_output]),name='bias_4')\\n\",\n    \"    }\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Forward Propagation\\n\",\n    \"\\n\",\n    \"Now, we define the forward propagation operation. 
We use ReLU activations in the hidden layers\\n\",\n    \"and sigmoid activation in the output layer, as defined below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('Model'):\\n\",\n    \"    \\n\",\n    \"    with tf.name_scope('layer1'):\\n\",\n    \"        layer_1 = tf.nn.relu(tf.add(tf.matmul(X, weights['w1']), biases['b1']))\\n\",\n    \"    \\n\",\n    \"    with tf.name_scope('layer2'):\\n\",\n    \"        layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1, weights['w2']), biases['b2']))\\n\",\n    \"        \\n\",\n    \"    with tf.name_scope('layer3'):\\n\",\n    \"        layer_3 = tf.nn.relu(tf.add(tf.matmul(layer_2, weights['w3']), biases['b3']))\\n\",\n    \"        \\n\",\n    \"    with tf.name_scope('output_layer'):\\n\",\n    \"        y_hat = tf.nn.sigmoid(tf.matmul(layer_3, weights['out']) + biases['out'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Compute Loss and Backpropagate\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Next, we define our loss function. We use softmax cross-entropy as our loss\\n\",\n    \"function. Tensorflow\\n\",\n    \"provides the tf.nn.softmax_cross_entropy_with_logits() function for computing the\\n\",\n    \"softmax cross entropy loss. It takes two parameters as inputs: logits and labels.\\n\",\n    \"\\n\",\n    \"* logits denotes the logits predicted by our network. That is, y_hat\\n\",\n    \"\\n\",\n    \"* labels denotes the actual labels. 
That is, the true labels y\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"We take the mean of the loss using tf.reduce_mean():\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('Loss'):\\n\",\n    \"    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_hat,labels=Y))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we need to minimize the loss using backpropagation. Don't worry! We don't have to\\n\",\n    \"calculate derivatives of all the weights manually. Instead, we can use tensorflow's\\n\",\n    \"optimizer. In this section, we use the Adam optimizer. It is a variant of the gradient descent\\n\",\n    \"optimization technique we learned in the previous chapter. In the next chapter, we will\\n\",\n    \"dive into detail and see exactly how Adam and several other optimizers work. For\\n\",\n    \"now, let's say we use the Adam optimizer as our backpropagation algorithm.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"tf.train.AdamOptimizer() requires the learning rate as input. So we set 1e-4 as the learning rate and minimize the loss with the minimize() function. It computes the gradients and updates the parameters (weights and biases) of our network.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"optimizer = tf.train.AdamOptimizer(1e-4).minimize(loss)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Compute Accuracy\\n\",\n    \"\\n\",\n    \"We calculate the accuracy of our model as follows: \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"* y_hat denotes the predicted probability for each class by our model. Since we have 10 classes, we will have 10 probabilities. 
If the probability is high at position 7, then it means that our network predicts the input image as digit 7 with high probability.  tf.argmax() returns the index of the largest value. Thus, tf.argmax(y_hat,1) gives the index where the probability is highest. So, if the probability is highest at index 7, it returns 7\\n\",\n    \"<br>\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"* Y denotes the actual labels and it is one-hot encoded. That is, it consists of zeros everywhere except at the position of the actual image where it consists of 1. For instance, if the input image is 7, then Y has 0 at all indices except at index 7 where it has 1. Thus, tf.argmax(Y,1) returns 7 because that is where we have the high value, i.e., 1. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, tf.argmax(y_hat,1) gives the predicted digit and tf.argmax(Y,1) gives us the actual digit. \\n\",\n    \"\\n\",\n    \"tf.equal(x, y) takes x and y as inputs and returns the truth value of (x == y) element-wise. Thus, correct_pred = tf.equal(predicted_digit,actual_digit) consists of True where the actual and predicted digits are the same and False where they are not. We convert the boolean values in correct_pred into float using tensorflow's cast operation. That is, tf.cast(correct_pred, tf.float32). 
 After converting into float values, we take the average using tf.reduce_mean().\\n\",\n    \"\\n\",\n    \"Thus, tf.reduce_mean(tf.cast(correct_pred, tf.float32)) gives us the fraction of correct predictions.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.name_scope('Accuracy'):\\n\",\n    \"    \\n\",\n    \"    predicted_digit = tf.argmax(y_hat, 1)\\n\",\n    \"    actual_digit = tf.argmax(Y, 1)\\n\",\n    \"    \\n\",\n    \"    correct_pred = tf.equal(predicted_digit,actual_digit)\\n\",\n    \"    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Create Summary\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also visualize how the loss and accuracy of our model change over the\\n\",\n    \"iterations in tensorboard. So, we use tf.summary() to get the summary of the variable.\\n\",\n    \"Since the loss and accuracy are scalar variables, we use tf.summary.scalar() to store the summary as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<tf.Tensor 'Loss_1:0' shape=() dtype=string>\"\n      ]\n     },\n     \"execution_count\": 14,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.summary.scalar(\\\"Accuracy\\\", accuracy)\\n\",\n    \"\\n\",\n    \"tf.summary.scalar(\\\"Loss\\\", loss)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, we merge all the summaries we use in our graph using tf.summary.merge_all(). 
Running and storing many individual summaries would be inefficient, so we merge them all and run them once in our session instead of running each summary separately. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"merge_summary = tf.summary.merge_all()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Train the Model\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now it is time to train our model. As we learned, first we need to initialize all the variables:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"init = tf.global_variables_initializer()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size and number of iterations:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 128\\n\",\n    \"num_iterations = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start the tensorflow session and perform training:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Iteration: 0, Loss: 2.30700993538, Accuracy: 0.1328125\\n\",\n      \"Iteration: 100, Loss: 1.76781439781, Accuracy: 0.7890625\\n\",\n      \"Iteration: 200, Loss: 1.6294002533, Accuracy: 0.8671875\\n\",\n      \"Iteration: 300, Loss: 1.56720340252, Accuracy: 0.9453125\\n\",\n      \"Iteration: 400, Loss: 1.55666518211, Accuracy: 0.9140625\\n\",\n      \"Iteration: 500, Loss: 1.54010999203, 
Accuracy: 0.9140625\\n\",\n      \"Iteration: 600, Loss: 1.54285383224, Accuracy: 0.9296875\\n\",\n      \"Iteration: 700, Loss: 1.52447938919, Accuracy: 0.9375\\n\",\n      \"Iteration: 800, Loss: 1.50830471516, Accuracy: 0.953125\\n\",\n      \"Iteration: 900, Loss: 1.55391788483, Accuracy: 0.921875\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"with tf.Session() as sess:\\n\",\n    \"\\n\",\n    \"    #run the initializer\\n\",\n    \"    sess.run(init)\\n\",\n    \"\\n\",\n    \"    #save the event files\\n\",\n    \"    summary_writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())\\n\",\n    \"\\n\",\n    \"    #train for num_iterations iterations\\n\",\n    \"    for i in range(num_iterations):\\n\",\n    \"        \\n\",\n    \"        #get batch of data according to batch size\\n\",\n    \"        batch_x, batch_y = mnist.train.next_batch(batch_size)\\n\",\n    \"        \\n\",\n    \"        #train the network\\n\",\n    \"        sess.run(optimizer, feed_dict={\\n\",\n    \"            X: batch_x, Y: batch_y\\n\",\n    \"            })\\n\",\n    \"\\n\",\n    \"        #print loss and accuracy on every 100th iteration\\n\",\n    \"        if i % 100 == 0:\\n\",\n    \"            \\n\",\n    \"            #compute loss, accuracy and summary\\n\",\n    \"            batch_loss, batch_accuracy, summary = sess.run(\\n\",\n    \"                [loss, accuracy, merge_summary], feed_dict={X: batch_x, Y: batch_y}\\n\",\n    \"                )\\n\",\n    \"\\n\",\n    \"            #store all the summaries\\n\",\n    \"            summary_writer.add_summary(summary, i)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"            print('Iteration: {}, Loss: {}, Accuracy: {}'.format(i,batch_loss,batch_accuracy))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you may observe, the loss decreases and the accuracy increases over the training iterations. 
Now that we have learned how to build the neural network using tensorflow, in the next section we will see how we can visualize the computational graph of our model in tensorboard. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "08. A primer on TensorFlow/8.08 Math operations in TensorFlow.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Math operations in TensorFlow\\n\",\n    \"\\n\",\n    \"We will explore some of the math operations in Tensorflow using Eager execution mode. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import tensorflow as tf\\n\",\n    \"tf.logging.set_verbosity(tf.logging.ERROR)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Enable Eager Execution mode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"tf.enable_eager_execution()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"Define x and y:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"x = tf.constant([1., 2., 3.])\\n\",\n    \"y = tf.constant([3., 2., 1.])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Basic Operations\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Addition\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([4., 4., 4.], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 4,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"sum = tf.add(x,y)\\n\",\n    \"sum.numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### 
Subtraction\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([-2.,  0.,  2.], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 5,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"difference = tf.subtract(x,y)\\n\",\n    \"difference.numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Multiplication\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([3., 4., 3.], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"product = tf.multiply(x,y)\\n\",\n    \"product.numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Division\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([0.33333334, 1.        , 3.        
], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"division = tf.divide(x,y)\\n\",\n    \"division.numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Square\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([1., 4., 9.], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 8,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"square = tf.square(x)\\n\",\n    \"square.numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Dot Product\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"10.0\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dot_product = tf.reduce_sum(tf.multiply(x, y))\\n\",\n    \"\\n\",\n    \"dot_product.numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Finding the index of min and max element\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"x = tf.constant([10, 0, 13, 9])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Index of minimum value:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"1\"\n      ]\n     },\n     \"execution_count\": 11,\n     \"metadata\": 
{},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.argmin(x).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Index of maximum value:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"2\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.argmax(x).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Squared Difference \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([  0,   4,  16,  36, 100], dtype=int32)\"\n      ]\n     },\n     \"execution_count\": 13,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"x = tf.Variable([1,3,5,7,11])\\n\",\n    \"y = tf.Variable([1])\\n\",\n    \"\\n\",\n    \"tf.math.squared_difference(x,y).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Power\\n\",\n    \"\\n\",\n    \"x^x\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([  1,   4,  27, 256], dtype=int32)\"\n      ]\n     },\n     \"execution_count\": 14,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"x = tf.Variable([1,2,3,4])\\n\",\n    \"tf.pow(x, x).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Rank of the matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   
\"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"TensorShape([Dimension(2), Dimension(2), Dimension(3)])\"\n      ]\n     },\n     \"execution_count\": 15,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"x = tf.constant([[[1,2,4],[3,4,5]],\\n\",\n    \"             [[1,2,4],[3,4,5]]])\\n\",\n    \"\\n\",\n    \"x.shape\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[[1, 2, 4],\\n\",\n       \"        [3, 4, 5]],\\n\",\n       \"\\n\",\n       \"       [[1, 2, 4],\\n\",\n       \"        [3, 4, 5]]], dtype=int32)\"\n      ]\n     },\n     \"execution_count\": 16,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"x.numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Reshape the matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"x = tf.constant([[1,2,3,4], [5,6,7,8]])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[1],\\n\",\n       \"       [2],\\n\",\n       \"       [3],\\n\",\n       \"       [4],\\n\",\n       \"       [5],\\n\",\n       \"       [6],\\n\",\n       \"       [7],\\n\",\n       \"       [8]], dtype=int32)\"\n      ]\n     },\n     \"execution_count\": 18,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.reshape(x,[8,1]).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Transpose the matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<tf.Tensor: id=60, shape=(4, 2), dtype=int32, numpy=\\n\",\n       \"array([[1, 5],\\n\",\n       \"       [2, 6],\\n\",\n       \"       [3, 7],\\n\",\n       \"       [4, 8]], dtype=int32)>\"\n      ]\n     },\n     \"execution_count\": 19,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.transpose(x)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Typecasting \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"tf.int32\"\n      ]\n     },\n     \"execution_count\": 20,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"x.dtype\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"tf.float32\"\n      ]\n     },\n     \"execution_count\": 21,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"x = tf.cast(x, dtype=tf.float32)\\n\",\n    \"\\n\",\n    \"x.dtype\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Concatenating two matrices\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"x = [[3,6,9], [7,7,7]]\\n\",\n    \"y = [[4,5,6], [5,5,5]]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Concatenate row-wise:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       
\"array([[3, 6, 9],\\n\",\n       \"       [7, 7, 7],\\n\",\n       \"       [4, 5, 6],\\n\",\n       \"       [5, 5, 5]], dtype=int32)\"\n      ]\n     },\n     \"execution_count\": 23,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.concat([x, y], 0).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Concatenate column-wise:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 24,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[3, 6, 9, 4, 5, 6],\\n\",\n       \"       [7, 7, 7, 5, 5, 5]], dtype=int32)\"\n      ]\n     },\n     \"execution_count\": 24,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.concat([x, y], 1).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Stack x matrix:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 25,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[3, 7],\\n\",\n       \"       [6, 7],\\n\",\n       \"       [9, 7]], dtype=int32)\"\n      ]\n     },\n     \"execution_count\": 25,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.stack(x, axis=1).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Reduce Mean\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"x = tf.Variable([[1.0, 5.0], [2.0, 3.0]])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[1., 5.],\\n\",\n       
\"       [2., 3.]], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 27,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"x.numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Compute the average of all values, i.e., (1.0 + 5.0 + 2.0 + 3.0) / 4\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 28,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"2.75\"\n      ]\n     },\n     \"execution_count\": 28,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.reduce_mean(input_tensor=x).numpy() \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Average down each column (axis=0), i.e., [(1.0+2.0)/2, (5.0+3.0)/2]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 29,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([1.5, 4. ], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 29,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.reduce_mean(input_tensor=x, axis=0).numpy() \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Average across each row (axis=1), i.e., [(1.0+5.0)/2.0, (2.0+3.0)/2.0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 30,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[3. 
],\\n\",\n       \"       [2.5]], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 30,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.reduce_mean(input_tensor=x, axis=1, keepdims=True).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Reduce Sum\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 31,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[1., 5.],\\n\",\n       \"       [2., 3.]], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 31,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"x.numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Sum values across the rows i.e  [(1.0+2.0),(5.0 + 3.0)]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 32,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([3., 8.], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 32,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.reduce_sum(x, 0).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Sum values across the columns i.e  [(1.0+5.0),(2.0 + 3.0)]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 33,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([6., 5.], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 33,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.reduce_sum(x, 1).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Sum all the 
values, i.e., 1.0 + 5.0 + 2.0 + 3.0:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 34,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"11.0\"\n      ]\n     },\n     \"execution_count\": 34,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.reduce_sum(x, [0, 1]).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"## Drawing Random Values\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Drawing values from the normal distribution:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 35,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[ 9.965095,  8.426764],\\n\",\n       \"       [10.50976 ,  9.429149],\\n\",\n       \"       [11.404266,  7.635334]], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 35,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.random.normal(shape=(3,2), mean=10.0, stddev=2.0).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Drawing values from the uniform distribution:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 36,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[0.1445148 , 0.13028955],\\n\",\n       \"       [0.8927735 , 0.89294124],\\n\",\n       \"       [0.65974724, 0.7600925 ]], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 36,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.random.uniform(shape=(3, 2), minval=0, maxval=1, dtype=tf.float32).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Create 0's and 1's\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 37,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[0., 0., 0., 0., 0.],\\n\",\n       \"       [0., 0., 0., 0., 0.],\\n\",\n       \"       [0., 0., 0., 0., 0.],\\n\",\n       \"       [0., 0., 0., 0., 0.],\\n\",\n       \"       [0., 0., 0., 0., 0.]], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 37,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.zeros([5,5]).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 38,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[0., 0.],\\n\",\n       \"       [0., 0.]], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 38,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.zeros_like(x).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 39,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[1., 1., 1., 1., 1.],\\n\",\n       \"       [1., 1., 1., 1., 1.],\\n\",\n       \"       [1., 1., 1., 1., 1.],\\n\",\n       \"       [1., 1., 1., 1., 1.],\\n\",\n       \"       [1., 1., 1., 1., 1.]], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 39,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.ones([5,5]).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 40,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[1., 1.],\\n\",\n       \"       [1., 1.]], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 40,\n     \"metadata\": {},\n     
\"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.ones_like(x).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"## Compute Softmax Probabilities\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 41,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([0.8756006 , 0.00589975, 0.11849965], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 41,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"x = tf.constant([7., 2., 5.])\\n\",\n    \"\\n\",\n    \"tf.nn.softmax(x).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Create Indentity matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 42,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[[1. 0. 0. 0. 0. 0. 0.]\\n\",\n      \" [0. 1. 0. 0. 0. 0. 0.]\\n\",\n      \" [0. 0. 1. 0. 0. 0. 0.]\\n\",\n      \" [0. 0. 0. 1. 0. 0. 0.]\\n\",\n      \" [0. 0. 0. 0. 1. 0. 0.]\\n\",\n      \" [0. 0. 0. 0. 0. 1. 0.]\\n\",\n      \" [0. 0. 0. 0. 0. 0. 
1.]]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"i_matrix = tf.eye(7)\\n\",\n    \"print(i_matrix.numpy())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"## L2 Normalization\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 43,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([0.79259396, 0.22645542, 0.56613857], dtype=float32)\"\n      ]\n     },\n     \"execution_count\": 43,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.math.l2_normalize(x, axis=None, epsilon=1e-12).numpy()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Gradient Computation\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 44,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"12.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"def square(x):\\n\",\n    \"  return tf.multiply(x, x)\\n\",\n    \"\\n\",\n    \"x = tf.Variable(6.)\\n\",\n    \"\\n\",\n    \"#record the computation on the tape\\n\",\n    \"with tf.GradientTape() as tape:\\n\",\n    \"    y = square(x)\\n\",\n    \"\\n\",\n    \"#compute dy/dx; the gradient of x*x at x = 6 is 2*x = 12\\n\",\n    \"print(tape.gradient(y, x).numpy())\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"TensorFlow offers a lot more than this. We will learn and explore various important functionalities of TensorFlow as we move forward through the book. 
In the next section, we will learn about a new version of TensorFlow called TensorFlow 2.0.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 2\",\n   \"language\": \"python\",\n   \"name\": \"python2\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 2\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython2\",\n   \"version\": \"2.7.12\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "08. A primer on TensorFlow/8.10 MNIST digits classification in TensorFlow 2.0.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# MNIST digit classification in TensorFlow 2.0\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we will see how can we perform the MNIST handwritten digits classification using\\n\",\n    \"tensorflow 2.0. It hardly a few lines of code compared to the tensorflow 1.x. As we learned,\\n\",\n    \"tensorflow 2.0 uses as keras as its high-level API, we just need to add tf.keras to the keras\\n\",\n    \"code.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Import the libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import tensorflow as tf\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Load the dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mnist =  tf.keras.datasets.mnist\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a train and test set:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz\\n\",\n      
\"11493376/11490434 [==============================] - 64s 6us/step\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"(x_train,y_train), (x_test, y_test) = mnist.load_data()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Normalize the x values by diving with maximum value of x which is 255 and convert them to float:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"x_train, x_test = tf.cast(x_train/255.0, tf.float32), tf.cast(x_test/255.0, tf.float32)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"convert y values to int:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"y_train, y_test = tf.cast(y_train,tf.int64),tf.cast(y_test,tf.int64)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the sequential model:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"model = tf.keras.models.Sequential()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Add the layers - We use a three-layered network. 
We apply the ReLU activation in the first two layers, and in the final output layer we apply the softmax function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"model.add(tf.keras.layers.Flatten())\\n\",\n    \"model.add(tf.keras.layers.Dense(256, activation=\\\"relu\\\"))\\n\",\n    \"model.add(tf.keras.layers.Dense(128, activation=\\\"relu\\\"))\\n\",\n    \"model.add(tf.keras.layers.Dense(10, activation=\\\"softmax\\\"))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Compile the model with stochastic gradient descent, that is, 'sgd' (we will learn about this in the next chapter), as the optimizer, sparse_categorical_crossentropy as the loss function, and accuracy as a metric:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the model for 10 epochs with a batch_size of 32:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Train on 60000 samples\\n\",\n      \"Epoch 1/10\\n\",\n      \"60000/60000 [==============================] - 7s 121us/sample - loss: 0.5882 - accuracy: 0.8490\\n\",\n      \"Epoch 2/10\\n\",\n      \"60000/60000 [==============================] - 5s 80us/sample - loss: 0.2747 - accuracy: 0.9221\\n\",\n      \"Epoch 3/10\\n\",\n      \"60000/60000 [==============================] - 5s 78us/sample - loss: 0.2232 - accuracy: 0.9371\\n\",\n      \"Epoch 4/10\\n\",\n      \"60000/60000 [==============================] - 5s 75us/sample - loss: 0.1900 - accuracy: 
0.9464\\n\",\n      \"Epoch 5/10\\n\",\n      \"60000/60000 [==============================] - 4s 72us/sample - loss: 0.1660 - accuracy: 0.9528\\n\",\n      \"Epoch 6/10\\n\",\n      \"60000/60000 [==============================] - 5s 84us/sample - loss: 0.1471 - accuracy: 0.9579\\n\",\n      \"Epoch 7/10\\n\",\n      \"60000/60000 [==============================] - 5s 78us/sample - loss: 0.1323 - accuracy: 0.9624\\n\",\n      \"Epoch 8/10\\n\",\n      \"60000/60000 [==============================] - 4s 75us/sample - loss: 0.1197 - accuracy: 0.9660\\n\",\n      \"Epoch 9/10\\n\",\n      \"60000/60000 [==============================] - 5s 79us/sample - loss: 0.1089 - accuracy: 0.9689\\n\",\n      \"Epoch 10/10\\n\",\n      \"60000/60000 [==============================] - 6s 97us/sample - loss: 0.0999 - accuracy: 0.9719\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<tensorflow.python.keras.callbacks.History at 0x7f4a7c2d5fd0>\"\n      ]\n     },\n     \"execution_count\": 11,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"model.fit(x_train, y_train, batch_size=32, epochs=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Evaluate the model on the test set:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"model.evaluate(x_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"That's it! 
Writing code with the Keras API is that simple.\\n\",\n    \"\\n\",\n    \"From the next chapter onward, we will use TensorFlow 2.0.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "08. A primer on TensorFlow/README.md",
    "content": "\n\n# [Chapter 8. Getting to Know TensorFlow](#)\n\n* 8.1. What is TensorFlow?\n* 8.2. Understanding Computational Graphs and Sessions\n\t* 8.2.1 Computational Graphs\n\t* 8.2.2 Sessions\n* 8.3. Variables, Constants, and Placeholders\n\t* 8.3.1. Variables\n\t* 8.3.2. Constants\n\t* 8.3.3. Placeholders and Feed Dictionaries\n* 8.4. Introducing TensorBoard\n\t* 8.4.1 Creating Name Scope in TensorBoard\n* 8.5. Handwritten digits classification using Tensorflow \n* 8.6. Visualizing Computational graph in TensorBord\n* 8.7. Introducing Eager execution\n* 8.8. Math operations in TensorFlow\n* 8.9. Tensorflow 8.0 and Keras\n\t* 8.9.1. Bonjour Keras\n\t* 8.9.2. Defining models in Keras\n\t* 8.9.3. Compiling the model\n\t* 8.9.4. Training the model\n\t* 8.9.5. Evaluating the model\n* 8.10. MNIST digits classification in Tensorflow 2.0"
  },
  {
    "path": "09.  Deep Q Network and its Variants/.ipynb_checkpoints/7.03. Playing Atari Games using DQN-Copy1-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"SyntaxError\",\n     \"evalue\": \"invalid syntax (<ipython-input-1-07de69cd5288>, line 3)\",\n     \"output_type\": \"error\",\n     \"traceback\": [\n      \"\\u001b[0;36m  File \\u001b[0;32m\\\"<ipython-input-1-07de69cd5288>\\\"\\u001b[0;36m, line \\u001b[0;32m3\\u001b[0m\\n\\u001b[0;31m    Atari 2600 is a popular video game console from a game company called Atari. The Atari game console provides several popular games such as pong, space invaders, Ms Pacman, break out, centipede and many more. In this section, we will learn how to build the deep Q network for playing the Atari games. First, let's understand the architecture of DQN for playing the Atari games.\\u001b[0m\\n\\u001b[0m             ^\\u001b[0m\\n\\u001b[0;31mSyntaxError\\u001b[0m\\u001b[0;31m:\\u001b[0m invalid syntax\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Playing Atari games using DQN \\n\",\n    \"\\n\",\n    \"Atari 2600 is a popular video game console from a game company called Atari. The Atari game console provides several popular games such as pong, space invaders, Ms Pacman, break out, centipede and many more. In this section, we will learn how to build the deep Q network for playing the Atari games. First, let's understand the architecture of DQN for playing the Atari games. \\n\",\n    \"\\n\",\n    \"## Architecture of DQN\\n\",\n    \"In the Atari environment, the image of the game screen is the state of the environment. So, we just feed the image of the game screen as an input to the DQN and it returns the Q value of all the actions in the state. 
Since we are dealing with the images, instead of using the vanilla deep neural network for approximating the Q value, we can use the convolutional neural network (CNN) since the convolutional neural network is very effective for handling images.\\n\",\n    \"\\n\",\n    \"Thus, now our DQN is the convolutional neural network. We feed the image of the game screen (game state) as an input to the convolutional neural network and it outputs the Q value of all the actions in the state.\\n\",\n    \"\\n\",\n    \"As shown in the below figure, given the image of the game screen (game state), the convolutional layers extract features from the image and produce a feature map. Next, we flatten the feature map and feed the flattened feature map as an input to the feedforward network. The feedforward network takes this flattened feature map as an input and returns the Q value of all the actions in the state: \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/4.png)\\n\",\n    \"\\n\",\n    \"Note that here we don't perform pooling operation. A pooling operation is useful when we perform tasks such as object detection, image classification and so on where we don't consider the position of the object in the image and we just want to know whether the desired object is present in the image. For example, if we want to identify whether there is a dog in an image, we only look for whether a dog is present in the image and we don't check the position of the dog in the image. Thus, in this case, pooling operation is used to identify whether there is a dog in the image irrespective of the position of the dog.\\n\",\n    \"\\n\",\n    \"But in our setting, pooling operation should not be performed. Because to understand the current game screen (state), the position is very important. For example, in a Pong game, we just don't want to classify if there is a ball in the game screen. We want to know the position of the ball so that we can take better action. 
Thus, we don't include the pooling operation in our DQN architecture. \\n\",\n    \"\\n\",\n    \"Now that we have understood the architecture of DQN to play the Atari games, in the next section, we will start implementing them. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Getting hands-on with DQN\\n\",\n    \"\\n\",\n    \"Let's implement DQN to play the Ms Pacman game. \\n\",\n    \"\\n\",\n    \"First, let's import the necessary\\n\",\n    \"libraries:\\n\",\n    \"\\n\",\n    \"import random\\n\",\n    \"import gym\\n\",\n    \"import numpy as np\\n\",\n    \"from collections import deque\\n\",\n    \"from tensorflow.keras.models import Sequential\\n\",\n    \"from tensorflow.keras.layers import Dense, Activation, Flatten, Conv2D, MaxPooling2D\\n\",\n    \"from tensorflow.keras.optimizers import Adam\\n\",\n    \"\\n\",\n    \"Now, let's create the Ms Pacman game environment using gym:\\n\",\n    \"\\n\",\n    \"env = gym.make(\\\"MsPacman-v0\\\")\\n\",\n    \"\\n\",\n    \"Set the state size:\\n\",\n    \"\\n\",\n    \"state_size = (88, 80, 1)\\n\",\n    \"\\n\",\n    \"Get the number of actions:\\n\",\n    \"\\n\",\n    \"action_size = env.action_space.n\\n\",\n    \"\\n\",\n    \"## Preprocess the game screen\\n\",\n    \"\\n\",\n    \"We learned that we feed the game state (image of the game screen) as an input to the DQN which is the convolutional neural network and it outputs the Q value of all the actions in the state. However, directly feeding the raw game screen image is not efficient, since the raw game screen size will be 210 x 160 x 3. This will be computationally expensive.\\n\",\n    \"\\n\",\n    \"To avoid this, we preprocess the game screen and then feed the preprocessed game screen to the DQN. First, we crop and resize the game screen image, convert the image to grayscale, normalize and then reshape the image to 88 x 80 x 1. 
Next, we feed this pre-processed game screen image as an input to the convolutional network (DQN) which returns the Q value.\\n\",\n    \"\\n\",\n    \"Now, let's define a function called preprocess_state which takes the game state (image of the game screen) as an input and returns the preprocessed game state (image of the game screen):\\n\",\n    \"\\n\",\n    \"color = np.array([210, 164, 74]).mean()\\n\",\n    \"\\n\",\n    \"def preprocess_state(state):\\n\",\n    \"\\n\",\n    \"    #crop and resize the image\\n\",\n    \"    image = state[1:176:2, ::2]\\n\",\n    \"\\n\",\n    \"    #convert the image to greyscale\\n\",\n    \"    image = image.mean(axis=2)\\n\",\n    \"\\n\",\n    \"    #improve image contrast\\n\",\n    \"    image[image==color] = 0\\n\",\n    \"\\n\",\n    \"    #normalize the image\\n\",\n    \"    image = (image - 128) / 128 - 1\\n\",\n    \"    \\n\",\n    \"    #reshape the image\\n\",\n    \"    image = np.expand_dims(image.reshape(88, 80, 1), axis=0)\\n\",\n    \"\\n\",\n    \"    return image\\n\",\n    \"\\n\",\n    \"## Building the DQN \\n\",\n    \"\\n\",\n    \"Now, let's build the deep Q network. We learned that for playing atari games we use the\\n\",\n    \"convolutional neural network as the DQN which takes the image of the game screen as an\\n\",\n    \"input and returns the Q values.\\n\",\n    \"\\n\",\n    \"We define the DQN with three convolutional layers. 
The convolutional layers extract the\\n\",\n    \"features from the image and output the feature maps and then we flattened the feature map\\n\",\n    \"obtained by the convolutional layers and feed the flattened feature maps to the feedforward\\n\",\n    \"network (fully connected layer) which returns the Q value:\\n\",\n    \"\\n\",\n    \"class DQN:\\n\",\n    \"    def __init__(self, state_size, action_size):\\n\",\n    \"        \\n\",\n    \"        #define the state size\\n\",\n    \"        self.state_size = state_size\\n\",\n    \"        \\n\",\n    \"        #define the action size\\n\",\n    \"        self.action_size = action_size\\n\",\n    \"        \\n\",\n    \"        #define the replay buffer\\n\",\n    \"        self.replay_buffer = deque(maxlen=5000)\\n\",\n    \"        \\n\",\n    \"        #define the discount factor\\n\",\n    \"        self.gamma = 0.9  \\n\",\n    \"        \\n\",\n    \"        #define the epsilon value\\n\",\n    \"        self.epsilon = 0.8   \\n\",\n    \"        \\n\",\n    \"        #define the update rate at which we want to update the target network\\n\",\n    \"        self.update_rate = 1000    \\n\",\n    \"        \\n\",\n    \"        #define the main network\\n\",\n    \"        self.main_network = self.build_network()\\n\",\n    \"        \\n\",\n    \"        #define the target network\\n\",\n    \"        self.target_network = self.build_network()\\n\",\n    \"        \\n\",\n    \"        #copy the weights of the main network to the target network\\n\",\n    \"        self.target_network.set_weights(self.main_network.get_weights())\\n\",\n    \"        \\n\",\n    \"\\n\",\n    \"    #Let's define a function called build_network which is essentially our DQN. 
\\n\",\n    \"\\n\",\n    \"    def build_network(self):\\n\",\n    \"        model = Sequential()\\n\",\n    \"        model.add(Conv2D(32, (8, 8), strides=4, padding='same', input_shape=self.state_size))\\n\",\n    \"        model.add(Activation('relu'))\\n\",\n    \"        \\n\",\n    \"        model.add(Conv2D(64, (4, 4), strides=2, padding='same'))\\n\",\n    \"        model.add(Activation('relu'))\\n\",\n    \"        \\n\",\n    \"        model.add(Conv2D(64, (3, 3), strides=1, padding='same'))\\n\",\n    \"        model.add(Activation('relu'))\\n\",\n    \"        model.add(Flatten())\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"        model.add(Dense(512, activation='relu'))\\n\",\n    \"        model.add(Dense(self.action_size, activation='linear'))\\n\",\n    \"        \\n\",\n    \"        model.compile(loss='mse', optimizer=Adam())\\n\",\n    \"\\n\",\n    \"        return model\\n\",\n    \"\\n\",\n    \"    #We learned that we train DQN by randomly sampling a minibatch of transitions from the\\n\",\n    \"    #replay buffer. So, we define a function called store_transition which stores the transition information\\n\",\n    \"    #into the replay buffer\\n\",\n    \"\\n\",\n    \"    def store_transistion(self, state, action, reward, next_state, done):\\n\",\n    \"        self.replay_buffer.append((state, action, reward, next_state, done))\\n\",\n    \"        \\n\",\n    \"\\n\",\n    \"    #We learned that in DQN, to take care of exploration-exploitation trade off, we select action\\n\",\n    \"    #using the epsilon-greedy policy. 
So, now we define the function called epsilon_greedy\\n\",\n    \"    #for selecting action using the epsilon-greedy policy.\\n\",\n    \"    \\n\",\n    \"    def epsilon_greedy(self, state):\\n\",\n    \"        if random.uniform(0,1) < self.epsilon:\\n\",\n    \"            return np.random.randint(self.action_size)\\n\",\n    \"        \\n\",\n    \"        Q_values = self.main_network.predict(state)\\n\",\n    \"        \\n\",\n    \"        return np.argmax(Q_values[0])\\n\",\n    \"\\n\",\n    \"    \\n\",\n    \"    #train the network\\n\",\n    \"    def train(self, batch_size):\\n\",\n    \"        \\n\",\n    \"        #sample a mini batch of transition from the replay buffer\\n\",\n    \"        minibatch = random.sample(self.replay_buffer, batch_size)\\n\",\n    \"        \\n\",\n    \"        #compute the Q value using the target network\\n\",\n    \"        for state, action, reward, next_state, done in minibatch:\\n\",\n    \"            if not done:\\n\",\n    \"                target_Q = (reward + self.gamma * np.amax(self.target_network.predict(next_state)))\\n\",\n    \"            else:\\n\",\n    \"                target_Q = reward\\n\",\n    \"                \\n\",\n    \"            #compute the Q value using the main network \\n\",\n    \"            Q_values = self.main_network.predict(state)\\n\",\n    \"            \\n\",\n    \"            Q_values[0][action] = target_Q\\n\",\n    \"            \\n\",\n    \"            #train the main network\\n\",\n    \"            self.main_network.fit(state, Q_values, epochs=1, verbose=0)\\n\",\n    \"            \\n\",\n    \"    #update the target network weights by copying from the main network\\n\",\n    \"    def update_target_network(self):\\n\",\n    \"        self.target_network.set_weights(self.main_network.get_weights())\\n\",\n    \"\\n\",\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's train the network. 
First, let's set the number of episodes we want to train the\\n\",\n    \"network:\\n\",\n    \"\\n\",\n    \"num_episodes = 500\\n\",\n    \"\\n\",\n    \"Define the number of time steps\\n\",\n    \"\\n\",\n    \"num_timesteps = 20000\\n\",\n    \"\\n\",\n    \"Define the batch size:\\n\",\n    \"\\n\",\n    \"batch_size = 8\\n\",\n    \"\\n\",\n    \"Set the number of past game screens we want to consider:\\n\",\n    \"\\n\",\n    \"num_screens = 4     \\n\",\n    \"\\n\",\n    \"Instantiate the DQN class\\n\",\n    \"\\n\",\n    \"dqn = DQN(state_size, action_size)\\n\",\n    \"\\n\",\n    \"done = False\\n\",\n    \"time_step = 0\\n\",\n    \"\\n\",\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #set return to 0\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #preprocess the game screen\\n\",\n    \"    state = preprocess_state(env.reset())\\n\",\n    \"\\n\",\n    \"    #for each step in the episode\\n\",\n    \"    for t in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #update the time step\\n\",\n    \"        time_step += 1\\n\",\n    \"        \\n\",\n    \"        #update the target network\\n\",\n    \"        if time_step % dqn.update_rate == 0:\\n\",\n    \"            dqn.update_target_network()\\n\",\n    \"        \\n\",\n    \"        #select the action\\n\",\n    \"        action = dqn.epsilon_greedy(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, _ = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #preprocess the next state\\n\",\n    \"        next_state = preprocess_state(next_state)\\n\",\n    \"        \\n\",\n    \"        #store the transition information\\n\",\n    \"        dqn.store_transistion(state, action, reward, next_state, done)\\n\",\n    \"        
\\n\",\n    \"        #update current state to next state\\n\",\n    \"        state = next_state\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"        \\n\",\n    \"        #if the episode is done then print the return\\n\",\n    \"        if done:\\n\",\n    \"            print('Episode: ',i, ',' 'Return', Return)\\n\",\n    \"            break\\n\",\n    \"            \\n\",\n    \"        #if the number of transistions in the replay buffer is greater than batch size\\n\",\n    \"        #then train the network\\n\",\n    \"        if len(dqn.replay_buffer) > batch_size:\\n\",\n    \"            dqn.train(batch_size)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now that we have learned how deep Q network works and how to build the DQN to play the Atari games, in the next section, we will learn an interesting variant of DQN called\\n\",\n    \"double DQN. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
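The epsilon-greedy selection and the Bellman-style target used in the DQN code above can be illustrated standalone with NumPy. This is a minimal sketch, independent of the notebook's `DQN` class: `q_values` here is a plain array standing in for the main network's predicted Q values.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    # explore: with probability epsilon pick a random action index
    if random.uniform(0, 1) < epsilon:
        return int(np.random.randint(len(q_values)))
    # exploit: otherwise pick the action with the highest Q value
    return int(np.argmax(q_values))

q = np.array([0.1, 0.9, 0.3])
print(epsilon_greedy(q, 0.0))  # → 1 (always greedy when epsilon is 0)
```

With `epsilon = 1.0` the choice is always random, so annealing epsilon downward over training shifts the agent from exploration toward exploitation.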
  {
    "path": "09.  Deep Q Network and its Variants/.ipynb_checkpoints/7.03. Playing Atari Games using DQN-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using DQN \\n\",\n    \"\\n\",\n    \"Atari 2600 is a popular video game console from a game company called Atari. The Atari game console provides several popular games, such as Pong, Space Invaders, Ms Pacman, Breakout, Centipede, and many more. In this section, we will learn how to build a deep Q network for playing Atari games. First, let's understand the architecture of the DQN for playing Atari games. \\n\",\n    \"\\n\",\n    \"## Architecture of DQN\\n\",\n    \"In the Atari environment, the image of the game screen is the state of the environment. So, we just feed the image of the game screen as an input to the DQN and it returns the Q values of all the actions in the state. Since we are dealing with images, instead of using a vanilla deep neural network to approximate the Q value, we use a convolutional neural network (CNN), since CNNs are very effective at handling images.\\n\",\n    \"\\n\",\n    \"Thus, our DQN is now a convolutional neural network. We feed the image of the game screen (the game state) as an input to the convolutional neural network and it outputs the Q values of all the actions in the state.\\n\",\n    \"\\n\",\n    \"As shown in the figure below, given the image of the game screen (game state), the convolutional layers extract features from the image and produce a feature map. Next, we flatten the feature map and feed it as an input to the feedforward network. The feedforward network takes this flattened feature map as an input and returns the Q values of all the actions in the state: \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/4.png)\\n\",\n    \"\\n\",\n    \"Note that here we don't perform a pooling operation. 
A pooling operation is useful in tasks such as image classification, where we don't care about the position of the object in the image and just want to know whether the desired object is present. For example, if we want to identify whether there is a dog in an image, we only check whether a dog is present, not where in the image it is. In this case, the pooling operation lets us detect the dog irrespective of its position.\n",\n    "\n",\n    "But in our setting, the pooling operation should not be performed, because to understand the current game screen (state), the positions of objects are very important. For example, in a Pong game, we don't just want to classify whether there is a ball in the game screen; we want to know the position of the ball so that we can take a better action. Thus, we don't include the pooling operation in our DQN architecture. \n",\n    "\n",\n    "Now that we have understood the architecture of the DQN to play Atari games, in the next section, we will start implementing it. \n",\n    "\n",\n    "\n",\n    "\n",\n    "## Getting hands-on with DQN\n",\n    "\n",\n    "Let's implement DQN to play the Ms Pacman game. 
\\n\",\n    \"\\n\",\n    \"First, let's import the necessary\\n\",\n    \"libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import random\\n\",\n    \"import gym\\n\",\n    \"import numpy as np\\n\",\n    \"from collections import deque\\n\",\n    \"from tensorflow.keras.models import Sequential\\n\",\n    \"from tensorflow.keras.layers import Dense, Activation, Flatten, Conv2D, MaxPooling2D\\n\",\n    \"from tensorflow.keras.optimizers import Adam\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create the Ms Pacman game environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"MsPacman-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the state size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_size = (88, 80, 1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the number of actions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_size = env.action_space.n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Preprocess the game screen\\n\",\n    \"\\n\",\n    \"We learned that we feed the game state (image of the game screen) as an input to the DQN which is the convolutional neural network and it outputs the Q value of all the actions in the state. However, directly feeding the raw game screen image is not efficient, since the raw game screen size will be 210 x 160 x 3. 
This will be computationally expensive.\n",\n    "\n",\n    "To avoid this, we preprocess the game screen and then feed the preprocessed game screen to the DQN. First, we crop and resize the game screen image, convert the image to grayscale, normalize it, and then reshape the image to 88 x 80 x 1. Next, we feed this preprocessed game screen image as an input to the convolutional network (DQN) which returns the Q values.\n",\n    "\n",\n    "Now, let's define a function called preprocess_state which takes the game state (image of the game screen) as an input and returns the preprocessed game state (image of the game screen):"\n   ]\n  },\n  {\n   "cell_type": "code",\n   "execution_count": 5,\n   "metadata": {},\n   "outputs": [],\n   "source": [\n    "color = np.array([210, 164, 74]).mean()\n",\n    "\n",\n    "def preprocess_state(state):\n",\n    "\n",\n    "    #crop and resize the image\n",\n    "    image = state[1:176:2, ::2]\n",\n    "\n",\n    "    #convert the image to grayscale\n",\n    "    image = image.mean(axis=2)\n",\n    "\n",\n    "    #improve image contrast\n",\n    "    image[image==color] = 0\n",\n    "\n",\n    "    #normalize the image to the range [-1, 1]\n",\n    "    image = (image - 128) / 128\n",\n    "    \n",\n    "    #reshape the image\n",\n    "    image = np.expand_dims(image.reshape(88, 80, 1), axis=0)\n",\n    "\n",\n    "    return image"\n   ]\n  },\n  {\n   "cell_type": "markdown",\n   "metadata": {},\n   "source": [\n    "## Building the DQN \n",\n    "\n",\n    "Now, let's build the deep Q network. We learned that for playing Atari games we use the\n",\n    "convolutional neural network as the DQN which takes the image of the game screen as an\n",\n    "input and returns the Q values.\n",\n    "\n",\n    "We define the DQN with three convolutional layers. 
The convolutional layers extract the\n",\n    "features from the image and output feature maps. We then flatten the feature maps\n",\n    "obtained by the convolutional layers and feed them to the feedforward\n",\n    "network (fully connected layer), which returns the Q values:"\n   ]\n  },\n  {\n   "cell_type": "code",\n   "execution_count": 6,\n   "metadata": {},\n   "outputs": [],\n   "source": [\n    "class DQN:\n",\n    "    def __init__(self, state_size, action_size):\n",\n    "        \n",\n    "        #define the state size\n",\n    "        self.state_size = state_size\n",\n    "        \n",\n    "        #define the action size\n",\n    "        self.action_size = action_size\n",\n    "        \n",\n    "        #define the replay buffer\n",\n    "        self.replay_buffer = deque(maxlen=5000)\n",\n    "        \n",\n    "        #define the discount factor\n",\n    "        self.gamma = 0.9  \n",\n    "        \n",\n    "        #define the epsilon value\n",\n    "        self.epsilon = 0.8   \n",\n    "        \n",\n    "        #define the update rate at which we want to update the target network\n",\n    "        self.update_rate = 1000    \n",\n    "        \n",\n    "        #define the main network\n",\n    "        self.main_network = self.build_network()\n",\n    "        \n",\n    "        #define the target network\n",\n    "        self.target_network = self.build_network()\n",\n    "        \n",\n    "        #copy the weights of the main network to the target network\n",\n    "        self.target_network.set_weights(self.main_network.get_weights())\n",\n    "        \n",\n    "\n",\n    "    #Let's define a function called build_network which is essentially our DQN. 
\\n\",\n    \"\\n\",\n    \"    def build_network(self):\\n\",\n    \"        model = Sequential()\\n\",\n    \"        model.add(Conv2D(32, (8, 8), strides=4, padding='same', input_shape=self.state_size))\\n\",\n    \"        model.add(Activation('relu'))\\n\",\n    \"        \\n\",\n    \"        model.add(Conv2D(64, (4, 4), strides=2, padding='same'))\\n\",\n    \"        model.add(Activation('relu'))\\n\",\n    \"        \\n\",\n    \"        model.add(Conv2D(64, (3, 3), strides=1, padding='same'))\\n\",\n    \"        model.add(Activation('relu'))\\n\",\n    \"        model.add(Flatten())\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"        model.add(Dense(512, activation='relu'))\\n\",\n    \"        model.add(Dense(self.action_size, activation='linear'))\\n\",\n    \"        \\n\",\n    \"        model.compile(loss='mse', optimizer=Adam())\\n\",\n    \"\\n\",\n    \"        return model\\n\",\n    \"\\n\",\n    \"    #We learned that we train DQN by randomly sampling a minibatch of transitions from the\\n\",\n    \"    #replay buffer. So, we define a function called store_transition which stores the transition information\\n\",\n    \"    #into the replay buffer\\n\",\n    \"\\n\",\n    \"    def store_transistion(self, state, action, reward, next_state, done):\\n\",\n    \"        self.replay_buffer.append((state, action, reward, next_state, done))\\n\",\n    \"        \\n\",\n    \"\\n\",\n    \"    #We learned that in DQN, to take care of exploration-exploitation trade off, we select action\\n\",\n    \"    #using the epsilon-greedy policy. 
So, now we define a function called epsilon_greedy\n",\n    "    #for selecting actions using the epsilon-greedy policy.\n",\n    "    \n",\n    "    def epsilon_greedy(self, state):\n",\n    "        if random.uniform(0,1) < self.epsilon:\n",\n    "            return np.random.randint(self.action_size)\n",\n    "        \n",\n    "        Q_values = self.main_network.predict(state)\n",\n    "        \n",\n    "        return np.argmax(Q_values[0])\n",\n    "\n",\n    "    \n",\n    "    #train the network\n",\n    "    def train(self, batch_size):\n",\n    "        \n",\n    "        #sample a minibatch of transitions from the replay buffer\n",\n    "        minibatch = random.sample(self.replay_buffer, batch_size)\n",\n    "        \n",\n    "        #compute the target Q value using the target network\n",\n    "        for state, action, reward, next_state, done in minibatch:\n",\n    "            if not done:\n",\n    "                target_Q = (reward + self.gamma * np.amax(self.target_network.predict(next_state)))\n",\n    "            else:\n",\n    "                target_Q = reward\n",\n    "                \n",\n    "            #compute the predicted Q values using the main network \n",\n    "            Q_values = self.main_network.predict(state)\n",\n    "            \n",\n    "            Q_values[0][action] = target_Q\n",\n    "            \n",\n    "            #train the main network\n",\n    "            self.main_network.fit(state, Q_values, epochs=1, verbose=0)\n",\n    "            \n",\n    "    #update the target network weights by copying from the main network\n",\n    "    def update_target_network(self):\n",\n    "        self.target_network.set_weights(self.main_network.get_weights())"\n   ]\n  },\n  {\n   "cell_type": "markdown",\n   "metadata": {},\n   "source": [\n    "## Training the network\n",\n    "\n",\n    "Now, let's train 
the network. First, let's set the number of episodes we want to train the\\n\",\n    \"network:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 500\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 20000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 8\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the number of past game screens we want to consider:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_screens = 4     \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the DQN class\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"dqn = DQN(state_size, action_size)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"done = False\\n\",\n    \"time_step = 0\\n\",\n    \"\\n\",\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #set return to 0\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #preprocess the game screen\\n\",\n    \"    state = preprocess_state(env.reset())\\n\",\n    \"\\n\",\n    \"    #for each step in the 
episode\\n\",\n    \"    for t in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #update the time step\\n\",\n    \"        time_step += 1\\n\",\n    \"        \\n\",\n    \"        #update the target network\\n\",\n    \"        if time_step % dqn.update_rate == 0:\\n\",\n    \"            dqn.update_target_network()\\n\",\n    \"        \\n\",\n    \"        #select the action\\n\",\n    \"        action = dqn.epsilon_greedy(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, _ = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #preprocess the next state\\n\",\n    \"        next_state = preprocess_state(next_state)\\n\",\n    \"        \\n\",\n    \"        #store the transition information\\n\",\n    \"        dqn.store_transistion(state, action, reward, next_state, done)\\n\",\n    \"        \\n\",\n    \"        #update current state to next state\\n\",\n    \"        state = next_state\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"        \\n\",\n    \"        #if the episode is done then print the return\\n\",\n    \"        if done:\\n\",\n    \"            print('Episode: ',i, ',' 'Return', Return)\\n\",\n    \"            break\\n\",\n    \"            \\n\",\n    \"        #if the number of transistions in the replay buffer is greater than batch size\\n\",\n    \"        #then train the network\\n\",\n    \"        if len(dqn.replay_buffer) > batch_size:\\n\",\n    \"            dqn.train(batch_size)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have learned how deep Q network works and how to build the DQN to play the Atari games, in the next section, we will learn an interesting variant of DQN called\\n\",\n    
\"double DQN. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
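The preprocessing pipeline described in the notebook above can be exercised without the Atari emulator by feeding a synthetic 210 x 160 x 3 frame. This is a sketch using the same crop/downsample indexing as the notebook; the random frame is just a stand-in for a real game screen, and the normalization here maps pixel values into roughly [-1, 1].

```python
import numpy as np

# mean of the Ms Pacman background color used to boost contrast
color = np.array([210, 164, 74]).mean()

def preprocess_state(state):
    # crop the playing area and downsample by 2: 210 x 160 -> 88 x 80
    image = state[1:176:2, ::2]
    # convert to grayscale by averaging the color channels
    image = image.mean(axis=2)
    # zero out the background color to improve contrast
    image[image == color] = 0
    # normalize pixel values to roughly [-1, 1]
    image = (image - 128) / 128
    # add a batch dimension: shape becomes (1, 88, 80, 1)
    return np.expand_dims(image.reshape(88, 80, 1), axis=0)

frame = np.random.randint(0, 256, (210, 160, 3)).astype(np.uint8)
print(preprocess_state(frame).shape)  # → (1, 88, 80, 1)
```

The leading batch dimension is what lets the result be passed directly to Keras' `predict` without further reshaping.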
  {
    "path": "09.  Deep Q Network and its Variants/.ipynb_checkpoints/9.03. Playing Atari Games using DQN-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using DQN \\n\",\n    \"\\n\",\n    \"Atari 2600 is a popular video game console from a game company called Atari. The Atari game console provides several popular games, such as Pong, Space Invaders, Ms Pacman, Breakout, Centipede, and many more. In this section, we will learn how to build a deep Q network for playing Atari games. First, let's understand the architecture of the DQN for playing Atari games. \\n\",\n    \"\\n\",\n    \"## Architecture of DQN\\n\",\n    \"In the Atari environment, the image of the game screen is the state of the environment. So, we just feed the image of the game screen as an input to the DQN and it returns the Q values of all the actions in the state. Since we are dealing with images, instead of using a vanilla deep neural network to approximate the Q value, we use a convolutional neural network (CNN), since CNNs are very effective at handling images.\\n\",\n    \"\\n\",\n    \"Thus, our DQN is now a convolutional neural network. We feed the image of the game screen (the game state) as an input to the convolutional neural network and it outputs the Q values of all the actions in the state.\\n\",\n    \"\\n\",\n    \"As shown in the figure below, given the image of the game screen (game state), the convolutional layers extract features from the image and produce a feature map. Next, we flatten the feature map and feed it as an input to the feedforward network. The feedforward network takes this flattened feature map as an input and returns the Q values of all the actions in the state: \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"![title](Images/4.png)\\n\",\n    \"\\n\",\n    \"Note that here we don't perform a pooling operation. 
A pooling operation is useful in tasks such as image classification, where we don't care about the position of the object in the image and just want to know whether the desired object is present. For example, if we want to identify whether there is a dog in an image, we only check whether a dog is present, not where in the image it is. In this case, the pooling operation lets us detect the dog irrespective of its position.\n",\n    "\n",\n    "But in our setting, the pooling operation should not be performed, because to understand the current game screen (state), the positions of objects are very important. For example, in a Pong game, we don't just want to classify whether there is a ball in the game screen; we want to know the position of the ball so that we can take a better action. Thus, we don't include the pooling operation in our DQN architecture. \n",\n    "\n",\n    "Now that we have understood the architecture of the DQN to play Atari games, in the next section, we will start implementing it. \n",\n    "\n",\n    "\n",\n    "\n",\n    "## Getting hands-on with DQN\n",\n    "\n",\n    "Let's implement DQN to play the Ms Pacman game. 
\\n\",\n    \"\\n\",\n    \"First, let's import the necessary\\n\",\n    \"libraries:\\n\",\n    \"\\n\",\n    \"Note that we use TensorFlow version 2.0\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import random\\n\",\n    \"import gym\\n\",\n    \"import numpy as np\\n\",\n    \"from collections import deque\\n\",\n    \"from tensorflow.keras.models import Sequential\\n\",\n    \"from tensorflow.keras.layers import Dense, Activation, Flatten, Conv2D, MaxPooling2D\\n\",\n    \"from tensorflow.keras.optimizers import Adam\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create the Ms Pacman game environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"MsPacman-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the state size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_size = (88, 80, 1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the number of actions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_size = env.action_space.n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Preprocess 
the game screen\\n\",\n    \"\\n\",\n    \"We learned that we feed the game state (image of the game screen) as an input to the DQN which is the convolutional neural network and it outputs the Q value of all the actions in the state. However, directly feeding the raw game screen image is not efficient, since the raw game screen size will be 210 x 160 x 3. This will be computationally expensive.\\n\",\n    \"\\n\",\n    \"To avoid this, we preprocess the game screen and then feed the preprocessed game screen to the DQN. First, we crop and resize the game screen image, convert the image to grayscale, normalize and then reshape the image to 88 x 80 x 1. Next, we feed this pre-processed game screen image as an input to the convolutional network (DQN) which returns the Q value.\\n\",\n    \"\\n\",\n    \"Now, let's define a function called preprocess_state which takes the game state (image of the game screen) as an input and returns the preprocessed game state (image of the game screen):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"color = np.array([210, 164, 74]).mean()\\n\",\n    \"\\n\",\n    \"def preprocess_state(state):\\n\",\n    \"\\n\",\n    \"    #crop and resize the image\\n\",\n    \"    image = state[1:176:2, ::2]\\n\",\n    \"\\n\",\n    \"    #convert the image to greyscale\\n\",\n    \"    image = image.mean(axis=2)\\n\",\n    \"\\n\",\n    \"    #improve image contrast\\n\",\n    \"    image[image==color] = 0\\n\",\n    \"\\n\",\n    \"    #normalize the image\\n\",\n    \"    image = (image - 128) / 128 - 1\\n\",\n    \"    \\n\",\n    \"    #reshape the image\\n\",\n    \"    image = np.expand_dims(image.reshape(88, 80, 1), axis=0)\\n\",\n    \"\\n\",\n    \"    return image\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Building the DQN \\n\",\n    \"\\n\",\n    \"Now, let's build the deep Q 
network. We learned that for playing Atari games we use the\n",\n    "convolutional neural network as the DQN which takes the image of the game screen as an\n",\n    "input and returns the Q values.\n",\n    "\n",\n    "We define the DQN with three convolutional layers. The convolutional layers extract the\n",\n    "features from the image and output feature maps. We then flatten the feature maps\n",\n    "obtained by the convolutional layers and feed them to the feedforward\n",\n    "network (fully connected layer), which returns the Q values:"\n   ]\n  },\n  {\n   "cell_type": "code",\n   "execution_count": 7,\n   "metadata": {},\n   "outputs": [],\n   "source": [\n    "class DQN:\n",\n    "    def __init__(self, state_size, action_size):\n",\n    "        \n",\n    "        #define the state size\n",\n    "        self.state_size = state_size\n",\n    "        \n",\n    "        #define the action size\n",\n    "        self.action_size = action_size\n",\n    "        \n",\n    "        #define the replay buffer\n",\n    "        self.replay_buffer = deque(maxlen=5000)\n",\n    "        \n",\n    "        #define the discount factor\n",\n    "        self.gamma = 0.9  \n",\n    "        \n",\n    "        #define the epsilon value\n",\n    "        self.epsilon = 0.8   \n",\n    "        \n",\n    "        #define the update rate at which we want to update the target network\n",\n    "        self.update_rate = 1000    \n",\n    "        \n",\n    "        #define the main network\n",\n    "        self.main_network = self.build_network()\n",\n    "        \n",\n    "        #define the target network\n",\n    "        self.target_network = self.build_network()\n",\n    "        \n",\n    "        #copy the weights of the main network to the target network\n",\n    "        
self.target_network.set_weights(self.main_network.get_weights())\n",\n    "        \n",\n    "\n",\n    "    #Let's define a function called build_network which is essentially our DQN. \n",\n    "\n",\n    "    def build_network(self):\n",\n    "        model = Sequential()\n",\n    "        model.add(Conv2D(32, (8, 8), strides=4, padding='same', input_shape=self.state_size))\n",\n    "        model.add(Activation('relu'))\n",\n    "        \n",\n    "        model.add(Conv2D(64, (4, 4), strides=2, padding='same'))\n",\n    "        model.add(Activation('relu'))\n",\n    "        \n",\n    "        model.add(Conv2D(64, (3, 3), strides=1, padding='same'))\n",\n    "        model.add(Activation('relu'))\n",\n    "        model.add(Flatten())\n",\n    "\n",\n    "\n",\n    "        model.add(Dense(512, activation='relu'))\n",\n    "        model.add(Dense(self.action_size, activation='linear'))\n",\n    "        \n",\n    "        model.compile(loss='mse', optimizer=Adam())\n",\n    "\n",\n    "        return model\n",\n    "\n",\n    "    #We learned that we train DQN by randomly sampling a minibatch of transitions from the\n",\n    "    #replay buffer. So, we define a function called store_transistion which stores the transition information\n",\n    "    #into the replay buffer\n",\n    "\n",\n    "    def store_transistion(self, state, action, reward, next_state, done):\n",\n    "        self.replay_buffer.append((state, action, reward, next_state, done))\n",\n    "        \n",\n    "\n",\n    "    #We learned that in DQN, to take care of the exploration-exploitation trade-off, we select actions\n",\n    "    #using the epsilon-greedy policy. 
So, now we define a function called epsilon_greedy\n",\n    "    #for selecting actions using the epsilon-greedy policy.\n",\n    "    \n",\n    "    def epsilon_greedy(self, state):\n",\n    "        if random.uniform(0,1) < self.epsilon:\n",\n    "            return np.random.randint(self.action_size)\n",\n    "        \n",\n    "        Q_values = self.main_network.predict(state)\n",\n    "        \n",\n    "        return np.argmax(Q_values[0])\n",\n    "\n",\n    "    \n",\n    "    #train the network\n",\n    "    def train(self, batch_size):\n",\n    "        \n",\n    "        #sample a minibatch of transitions from the replay buffer\n",\n    "        minibatch = random.sample(self.replay_buffer, batch_size)\n",\n    "        \n",\n    "        #compute the target Q value using the target network\n",\n    "        for state, action, reward, next_state, done in minibatch:\n",\n    "            if not done:\n",\n    "                target_Q = (reward + self.gamma * np.amax(self.target_network.predict(next_state)))\n",\n    "            else:\n",\n    "                target_Q = reward\n",\n    "                \n",\n    "            #compute the predicted Q values using the main network \n",\n    "            Q_values = self.main_network.predict(state)\n",\n    "            \n",\n    "            Q_values[0][action] = target_Q\n",\n    "            \n",\n    "            #train the main network\n",\n    "            self.main_network.fit(state, Q_values, epochs=1, verbose=0)\n",\n    "            \n",\n    "    #update the target network weights by copying from the main network\n",\n    "    def update_target_network(self):\n",\n    "        self.target_network.set_weights(self.main_network.get_weights())"\n   ]\n  },\n  {\n   "cell_type": "markdown",\n   "metadata": {},\n   "source": [\n    "## Training the network\n",\n    "\n",\n    "Now, let's train 
the network. First, let's set the number of episodes we want to train the\\n\",\n    \"network:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 500\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 20000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 8\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the number of past game screens we want to consider:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_screens = 4     \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the DQN class\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"dqn = DQN(state_size, action_size)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"done = False\\n\",\n    \"time_step = 0\\n\",\n    \"\\n\",\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #set return to 0\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #preprocess the game screen\\n\",\n    \"    state = preprocess_state(env.reset())\\n\",\n    \"\\n\",\n    \"    #for each step in the 
episode\\n\",\n    \"    for t in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #update the time step\\n\",\n    \"        time_step += 1\\n\",\n    \"        \\n\",\n    \"        #update the target network\\n\",\n    \"        if time_step % dqn.update_rate == 0:\\n\",\n    \"            dqn.update_target_network()\\n\",\n    \"        \\n\",\n    \"        #select the action\\n\",\n    \"        action = dqn.epsilon_greedy(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, _ = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #preprocess the next state\\n\",\n    \"        next_state = preprocess_state(next_state)\\n\",\n    \"        \\n\",\n    \"        #store the transition information\\n\",\n    \"        dqn.store_transistion(state, action, reward, next_state, done)\\n\",\n    \"        \\n\",\n    \"        #update current state to next state\\n\",\n    \"        state = next_state\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"        \\n\",\n    \"        #if the episode is done then print the return\\n\",\n    \"        if done:\\n\",\n    \"            print('Episode: ',i, ',' 'Return', Return)\\n\",\n    \"            break\\n\",\n    \"            \\n\",\n    \"        #if the number of transistions in the replay buffer is greater than batch size\\n\",\n    \"        #then train the network\\n\",\n    \"        if len(dqn.replay_buffer) > batch_size:\\n\",\n    \"            dqn.train(batch_size)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have learned how deep Q network works and how to build the DQN to play the Atari games, in the next section, we will learn an interesting variant of DQN called\\n\",\n    
\"double DQN. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "09.  Deep Q Network and its Variants/9.03. Playing Atari Games using DQN.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using DQN \n\",\n    \"\n\",\n    \"Atari 2600 is a popular video game console from a game company called Atari. The Atari game console provides several popular games, such as Pong, Space Invaders, Ms Pacman, Breakout, Centipede, and many more. In this section, we will learn how to build a deep Q network for playing Atari games. First, let's understand the architecture of the DQN for playing Atari games. \n\",\n    \"\n\",\n    \"## Architecture of DQN\n\",\n    \"In the Atari environment, the image of the game screen is the state of the environment. So, we just feed the image of the game screen as an input to the DQN and it returns the Q value of all the actions in the state. Since we are dealing with images, instead of using a vanilla deep neural network to approximate the Q value, we can use a convolutional neural network (CNN), since CNNs are very effective at handling images.\n\",\n    \"\n\",\n    \"Thus, our DQN is now a convolutional neural network. We feed the image of the game screen (the game state) as an input to the convolutional neural network and it outputs the Q value of all the actions in the state.\n\",\n    \"\n\",\n    \"As shown in the following figure, given the image of the game screen (game state), the convolutional layers extract features from the image and produce a feature map. Next, we flatten the feature map and feed the flattened feature map as an input to the feedforward network. The feedforward network takes this flattened feature map as an input and returns the Q value of all the actions in the state: \n\",\n    \"\n\",\n    \"\n\",\n    \"![title](Images/4.png)\n\",\n    \"\n\",\n    \"Note that here we don't perform a pooling operation. A pooling operation is useful for tasks such as image classification, where we don't care about the position of the object in the image and just want to know whether the desired object is present. For example, if we want to identify whether there is a dog in an image, we only check whether a dog is present; we don't check where in the image the dog is. Thus, in this case, the pooling operation identifies whether there is a dog in the image irrespective of the position of the dog.\n\",\n    \"\n\",\n    \"But in our setting, a pooling operation should not be performed, because to understand the current game screen (state), positional information is very important. For example, in a Pong game, we don't just want to know whether there is a ball in the game screen; we want to know the position of the ball so that we can take a better action. Thus, we don't include a pooling operation in our DQN architecture. \n\",\n    \"\n\",\n    \"Now that we have understood the architecture of the DQN for playing Atari games, in the next section, we will start implementing it. \n\",\n    \"\n\",\n    \"\n\",\n    \"\n\",\n    \"## Getting hands-on with DQN\n\",\n    \"\n\",\n    \"Let's implement DQN to play the Ms Pacman game.
\\n\",\n    \"\\n\",\n    \"First, let's import the necessary\\n\",\n    \"libraries:\\n\",\n    \"\\n\",\n    \"Note that we use TensorFlow version 2.0\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import random\\n\",\n    \"import gym\\n\",\n    \"import numpy as np\\n\",\n    \"from collections import deque\\n\",\n    \"from tensorflow.keras.models import Sequential\\n\",\n    \"from tensorflow.keras.layers import Dense, Activation, Flatten, Conv2D, MaxPooling2D\\n\",\n    \"from tensorflow.keras.optimizers import Adam\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create the Ms Pacman game environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"MsPacman-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the state size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_size = (88, 80, 1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the number of actions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_size = env.action_space.n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Preprocess 
the game screen\\n\",\n    \"\\n\",\n    \"We learned that we feed the game state (image of the game screen) as an input to the DQN which is the convolutional neural network and it outputs the Q value of all the actions in the state. However, directly feeding the raw game screen image is not efficient, since the raw game screen size will be 210 x 160 x 3. This will be computationally expensive.\\n\",\n    \"\\n\",\n    \"To avoid this, we preprocess the game screen and then feed the preprocessed game screen to the DQN. First, we crop and resize the game screen image, convert the image to grayscale, normalize and then reshape the image to 88 x 80 x 1. Next, we feed this pre-processed game screen image as an input to the convolutional network (DQN) which returns the Q value.\\n\",\n    \"\\n\",\n    \"Now, let's define a function called preprocess_state which takes the game state (image of the game screen) as an input and returns the preprocessed game state (image of the game screen):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"color = np.array([210, 164, 74]).mean()\\n\",\n    \"\\n\",\n    \"def preprocess_state(state):\\n\",\n    \"\\n\",\n    \"    #crop and resize the image\\n\",\n    \"    image = state[1:176:2, ::2]\\n\",\n    \"\\n\",\n    \"    #convert the image to greyscale\\n\",\n    \"    image = image.mean(axis=2)\\n\",\n    \"\\n\",\n    \"    #improve image contrast\\n\",\n    \"    image[image==color] = 0\\n\",\n    \"\\n\",\n    \"    #normalize the image\\n\",\n    \"    image = (image - 128) / 128 - 1\\n\",\n    \"    \\n\",\n    \"    #reshape the image\\n\",\n    \"    image = np.expand_dims(image.reshape(88, 80, 1), axis=0)\\n\",\n    \"\\n\",\n    \"    return image\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Building the DQN \\n\",\n    \"\\n\",\n    \"Now, let's build the deep Q 
network. We learned that for playing atari games we use the\\n\",\n    \"convolutional neural network as the DQN which takes the image of the game screen as an\\n\",\n    \"input and returns the Q values.\\n\",\n    \"\\n\",\n    \"We define the DQN with three convolutional layers. The convolutional layers extract the\\n\",\n    \"features from the image and output the feature maps and then we flattened the feature map\\n\",\n    \"obtained by the convolutional layers and feed the flattened feature maps to the feedforward\\n\",\n    \"network (fully connected layer) which returns the Q value:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class DQN:\\n\",\n    \"    def __init__(self, state_size, action_size):\\n\",\n    \"        \\n\",\n    \"        #define the state size\\n\",\n    \"        self.state_size = state_size\\n\",\n    \"        \\n\",\n    \"        #define the action size\\n\",\n    \"        self.action_size = action_size\\n\",\n    \"        \\n\",\n    \"        #define the replay buffer\\n\",\n    \"        self.replay_buffer = deque(maxlen=5000)\\n\",\n    \"        \\n\",\n    \"        #define the discount factor\\n\",\n    \"        self.gamma = 0.9  \\n\",\n    \"        \\n\",\n    \"        #define the epsilon value\\n\",\n    \"        self.epsilon = 0.8   \\n\",\n    \"        \\n\",\n    \"        #define the update rate at which we want to update the target network\\n\",\n    \"        self.update_rate = 1000    \\n\",\n    \"        \\n\",\n    \"        #define the main network\\n\",\n    \"        self.main_network = self.build_network()\\n\",\n    \"        \\n\",\n    \"        #define the target network\\n\",\n    \"        self.target_network = self.build_network()\\n\",\n    \"        \\n\",\n    \"        #copy the weights of the main network to the target network\\n\",\n    \"        
self.target_network.set_weights(self.main_network.get_weights())\\n\",\n    \"        \\n\",\n    \"\\n\",\n    \"    #Let's define a function called build_network which is essentially our DQN. \\n\",\n    \"\\n\",\n    \"    def build_network(self):\\n\",\n    \"        model = Sequential()\\n\",\n    \"        model.add(Conv2D(32, (8, 8), strides=4, padding='same', input_shape=self.state_size))\\n\",\n    \"        model.add(Activation('relu'))\\n\",\n    \"        \\n\",\n    \"        model.add(Conv2D(64, (4, 4), strides=2, padding='same'))\\n\",\n    \"        model.add(Activation('relu'))\\n\",\n    \"        \\n\",\n    \"        model.add(Conv2D(64, (3, 3), strides=1, padding='same'))\\n\",\n    \"        model.add(Activation('relu'))\\n\",\n    \"        model.add(Flatten())\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"        model.add(Dense(512, activation='relu'))\\n\",\n    \"        model.add(Dense(self.action_size, activation='linear'))\\n\",\n    \"        \\n\",\n    \"        model.compile(loss='mse', optimizer=Adam())\\n\",\n    \"\\n\",\n    \"        return model\\n\",\n    \"\\n\",\n    \"    #We learned that we train DQN by randomly sampling a minibatch of transitions from the\\n\",\n    \"    #replay buffer. So, we define a function called store_transition which stores the transition information\\n\",\n    \"    #into the replay buffer\\n\",\n    \"\\n\",\n    \"    def store_transistion(self, state, action, reward, next_state, done):\\n\",\n    \"        self.replay_buffer.append((state, action, reward, next_state, done))\\n\",\n    \"        \\n\",\n    \"\\n\",\n    \"    #We learned that in DQN, to take care of exploration-exploitation trade off, we select action\\n\",\n    \"    #using the epsilon-greedy policy. 
So, now we define the function called epsilon_greedy\\n\",\n    \"    #for selecting action using the epsilon-greedy policy.\\n\",\n    \"    \\n\",\n    \"    def epsilon_greedy(self, state):\\n\",\n    \"        if random.uniform(0,1) < self.epsilon:\\n\",\n    \"            return np.random.randint(self.action_size)\\n\",\n    \"        \\n\",\n    \"        Q_values = self.main_network.predict(state)\\n\",\n    \"        \\n\",\n    \"        return np.argmax(Q_values[0])\\n\",\n    \"\\n\",\n    \"    \\n\",\n    \"    #train the network\\n\",\n    \"    def train(self, batch_size):\\n\",\n    \"        \\n\",\n    \"        #sample a mini batch of transition from the replay buffer\\n\",\n    \"        minibatch = random.sample(self.replay_buffer, batch_size)\\n\",\n    \"        \\n\",\n    \"        #compute the Q value using the target network\\n\",\n    \"        for state, action, reward, next_state, done in minibatch:\\n\",\n    \"            if not done:\\n\",\n    \"                target_Q = (reward + self.gamma * np.amax(self.target_network.predict(next_state)))\\n\",\n    \"            else:\\n\",\n    \"                target_Q = reward\\n\",\n    \"                \\n\",\n    \"            #compute the Q value using the main network \\n\",\n    \"            Q_values = self.main_network.predict(state)\\n\",\n    \"            \\n\",\n    \"            Q_values[0][action] = target_Q\\n\",\n    \"            \\n\",\n    \"            #train the main network\\n\",\n    \"            self.main_network.fit(state, Q_values, epochs=1, verbose=0)\\n\",\n    \"            \\n\",\n    \"    #update the target network weights by copying from the main network\\n\",\n    \"    def update_target_network(self):\\n\",\n    \"        self.target_network.set_weights(self.main_network.get_weights())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's train 
the network. First, let's set the number of episodes we want to train the\\n\",\n    \"network:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 500\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 20000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 8\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the number of past game screens we want to consider:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_screens = 4     \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the DQN class\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"dqn = DQN(state_size, action_size)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"done = False\\n\",\n    \"time_step = 0\\n\",\n    \"\\n\",\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #set return to 0\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #preprocess the game screen\\n\",\n    \"    state = preprocess_state(env.reset())\\n\",\n    \"\\n\",\n    \"    #for each step in the 
episode\\n\",\n    \"    for t in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #update the time step\\n\",\n    \"        time_step += 1\\n\",\n    \"        \\n\",\n    \"        #update the target network\\n\",\n    \"        if time_step % dqn.update_rate == 0:\\n\",\n    \"            dqn.update_target_network()\\n\",\n    \"        \\n\",\n    \"        #select the action\\n\",\n    \"        action = dqn.epsilon_greedy(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, _ = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #preprocess the next state\\n\",\n    \"        next_state = preprocess_state(next_state)\\n\",\n    \"        \\n\",\n    \"        #store the transition information\\n\",\n    \"        dqn.store_transistion(state, action, reward, next_state, done)\\n\",\n    \"        \\n\",\n    \"        #update current state to next state\\n\",\n    \"        state = next_state\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"        \\n\",\n    \"        #if the episode is done then print the return\\n\",\n    \"        if done:\\n\",\n    \"            print('Episode: ',i, ',' 'Return', Return)\\n\",\n    \"            break\\n\",\n    \"            \\n\",\n    \"        #if the number of transistions in the replay buffer is greater than batch size\\n\",\n    \"        #then train the network\\n\",\n    \"        if len(dqn.replay_buffer) > batch_size:\\n\",\n    \"            dqn.train(batch_size)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have learned how deep Q network works and how to build the DQN to play the Atari games, in the next section, we will learn an interesting variant of DQN called\\n\",\n    
\"double DQN. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
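Note that the notebook sets `num_screens = 4` ("the number of past game screens we want to consider"), but the training loop above feeds only a single preprocessed screen to the network. A minimal sketch of how past screens could be stacked into one state is shown below; `create_frame_buffer` and `stack_frames` are hypothetical helper names (not the book's code), the frames are assumed to have shape (88, 80, 1) as produced by `preprocess_state` before the batch dimension is added, and using this would require changing `state_size` to `(88, 80, num_screens)`:

```python
import numpy as np
from collections import deque

num_screens = 4  # number of past game screens to stack, matching the notebook's setting

def create_frame_buffer():
    # fixed-length buffer that automatically drops the oldest frame
    return deque(maxlen=num_screens)

def stack_frames(frame_buffer, frame):
    """Append a preprocessed frame of shape (88, 80, 1) and return the stacked state.

    On the first call the frame is repeated so the stacked state always
    has num_screens channels: (88, 80, num_screens).
    """
    if len(frame_buffer) == 0:
        for _ in range(num_screens):
            frame_buffer.append(frame)
    else:
        frame_buffer.append(frame)
    # concatenate along the channel axis: four (88, 80, 1) frames -> (88, 80, 4)
    return np.concatenate(list(frame_buffer), axis=-1)

# usage sketch with dummy frames
buffer = create_frame_buffer()
state = stack_frames(buffer, np.zeros((88, 80, 1)))
print(state.shape)  # (88, 80, 4)
```

Stacking consecutive screens is what lets a feedforward DQN infer motion (for example, the direction a ghost is moving), which a single still frame cannot convey.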
  },
  {
    "path": "09.  Deep Q Network and its Variants/READEME.md",
"content": "# 9. Deep Q Network and its Variants\n\n* 9.1. What is Deep Q Network?\n* 9.2. Understanding DQN\n   * 9.2.1. Replay Buffer\n   * 9.2.2. Loss Function\n   * 9.2.3. Target Network\n   * 9.2.4. Putting it All Together\n   * 9.2.5. Algorithm - DQN\n* 9.3. Playing Atari Games using DQN\n   * 9.3.1. Architecture of DQN\n   * 9.3.2. Getting Hands-on with the DQN\n* 9.4. Double DQN\n   * 9.4.1. Algorithm - Double DQN\n* 9.5. DQN with Prioritized Experience Replay\n   * 9.5.1. Types of Prioritization\n   * 9.5.2. Correcting the Bias\n* 9.6. Dueling DQN\n   * 9.6.1. Understanding Dueling DQN\n   * 9.6.2. Architecture of Dueling DQN\n* 9.7. Deep Recurrent Q Network\n   * 9.7.1. Architecture of DRQN"
  },
  {
    "path": "10. Policy Gradient Method/.ipynb_checkpoints/10.01. Why Policy based Methods-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Why policy-based methods?\\n\",\n    \"\\n\",\n    \"The objective of reinforcement learning is to find the optimal policy, which is the\\n\",\n    \"policy that provides the maximum return. So far, we have learned several different\\n\",\n    \"algorithms for computing the optimal policy, and all these algorithms have been\\n\",\n    \"value-based methods. Wait, what are value-based methods? Let's recap what value-\\n\",\n    \"based methods are, and the problems associated with them, and then we will learn\\n\",\n    \"about policy-based methods. Recapping is always good, isn't it?\\n\",\n    \"\\n\",\n    \"With value-based methods, we extract the optimal policy from the optimal Q\\n\",\n    \"function (Q values), meaning we compute the Q values of all state-action pairs to\\n\",\n    \"find the policy. We extract the policy by selecting an action in each state that has the\\n\",\n    \"maximum Q value. For instance, let's say we have two states $s_0$ and $s_1$ and our action\\n\",\n    \"space has two actions; let the actions be 0 and 1. First, we compute the Q value of all\\n\",\n    \"the state-action pairs, as shown in the following table. Now, we extract policy from\\n\",\n    \"the Q function (Q values) by selecting action 0 in state $s_0$ and action 1 in state $s_1$ as\\n\",\n    \"they have the maximum Q value:\\n\",\n    \"\\n\",\n    \"![title](Images/1.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Later, we learned that it is difficult to compute the Q function when our environment\\n\",\n    \"has a large number of states and actions as it would be expensive to compute the\\n\",\n    \"Q values of all possible state-action pairs. So, we resorted to the Deep Q Network\\n\",\n    \"(DQN). 
In DQN, we used a neural network to approximate the Q function (Q value).\\n\",\n    \"Given a state, the network will return the Q values of all possible actions in that\\n\",\n    \"state. For instance, consider the grid world environment. Given a state, our DQN will\\n\",\n    \"return the Q values of all possible actions in that state. Then we select the action that\\n\",\n    \"has the highest Q value. As we can see in the following figure, given state E, DQN returns the\\n\",\n    \"Q value of all possible actions (up, down, left, right). Then we select the right action in\\n\",\n    \"state E since it has the maximum Q value:\\n\",\n    \"\\n\",\n    \"![title](Images/2.png)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Thus, in value-based methods, we improve the Q function iteratively, and once we\\n\",\n    \"have the optimal Q function, then we extract optimal policy by selecting the action\\n\",\n    \"in each state that has the maximum Q value.\\n\",\n    \"\\n\",\n    \"One of the disadvantages of the value-based method is that it is suitable only for\\n\",\n    \"discrete environments (environments with a discrete action space), and we cannot\\n\",\n    \"apply value-based methods in continuous environments (environments with a\\n\",\n    \"continuous action space).\\n\",\n    \"\\n\",\n    \"We have learned that a discrete action space has a discrete set of actions; for example,\\n\",\n    \"the grid world environment has discrete actions (up, down, left, and right) and the\\n\",\n    \"continuous action space consists of actions that are continuous values, for example,\\n\",\n    \"controlling the speed of a car.\\n\",\n    \"\\n\",\n    \"So far, we have only dealt with a discrete environment where we had a discrete\\n\",\n    \"action space, so we easily computed the Q value of all possible state-action pairs.\\n\",\n    \"But how can we compute the Q value of all possible state-action pairs when our\\n\",\n    \"action space is continuous? 
Say we are training an agent to drive a car and say we\n\",\n    \"have one continuous action in our action space. Let the action be the speed of the\n\",\n    \"car, where the speed ranges from 0 to 150 kmph. In this case,\n\",\n    \"how can we compute the Q value of all possible state-action pairs when the action\n\",\n    \"is a continuous value?\n\",\n    \"\n\",\n    \"In this case, we can discretize the continuous actions: speed 0 to 10 becomes action\n\",\n    \"1, speed 10 to 20 becomes action 2, and so on. After discretization, we can compute the\n\",\n    \"Q value of all possible state-action pairs. However, discretization is not always\n\",\n    \"desirable. We might lose several important features and we might end up with an\n\",\n    \"action space containing a huge set of actions.\n\",\n    \"\n\",\n    \"\n\",\n    \"Most real-world problems, such as a self-driving car or\n\",\n    \"a robot learning to walk, have a continuous action space. Apart from being continuous,\n\",\n    \"these action spaces are often also high-dimensional. Thus, DQN and other value-based methods cannot\n\",\n    \"deal with a continuous action space effectively.\n\",\n    \"\n\",\n    \"So, we use policy-based methods. With policy-based methods, we don't need\n\",\n    \"to compute the Q function (Q values) to find the optimal policy; instead, we can\n\",\n    \"compute the policy directly. That is, we don't need the Q function to extract the policy.\n\",\n    \"Policy-based methods have several advantages over value-based methods, and they\n\",\n    \"can handle both discrete and continuous action spaces.\n\",\n    \"\n\",\n    \"We learned that DQN takes care of the exploration-exploitation dilemma by using\n\",\n    \"the epsilon-greedy policy. 
With the epsilon-greedy policy, we either select the best\\n\",\n    \"action with the probability 1-epsilon or a random action with the probability epsilon.\\n\",\n    \"Most policy-based methods use a stochastic policy. We know that with a stochastic\\n\",\n    \"policy, we select actions based on the probability distribution over the action\\n\",\n    \"space, which allows the agent to explore different actions instead of performing the\\n\",\n    \"same action every time. Thus, policy-based methods take care of the exploration-\\n\",\n    \"exploitation trade-off implicitly by using a stochastic policy. However, there are\\n\",\n    \"several policy-based methods that use a deterministic policy as well. We will learn\\n\",\n    \"more about them in the upcoming chapters.\\n\",\n    \"\\n\",\n    \"Okay, how do policy-based methods work, exactly? How do they find an optimal\\n\",\n    \"policy without computing the Q function? We will learn about this in the next\\n\",\n    \"section. Now that we have a basic understanding of what a policy gradient\\n\",\n    \"method is, and also the disadvantages of value-based methods, in the next section\\n\",\n    \"we will learn about a fundamental and interesting policy-based method called\\n\",\n    \"policy gradient.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
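The discretization idea described in this section (speed 0 to 10 as action 1, speed 10 to 20 as action 2, and so on) can be sketched in a few lines. This is only an illustration of the idea, not code from the book; `speed_to_action` is a made-up name, and the 10 kmph bin width follows the text:

```python
import numpy as np

# bin edges for a speed range of 0 to 150 kmph in steps of 10:
# 0-10 kmph -> action 1, 10-20 kmph -> action 2, ..., 140-150 kmph -> action 15
bin_edges = np.arange(10, 150, 10)   # [10, 20, ..., 140]

def speed_to_action(speed):
    """Map a continuous speed in [0, 150) kmph to a 1-based discrete action index."""
    return int(np.digitize(speed, bin_edges)) + 1

print(speed_to_action(7.5))    # speed in [0, 10) -> action 1
print(speed_to_action(133.0))  # speed in [130, 140) -> action 14
```

Even this coarse binning already yields 15 actions for a single speed dimension, which hints at the problem the text raises: finer bins or multiple continuous dimensions quickly blow up the action space, and all distinctions within a bin are lost.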
  },
  {
    "path": "10. Policy Gradient Method/.ipynb_checkpoints/10.02. Policy Gradient Intuition-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Policy gradient intuition\n\",\n    \"\n\",\n    \"Policy gradient is one of the most popular algorithms in deep reinforcement learning.\n\",\n    \"As we have learned, policy gradient is a policy-based method by which we can find\n\",\n    \"the optimal policy without computing the Q function. It finds the optimal policy by\n\",\n    \"directly parameterizing the policy using some parameter $\\\\theta$.\n\",\n    \"\n\",\n    \"\n\",\n    \"The policy gradient method uses a stochastic policy. We have learned that with a\n\",\n    \"stochastic policy, we select an action based on the probability distribution over the\n\",\n    \"action space. Say we have a stochastic policy $\\\\pi$; it gives the probability of taking\n\",\n    \"an action $a$ given the state $s$, denoted by $\\\\pi(a|s)$. In the policy gradient\n\",\n    \"method, we use a parameterized policy, so we can denote our policy as $\\\\pi_{\\\\theta}(a|s)$,\n\",\n    \"where $\\\\theta$ indicates that our policy is parameterized.\n\",\n    \"\n\",\n    \"\n\",\n    \"Wait! What do we mean when we say a parameterized policy? What is it exactly?\n\",\n    \"Remember with DQN, we learned that we parameterize our Q function to compute\n\",\n    \"the Q value? We can do the same here, except instead of parameterizing the Q\n\",\n    \"function, we will directly parameterize the policy to compute the optimal policy.\n\",\n    \"That is, we can use any function approximator to learn the optimal policy, and $\\\\theta$ is\n\",\n    \"the parameter of our function approximator. We generally use a neural network as\n\",\n    \"our function approximator. Thus, we have a policy $\\\\pi$ parameterized by $\\\\theta$, where $\\\\theta$ is\n\",\n    \"the parameter of the neural network.\n\",\n    \"\n\",\n    \"Say we have a neural network with a parameter $\\\\theta$. 
First, we feed the state of the\\n\",\n    \"environment as an input to the network and it will output the probability of all\\n\",\n    \"the actions that can be performed in the state. That is, it outputs a probability\\n\",\n    \"distribution over an action space. We have learned that with policy gradient, we use\\n\",\n    \"a stochastic policy. So, the stochastic policy selects an action based on the probability\\n\",\n    \"distribution given by the neural network. In this way, we can directly compute the\\n\",\n    \"policy without using the Q function.\\n\",\n    \"\\n\",\n    \"Let's understand how the policy gradient method works with an example. Let's take\\n\",\n    \"our favorite grid world environment for better understanding. We know that in the\\n\",\n    \"grid world environment our action space has four possible actions: up, down, left,\\n\",\n    \"and right.\\n\",\n    \"\\n\",\n    \"Given any state as an input, the neural network will output the probability\\n\",\n    \"distribution over the action space. That is, as shown in the following figure, when we feed the\\n\",\n    \"state E as an input to the network, it will return the probability distribution over all\\n\",\n    \"actions in our action space. Now, our stochastic policy will select an action based on\\n\",\n    \"the probability distribution given by the neural network. So, it will select action up\\n\",\n    \"10% of the time, down 10% of the time, left 10% of the time, and right 70% of the time:\\n\",\n    \"\\n\",\n    \"![title](Images/3.png)\\n\",\n    \"\\n\",\n    \"We should not get confused with the DQN and the policy gradient method. 
With\\n\",\n    \"DQN, we feed the state as an input to the network, and it returns the Q values of all\\n\",\n    \"possible actions in that state, then we select an action that has a maximum Q value.\\n\",\n    \"But in the policy gradient method, we feed the state as input to the network, and it\\n\",\n    \"returns the probability distribution over an action space, and our stochastic policy\\n\",\n    \"uses the probability distribution returned by the neural network to select an action.\\n\",\n    \"\\n\",\n    \"Okay, in the policy gradient method, the network returns the probability distribution\\n\",\n    \"(action probabilities) over the action space, but how accurate are the probabilities?\\n\",\n    \"How does the network learn?\\n\",\n    \"\\n\",\n    \"__We will discuss this in detail in the next section.__\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "10. Policy Gradient Method/10.07. Cart Pole Balancing with Policy Gradient.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Cart pole balancing with policy gradient\\n\",\n    \"\\n\",\n    \"Now, let's learn how to implement the policy gradient algorithm with reward-to-go for the\\n\",\n    \"cart pole balancing task.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"import numpy as np\\n\",\n    \"import gym\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For a clear understanding of how the policy gradient method works, we use\\n\",\n    \"TensorFlow in the non-eager mode by disabling TensorFlow 2 behavior\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the cart pole environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('CartPole-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   
\"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the number of actions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_actions = env.action_space.n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Computing discounted and normalized reward\\n\",\n    \"\\n\",\n    \"Instead of using the rewards directly, we can use the discounted and normalized rewards. \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.95\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's define a function called `discount_and_normalize_rewards` for computing the discounted and normalized rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def discount_and_normalize_rewards(episode_rewards):\\n\",\n    \"    \\n\",\n    \"    #initialize an array for storing the discounted reward\\n\",\n    \"    discounted_rewards = np.zeros_like(episode_rewards)\\n\",\n    \"    \\n\",\n    \"    #compute the discounted reward\\n\",\n    \"    reward_to_go = 0.0\\n\",\n    \"    for i in reversed(range(len(episode_rewards))):\\n\",\n    \"        reward_to_go = reward_to_go * gamma + episode_rewards[i]\\n\",\n    \"        discounted_rewards[i] = reward_to_go\\n\",\n    \"        \\n\",\n    \"    #normalize and return the reward\\n\",\n    \"    discounted_rewards -= np.mean(discounted_rewards)\\n\",\n    \"    discounted_rewards 
/= np.std(discounted_rewards)\\n\",\n    \"    \\n\",\n    \"    return discounted_rewards\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Building the policy network\\n\",\n    \"\\n\",\n    \"First, let's define the placeholder for the state:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_ph = tf.placeholder(tf.float32, [None, state_shape], name=\\\"state_ph\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the placeholder for the action:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_ph = tf.placeholder(tf.int32, [None, num_actions], name=\\\"action_ph\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the placeholder for the discounted reward:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"discounted_rewards_ph = tf.placeholder(tf.float32, [None,], name=\\\"discounted_rewards\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define layer 1:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"layer1 = tf.layers.dense(state_ph, units=32, activation=tf.nn.relu)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define layer 2; note that the number of units in layer 2 is set to the number of\\n\",\n    \"actions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"layer2 = tf.layers.dense(layer1, units=num_actions)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Obtain the probability distribution over the action space as an output of the network by\\n\",\n    \"applying the softmax function to the result of layer 2:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"prob_dist = tf.nn.softmax(layer2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We learned that we compute the gradient as:\\n\",\n    \"\\n\",\n    \"$$\\\\nabla_{\\\\theta} J(\\\\theta) = \\\\frac{1}{N} \\\\sum_{i=1}^{N}\\\\left[\\\\sum_{t=0}^{T-1}  \\\\nabla_{\\\\theta} \\\\log \\\\pi_{\\\\theta}\\\\left(a_{t} | s_{t}\\\\right)R_t\\\\right] $$\\n\",\n    \"    \\n\",\n    \"After computing the gradient, we update the parameters of the network using gradient\\n\",\n    \"ascent as:    \\n\",\n    \"\\n\",\n    \"$$\\\\theta = \\\\theta + \\\\alpha \\\\nabla_{\\\\theta} J(\\\\theta) $$\\n\",\n    \"\\n\",\n    \"However, it is a standard convention to perform minimization rather than maximization.\\n\",\n    \"So, we can convert the above maximization objective into a minimization objective by just\\n\",\n    \"adding a negative sign.\\n\",\n    \"\\n\",\n    \"Thus, we can define the negative log policy as:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"neg_log_policy = tf.nn.softmax_cross_entropy_with_logits_v2(logits = layer2, labels = action_ph)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's define the loss:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"loss = tf.reduce_mean(neg_log_policy * discounted_rewards_ph) \"\n   ]\n  },\n  {\n   
\"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the train operation for minimizing the loss using Adam optimizer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"train = tf.train.AdamOptimizer(0.01).minimize(loss)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's train the network for several iterations. For simplicity, let's just generate one\\n\",\n    \"episode on every iteration.\\n\",\n    \"\\n\",\n    \"Set the number of iterations:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_iterations = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Iteration:0, Return: 71.0\\n\",\n      \"Iteration:10, Return: 12.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"#start the TensorFlow session\\n\",\n    \"with tf.Session() as sess:\\n\",\n    \"    \\n\",\n    \"    #initialize all the TensorFlow variables\\n\",\n    \"    sess.run(tf.global_variables_initializer())\\n\",\n    \"    \\n\",\n    \"    #for every iteration\\n\",\n    \"    for i in range(num_iterations):\\n\",\n    \"        \\n\",\n    \"        #initialize an empty list for storing the states, actions, and rewards obtained in the episode\\n\",\n    \"        episode_states, episode_actions, episode_rewards = [],[],[]\\n\",\n    \"    \\n\",\n    \"        #set the done to False\\n\",\n    \"        done = False\\n\",\n    \"        \\n\",\n    \"        #initialize the state by resetting the environment\\n\",\n    \"        state = env.reset()\\n\",\n    \"\\n\",\n    \"        #initialize 
the return\\n\",\n    \"        Return = 0\\n\",\n    \"\\n\",\n    \"        #while the episode is not over\\n\",\n    \"        while not done:\\n\",\n    \"            \\n\",\n    \"            #reshape the state\\n\",\n    \"            state = state.reshape([1,4])\\n\",\n    \"            \\n\",\n    \"            #feed the state to the policy network and the network returns the probability distribution\\n\",\n    \"            #over the action space as output which becomes our stochastic policy \\n\",\n    \"            pi = sess.run(prob_dist, feed_dict={state_ph: state})\\n\",\n    \"            \\n\",\n    \"            #now, we select an action using this stochastic policy\\n\",\n    \"            a = np.random.choice(range(pi.shape[1]), p=pi.ravel()) \\n\",\n    \"            \\n\",\n    \"            #perform the selected action\\n\",\n    \"            next_state, reward, done, info = env.step(a)\\n\",\n    \"            \\n\",\n    \"            #render the environment\\n\",\n    \"            env.render()\\n\",\n    \"            \\n\",\n    \"            #update the return\\n\",\n    \"            Return += reward\\n\",\n    \"            \\n\",\n    \"            #one-hot encode the action\\n\",\n    \"            action = np.zeros(num_actions)\\n\",\n    \"            action[a] = 1\\n\",\n    \"            \\n\",\n    \"            #store the state, action, and reward into their respective lists\\n\",\n    \"            episode_states.append(state)\\n\",\n    \"            episode_actions.append(action)\\n\",\n    \"            episode_rewards.append(reward)\\n\",\n    \"\\n\",\n    \"            #update the state to the next state\\n\",\n    \"            state = next_state\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"        #compute the discounted and normalized reward\\n\",\n    \"        discounted_rewards = discount_and_normalize_rewards(episode_rewards)\\n\",\n    \"        \\n\",\n    \"        #define the feed dictionary\\n\",\n    \"        feed_dict = {state_ph: np.vstack(np.array(episode_states)),\\n\",\n    \"                     action_ph: np.vstack(np.array(episode_actions)), \\n\",\n    \"                     discounted_rewards_ph: discounted_rewards \\n\",\n    \"                    }\\n\",\n    \"                    \\n\",\n    \"        #train the network\\n\",\n    \"        loss_, _ = sess.run([loss, train], feed_dict=feed_dict)\\n\",\n    \"\\n\",\n    \"        #print the return every 10 iterations\\n\",\n    \"        if i%10==0:\\n\",\n    \"            print(\\\"Iteration:{}, Return: {}\\\".format(i,Return))  \\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have learned how to implement the policy gradient algorithm with reward-to-go, in the next section, we will learn another interesting variance reduction technique\\n\",\n    \"called policy gradient with baseline. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "10. Policy Gradient Method/README.md",
    "content": "# 10. Policy Gradient Method\n* 10.1. Why Policy Based Methods?\n* 10.2. Policy Gradient Intuition\n* 10.3. Understanding the Policy Gradient\n* 10.4. Deriving Policy Gradient\n   * 10.4.1. Algorithm - Policy Gradient\n* 10.5. Variance Reduction Methods\n* 10.6. Policy Gradient with Reward-to-go\n   * 10.6.1. Algorithm - Reward-to-go Policy Gradient\n* 10.7. Cart Pole Balancing with Policy Gradient\n* 10.8. Policy Gradient with Baseline\n   * 10.8.1. Algorithm - Reinforce with Baseline\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/.ipynb_checkpoints/11.01. Overview of actor critic method-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Overview of actor critic method\\n\",\n    \"\\n\",\n    \"The actor critic method is one of the most popular algorithms in deep reinforcement learning. Several modern deep reinforcement learning algorithms are designed based on actor critic methods. The actor critic method lies at the intersection of value based and policy based methods; that is, it takes advantage of both.\\n\",\n    \"\\n\",\n    \"In this section, let's first get a basic understanding of how the actor critic method works; in the next section, we will get into the details and understand the math behind actor critic methods. \\n\",\n    \"\\n\",\n    \"Actor critic, as the name suggests, consists of two networks: the actor network and the critic network. The role of the actor network is to find an optimal policy, while the role of the critic network is to evaluate the policy produced by the actor network. So, we can think of the critic network as a feedback network that evaluates and guides the actor network in finding the optimal policy, as shown below:\\n\",\n    \"\\n\",\n    \"![title](Images/1.png)\\n\",\n    \"\\n\",\n    \"Okay, what exactly are the actor and critic networks, and how do they work together to improve the policy? The actor network is basically the policy network, and it finds the optimal policy using a policy gradient method. The critic network is basically the value network, and it estimates the state value. Thus, using the state value, the critic network evaluates the action produced by the actor network and sends its feedback to the actor. Based on the critic's feedback, the actor network updates its parameters.\\n\",\n    \"\\n\",\n    \"Thus, in the actor critic method, we use two networks - the actor network (policy network), which computes the policy, and the critic network (value network), which evaluates the policy produced by the actor network by computing the value function (state values). Isn't this similar to something we just learned in the previous chapter?\\n\",\n    \"\\n\",\n    \"Yes! If you recollect, it is similar to the policy gradient method with baseline (reinforce with baseline) that we learned in the previous chapter. Similar to reinforce with baseline, here too we have an actor (policy network) and a critic (value network). However, actor critic is NOT exactly the same as reinforce with baseline. In the reinforce with baseline method, we learned that we use the value network as the baseline, which helps to reduce the variance in the gradient updates. In the actor critic method as well, we use the critic to reduce the variance in the gradient updates of the actor, but the critic also helps to improve the policy iteratively in an online fashion. The distinction between these two will be made clear in the next section.\\n\",\n    \"\\n\",\n    \"Now that we have a basic understanding of what the actor critic method is, in the next section we will learn how exactly the actor critic method works. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/.ipynb_checkpoints/11.05. Mountain Car Climbing using A3C-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Mountain car climbing using A3C\\n\",\n    \"\\n\",\n    \"Let's implement the A3C algorithm for the mountain car climbing task. In the mountain car\\n\",\n    \"climbing environment, a car is placed between the two mountains and the goal of the agent\\n\",\n    \"is to drive up the mountain on the right. But the problem is, the agent can't drive up the\\n\",\n    \"mountain in one pass. So, the agent has to drive back and forth to build momentum to\\n\",\n    \"drive up the mountain on the right. A high reward will be assigned if our agent spends less\\n\",\n    \"energy on driving up. The Mountain car environment is shown in the below figure:\\n\",\n    \"\\n\",\n    \"![title](Images/2.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The code used in this section is adapted from the open-source implementation of A3C\\n\",\n    \"(https://github.com/stefanbo92/A3C-Continuous) provided by Stefan Boschenriedter.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"import multiprocessing\\n\",\n    \"import threading\\n\",\n    \"import numpy as np\\n\",\n    \"import os\\n\",\n    \"import shutil\\n\",\n    \"import matplotlib.pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   
\"metadata\": {},\n   \"source\": [\n    \"For a clear understanding of how the A3C method works, we use\\n\",\n    \"TensorFlow in the non-eager mode by disabling TensorFlow 2 behavior.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the mountain car environment\\n\",\n    \"\\n\",\n    \"Let's create a mountain car environment using gym. Note that our mountain car\\n\",\n    \"environment is a continuous environment, meaning that our action space is continuous:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('MountainCarContinuous-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that we created the continuous mountain car environment and thus our action space\\n\",\n    \"consists of continuous values. 
So, we get the bound of our action space: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\\n\",\n    \"\\n\",\n    \"Define the number of workers as the number of CPUs:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_workers = multiprocessing.cpu_count() \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 2000 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 200 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the global network (global agent) scope:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"global_net_scope = 'Global_Net'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the time step at which we want to update the global network:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"update_global = 10\"\n   ]\n  },\n  {\n 
  \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.90 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the beta value:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"beta = 0.01 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the directory where we want to store the logs:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"log_dir = 'logs'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the actor critic class\\n\",\n    \"\\n\",\n    \"We learned that in A3C both the global and worker agents follow the actor critic\\n\",\n    \"architecture. So, let's define the class called ActorCritic where we will implement the\\n\",\n    \"actor critic algorithm. 
For a clear understanding, you can check the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class ActorCritic(object):\\n\",\n    \"    \\n\",\n    \"     #first, let's define the init method\\n\",\n    \"     def __init__(self, scope, sess, globalAC=None):\\n\",\n    \"            \\n\",\n    \"        #initialize the TensorFlow session\\n\",\n    \"        self.sess = sess\\n\",\n    \"        \\n\",\n    \"        #define the actor network optimizer as RMS prop\\n\",\n    \"        self.actor_optimizer = tf.train.RMSPropOptimizer(0.0001, name='RMSPropA')\\n\",\n    \"        \\n\",\n    \"        #define the critic network optimizer as RMS prop\\n\",\n    \"        self.critic_optimizer = tf.train.RMSPropOptimizer(0.001, name='RMSPropC')\\n\",\n    \" \\n\",\n    \"        #if the scope is the global network (global agent)\\n\",\n    \"        if scope == global_net_scope:\\n\",\n    \"            with tf.variable_scope(scope):\\n\",\n    \"                    \\n\",\n    \"                #define the placeholder for the state\\n\",\n    \"                self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"                \\n\",\n    \"                #build the global network (global agent) and get the actor and critic parameters\\n\",\n    \"                self.actor_params, self.critic_params = self.build_network(scope)[-2:]\\n\",\n    \"      \\n\",\n    \"        #if the network is not the global network then\\n\",\n    \"        else:\\n\",\n    \"            with tf.variable_scope(scope):\\n\",\n    \"                \\n\",\n    \"                #define the placeholder for the state\\n\",\n    \"                self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"                \\n\",\n    \"                #we learned that our environment is the continuous 
environment, so the actor network\\n\",\n    \"                #(policy network) returns the mean and variance of the action and then we build the action\\n\",\n    \"                #distribution out of this mean and variance and select the action based on this action \\n\",\n    \"                #distribution. \\n\",\n    \"                \\n\",\n    \"                #define the placeholder for obtaining the action distribution\\n\",\n    \"                self.action_dist = tf.placeholder(tf.float32, [None, action_shape], 'action')\\n\",\n    \"                \\n\",\n    \"                #define the placeholder for the target value\\n\",\n    \"                self.target_value = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\\n\",\n    \"                \\n\",\n    \"                #build the worker network (worker agent) and get the mean and variance of the action, the\\n\",\n    \"                #value of the state, and actor and critic network parameters:\\n\",\n    \"                mean, variance, self.value, self.actor_params, self.critic_params = self.build_network(scope)\\n\",\n    \"\\n\",\n    \"                #Compute the TD error which is the difference between the target value of the state and the\\n\",\n    \"                #predicted value of the state\\n\",\n    \"                td_error = tf.subtract(self.target_value, self.value, name='TD_error')\\n\",\n    \"    \\n\",\n    \"                #now, let's define the critic network loss\\n\",\n    \"                with tf.name_scope('critic_loss'):\\n\",\n    \"                    self.critic_loss = tf.reduce_mean(tf.square(td_error))\\n\",\n    \"                    \\n\",\n    \"                with tf.name_scope('wrap_action'):\\n\",\n    \"                    mean, variance = mean * action_bound[1], variance + 1e-4\\n\",\n    \"                    \\n\",\n    \"                #create a normal distribution based on the mean and variance of the action\\n\",\n    \"                
normal_dist = tf.distributions.Normal(mean, variance)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"                #now, let's define the actor network loss\\n\",\n    \"                with tf.name_scope('actor_loss'):\\n\",\n    \"                    \\n\",\n    \"                    #compute the log probability of the action\\n\",\n    \"                    log_prob = normal_dist.log_prob(self.action_dist)\\n\",\n    \"         \\n\",\n    \"                    #define the entropy of the policy\\n\",\n    \"                    entropy_pi = normal_dist.entropy()\\n\",\n    \"                    \\n\",\n    \"                    #compute the actor network loss\\n\",\n    \"                    self.loss = log_prob * td_error + (beta * entropy_pi)\\n\",\n    \"                    self.actor_loss = tf.reduce_mean(-self.loss)\\n\",\n    \"       \\n\",\n    \"                #select the action based on the normal distribution\\n\",\n    \"                with tf.name_scope('select_action'):\\n\",\n    \"                    self.action = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=0), \\n\",\n    \"                                                   action_bound[0], action_bound[1])\\n\",\n    \"     \\n\",\n    \"        \\n\",\n    \"                #compute the gradients of actor and critic network loss of the worker agent (local agent)\\n\",\n    \"                with tf.name_scope('local_grad'):\\n\",\n    \"\\n\",\n    \"                    self.actor_grads = tf.gradients(self.actor_loss, self.actor_params)\\n\",\n    \"                    self.critic_grads = tf.gradients(self.critic_loss, self.critic_params)\\n\",\n    \" \\n\",\n    \"            #now, let's perform the sync operation\\n\",\n    \"            with tf.name_scope('sync'):\\n\",\n    \"                \\n\",\n    \"                #after computing the gradients of the loss of the actor and critic network, worker agent\\n\",\n    \"                #sends (push) those gradients to the global 
agent\\n\",\n    \"                with tf.name_scope('push'):\\n\",\n    \"                    self.update_actor_params = self.actor_optimizer.apply_gradients(zip(self.actor_grads,\\n\",\n    \"                                                                                        globalAC.actor_params))\\n\",\n    \"                    self.update_critic_params = self.critic_optimizer.apply_gradients(zip(self.critic_grads, \\n\",\n    \"                                                                                          globalAC.critic_params))\\n\",\n    \"\\n\",\n    \"                #global agent updates their parameter with the gradients received from the worker agents\\n\",\n    \"                #(local agents). Then the worker agents, pull the updated parameter from the global agent\\n\",\n    \"                with tf.name_scope('pull'):\\n\",\n    \"                    self.pull_actor_params = [l_p.assign(g_p) for l_p, g_p in zip(self.actor_params, \\n\",\n    \"                                                                                  globalAC.actor_params)]\\n\",\n    \"                    self.pull_critic_params = [l_p.assign(g_p) for l_p, g_p in zip(self.critic_params, \\n\",\n    \"                                                                                   globalAC.critic_params)]\\n\",\n    \"                \\n\",\n    \"\\n\",\n    \"     #let's define the function for building the actor critic network\\n\",\n    \"     def build_network(self, scope):\\n\",\n    \"            \\n\",\n    \"        #initialize the weight:\\n\",\n    \"        w_init = tf.random_normal_initializer(0., .1)\\n\",\n    \"        \\n\",\n    \"        #define the actor network which returns the mean and variance of the action\\n\",\n    \"        with tf.variable_scope('actor'):\\n\",\n    \"            l_a = tf.layers.dense(self.state, 200, tf.nn.relu, kernel_initializer=w_init, name='la')\\n\",\n    \"            mean = tf.layers.dense(l_a, 
action_shape, tf.nn.tanh,kernel_initializer=w_init, name='mean')\\n\",\n    \"            variance = tf.layers.dense(l_a, action_shape, tf.nn.softplus, kernel_initializer=w_init, name='variance')\\n\",\n    \"            \\n\",\n    \"        #define the critic network which returns the value of the state\\n\",\n    \"        with tf.variable_scope('critic'):\\n\",\n    \"            l_c = tf.layers.dense(self.state, 100, tf.nn.relu, kernel_initializer=w_init, name='lc')\\n\",\n    \"            value = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='value')\\n\",\n    \"        \\n\",\n    \"        actor_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\\n\",\n    \"        critic_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\\n\",\n    \"        \\n\",\n    \"        #Return the mean and variance of the action produced by the actor network, value of the\\n\",\n    \"        #state computed by the critic network and the parameters of the actor and critic network\\n\",\n    \"        \\n\",\n    \"        return mean, variance, value, actor_params, critic_params\\n\",\n    \"    \\n\",\n    \"     #let's define a function called update_global for updating the parameters of the global\\n\",\n    \"     #network with the gradients of loss computed by the worker networks, that is, the push operation\\n\",\n    \"     def update_global(self, feed_dict):\\n\",\n    \"        self.sess.run([self.update_actor_params, self.update_critic_params], feed_dict)\\n\",\n    \"     \\n\",\n    \"     #we also define a function called pull_from_global for updating the parameters of the\\n\",\n    \"     #worker networks by pulling from the global network, that is, the pull operation\\n\",\n    \"     def pull_from_global(self):\\n\",\n    \"        self.sess.run([self.pull_actor_params, self.pull_critic_params])\\n\",\n    \"     \\n\",\n    \"     #define a function called select_action for 
selecting the action\\n\",\n    \"     def select_action(self, state):   \\n\",\n    \"        state = state[np.newaxis, :]\\n\",\n    \"        return self.sess.run(self.action, {self.state: state})[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the worker class\\n\",\n    \"\\n\",\n    \"Let's define the class called Worker where we will implement the worker agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class Worker(object):\\n\",\n    \"    \\n\",\n    \"    #first, let's define the init method:\\n\",\n    \"    def __init__(self, name, globalAC, sess):\\n\",\n    \"\\n\",\n    \"        #we learned that each worker agent works with their own copies of the environment. So,\\n\",\n    \"        #let's create a mountain car environment\\n\",\n    \"        self.env = gym.make('MountainCarContinuous-v0').unwrapped\\n\",\n    \"        \\n\",\n    \"        #define the name of the worker\\n\",\n    \"        self.name = name\\n\",\n    \"    \\n\",\n    \"        #create an object to our ActorCritic class\\n\",\n    \"        self.AC = ActorCritic(name, sess, globalAC)\\n\",\n    \"        \\n\",\n    \"        #initialize a TensorFlow session\\n\",\n    \"        self.sess=sess\\n\",\n    \"        \\n\",\n    \"    #define a function called work for the worker to learn:\\n\",\n    \"    def work(self):\\n\",\n    \"        global global_rewards, global_episodes\\n\",\n    \"        \\n\",\n    \"        #initialize the time step\\n\",\n    \"        total_step = 1\\n\",\n    \"     \\n\",\n    \"        #initialize a list for storing the states, actions, and rewards\\n\",\n    \"        batch_states, batch_actions, batch_rewards = [], [], []\\n\",\n    \"        \\n\",\n    \"        #when the global episodes are less than the number of episodes and coordinator is active\\n\",\n    \"        
while not coord.should_stop() and global_episodes < num_episodes:\\n\",\n    \"            \\n\",\n    \"            #initialize the state by resetting the environment\\n\",\n    \"            state = self.env.reset()\\n\",\n    \"            \\n\",\n    \"            #initialize the return\\n\",\n    \"            Return = 0\\n\",\n    \"            \\n\",\n    \"            #for each step in the environment\\n\",\n    \"            for t in range(num_timesteps):\\n\",\n    \"                \\n\",\n    \"                #render the environment only for worker 0:\\n\",\n    \"                if self.name == 'W_0':\\n\",\n    \"                    self.env.render()\\n\",\n    \"                    \\n\",\n    \"                #select the action\\n\",\n    \"                action = self.AC.select_action(state)\\n\",\n    \"                \\n\",\n    \"                #perform the selected action\\n\",\n    \"                next_state, reward, done, _ = self.env.step(action)\\n\",\n    \"                \\n\",\n    \"                #set done to true if we have reached the final step of the episode, else set it to false\\n\",\n    \"                done = True if t == num_timesteps - 1 else False\\n\",\n    \"                \\n\",\n    \"                #update the return\\n\",\n    \"                Return += reward\\n\",\n    \"                \\n\",\n    \"                #store the state, action, and reward into the lists\\n\",\n    \"                batch_states.append(state)\\n\",\n    \"                batch_actions.append(action)\\n\",\n    \"                batch_rewards.append((reward+8)/8)\\n\",\n    \"    \\n\",\n    \"                #now, let's update the global network. 
If done is true, then set the value of the next state to 0; else\\n\",\n    \"                #compute the value of the next state\\n\",\n    \"                if total_step % update_global == 0 or done:\\n\",\n    \"                    if done:\\n\",\n    \"                        v_s_ = 0\\n\",\n    \"                    else:\\n\",\n    \"                        v_s_ = self.sess.run(self.AC.value, {self.AC.state: next_state[np.newaxis, :]})[0, 0]\\n\",\n    \" \\n\",\n    \"                    batch_target_value = []\\n\",\n    \"                    \\n\",\n    \"                    #compute the target value, which is the sum of the reward and the discounted value of the next state\\n\",\n    \"                    for reward in batch_rewards[::-1]:\\n\",\n    \"                        v_s_ = reward + gamma * v_s_\\n\",\n    \"                        batch_target_value.append(v_s_)\\n\",\n    \"\\n\",\n    \"                    #reverse the target values\\n\",\n    \"                    batch_target_value.reverse()\\n\",\n    \"                    \\n\",\n    \"                    #stack the states, actions, and target values\\n\",\n    \"                    batch_states, batch_actions, batch_target_value = np.vstack(batch_states), np.vstack(batch_actions), np.vstack(batch_target_value)\\n\",\n    \"                    \\n\",\n    \"                    #define the feed dictionary\\n\",\n    \"                    feed_dict = {\\n\",\n    \"                                 self.AC.state: batch_states,\\n\",\n    \"                                 self.AC.action_dist: batch_actions,\\n\",\n    \"                                 self.AC.target_value: batch_target_value,\\n\",\n    \"                                 }\\n\",\n    \"                    \\n\",\n    \"                    #update the global network\\n\",\n    \"                    self.AC.update_global(feed_dict)\\n\",\n    \"                    \\n\",\n    \"                    #empty the lists:\\n\",\n    \"                    
batch_states, batch_actions, batch_rewards = [], [], []\\n\",\n    \"                    \\n\",\n    \"                    #update the worker network by pulling the parameters from the global network:\\n\",\n    \"                    self.AC.pull_from_global()\\n\",\n    \"                    \\n\",\n    \"                #update the state to the next state and increment the total step:\\n\",\n    \"                state = next_state\\n\",\n    \"                total_step += 1\\n\",\n    \"                \\n\",\n    \"                #update global rewards:\\n\",\n    \"                if done:\\n\",\n    \"                    if len(global_rewards) < 5:\\n\",\n    \"                        global_rewards.append(Return)\\n\",\n    \"                    else:\\n\",\n    \"                        global_rewards.append(Return)\\n\",\n    \"                        global_rewards[-1] =(np.mean(global_rewards[-5:]))\\n\",\n    \"                    \\n\",\n    \"                    global_episodes += 1\\n\",\n    \"                    break\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. 
Initialize the global rewards list and also initialize the\\n\",\n    \"global episodes counter:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"global_rewards = []\\n\",\n    \"global_episodes = 0\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start the TensorFlow session:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sess = tf.Session()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.device(\\\"/cpu:0\\\"):\\n\",\n    \"    \\n\",\n    \"    #create a global agent\\n\",\n    \"    global_agent = ActorCritic(global_net_scope,sess)\\n\",\n    \"    worker_agents = []\\n\",\n    \"    \\n\",\n    \"    #create n number of worker agent:\\n\",\n    \"    for i in range(num_workers):\\n\",\n    \"        i_name = 'W_%i' % i\\n\",\n    \"        worker_agents.append(Worker(i_name, global_agent,sess))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the TensorFlow coordinator:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"coord = tf.train.Coordinator()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize all the TensorFlow variables:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sess.run(tf.global_variables_initializer())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Store the TensorFlow computational graph to the log directory:\"\n   ]\n  },\n  {\n   
\"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"if os.path.exists(log_dir):\\n\",\n    \"    shutil.rmtree(log_dir)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 24,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<tensorflow.python.summary.writer.writer.FileWriter at 0x7ff8025eea20>\"\n      ]\n     },\n     \"execution_count\": 24,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.summary.FileWriter(log_dir, sess.graph)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, run the worker threads:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"worker_threads = []\\n\",\n    \"for worker in worker_agents:\\n\",\n    \"\\n\",\n    \"    job = lambda: worker.work()\\n\",\n    \"    thread = threading.Thread(target=job)\\n\",\n    \"    thread.start()\\n\",\n    \"    worker_threads.append(thread)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"coord.join(worker_threads)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For a better understanding of A3C architecture, let's take a look at the computational graph\\n\",\n    \"of A3C in the next section. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/.ipynb_checkpoints/9.05. Mountain Car Climbing using A3C-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Mountain car climbing using A3C\\n\",\n    \"\\n\",\n    \"Let's implement the A3C algorithm for the mountain car climbing task. In the mountain car\\n\",\n    \"climbing environment, a car is placed between the two mountains and the goal of the agent\\n\",\n    \"is to drive up the mountain on the right. But the problem is, the agent can't drive up the\\n\",\n    \"mountain in one pass. So, the agent has to drive back and forth to build momentum to\\n\",\n    \"drive up the mountain on the right. A high reward will be assigned if our agent spends less\\n\",\n    \"energy on driving up. The Mountain car environment is shown in the below figure:\\n\",\n    \"\\n\",\n    \"TBA\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The code used in this section is adapted from the open-source implementation of A3C\\n\",\n    \"(https://github.com/stefanbo92/A3C-Continuous) provided by Stefan Boschenriedter.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"non-resource variables are not supported in the long term\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"import multiprocessing\\n\",\n    \"import threading\\n\",\n    \"import numpy as 
np\\n\",\n    \"import os\\n\",\n    \"import shutil\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import tensorflow as tf\\n\",\n    \"\\n\",\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the mountain car environment\\n\",\n    \"\\n\",\n    \"Let's create a mountain car environment using Gym. Note that our mountain car\\n\",\n    \"environment is a continuous environment, meaning that our action space is continuous:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('MountainCarContinuous-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that we created the continuous mountain car environment and thus our action space\\n\",\n    \"consists of continuous values. 
So, we get the bound of our action space: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\\n\",\n    \"\\n\",\n    \"Define the number of workers as the number of CPUs:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_workers = multiprocessing.cpu_count() \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 2000 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 200 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the global network (global agent) scope:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"global_net_scope = 'Global_Net'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the time step at which we want to update the global network:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"update_global = 10\"\n   ]\n  },\n  {\n   
\"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.90 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the beta value:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"beta = 0.01 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the directory where we want to store the logs:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"log_dir = 'logs'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the actor critic class\\n\",\n    \"\\n\",\n    \"We learned that in A3C both the global and worker agents follow the actor critic\\n\",\n    \"architecture. So, let's define the class called ActorCritic where we will implement the\\n\",\n    \"actor critic algorithm. 
For a clear understanding, you can refer to the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class ActorCritic(object):\\n\",\n    \"    \\n\",\n    \"     #first, let's define the init method\\n\",\n    \"     def __init__(self, scope, sess, globalAC=None):\\n\",\n    \"            \\n\",\n    \"        #initialize the TensorFlow session\\n\",\n    \"        self.sess=sess\\n\",\n    \"        \\n\",\n    \"        #define the actor network optimizer as RMSProp\\n\",\n    \"        self.actor_optimizer = tf.train.RMSPropOptimizer(0.0001, name='RMSPropA')\\n\",\n    \"        \\n\",\n    \"        #define the critic network optimizer as RMSProp\\n\",\n    \"        self.critic_optimizer = tf.train.RMSPropOptimizer(0.001, name='RMSPropC')\\n\",\n    \" \\n\",\n    \"        #if the scope is the global network (global agent)\\n\",\n    \"        if scope == global_net_scope:\\n\",\n    \"            with tf.variable_scope(scope):\\n\",\n    \"                    \\n\",\n    \"                #define the placeholder for the state\\n\",\n    \"                self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"                \\n\",\n    \"                #build the global network (global agent) and get the actor and critic parameters\\n\",\n    \"                self.actor_params, self.critic_params = self.build_network(scope)[-2:]\\n\",\n    \"      \\n\",\n    \"        #if the network is not the global network then\\n\",\n    \"        else:\\n\",\n    \"            with tf.variable_scope(scope):\\n\",\n    \"                \\n\",\n    \"                #define the placeholder for the state\\n\",\n    \"                self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"                \\n\",\n    \"                #we learned that our environment is a continuous 
environment, so the actor network\\n\",\n    \"                #(policy network) returns the mean and variance of the action and then we build the action\\n\",\n    \"                #distribution out of this mean and variance and select the action based on this action \\n\",\n    \"                #distribution. \\n\",\n    \"                \\n\",\n    \"                #define the placeholder for obtaining the action distribution\\n\",\n    \"                self.action_dist = tf.placeholder(tf.float32, [None, action_shape], 'action')\\n\",\n    \"                \\n\",\n    \"                #define the placeholder for the target value\\n\",\n    \"                self.target_value = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\\n\",\n    \"                \\n\",\n    \"                #build the worker network (worker agent) and get the mean and variance of the action, the\\n\",\n    \"                #value of the state, and actor and critic network parameters:\\n\",\n    \"                mean, variance, self.value, self.actor_params, self.critic_params = self.build_network(scope)\\n\",\n    \"\\n\",\n    \"                #Compute the TD error which is the difference between the target value of the state and the\\n\",\n    \"                #predicted value of the state\\n\",\n    \"                td_error = tf.subtract(self.target_value, self.value, name='TD_error')\\n\",\n    \"    \\n\",\n    \"                #now, let's define the critic network loss\\n\",\n    \"                with tf.name_scope('critic_loss'):\\n\",\n    \"                    self.critic_loss = tf.reduce_mean(tf.square(td_error))\\n\",\n    \"                    \\n\",\n    \"                with tf.name_scope('wrap_action'):\\n\",\n    \"                    mean, variance = mean * action_bound[1], variance + 1e-4\\n\",\n    \"                    \\n\",\n    \"                #create a normal distribution based on the mean and variance of the action\\n\",\n    \"                
normal_dist = tf.distributions.Normal(mean, variance)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"                #now, let's define the actor network loss\\n\",\n    \"                with tf.name_scope('actor_loss'):\\n\",\n    \"                    \\n\",\n    \"                    #compute the log probability of the action\\n\",\n    \"                    log_prob = normal_dist.log_prob(self.action_dist)\\n\",\n    \"         \\n\",\n    \"                    #define the entropy of the policy\\n\",\n    \"                    entropy_pi = normal_dist.entropy()\\n\",\n    \"                    \\n\",\n    \"                    #compute the actor network loss\\n\",\n    \"                    self.loss = log_prob * td_error + (beta * entropy_pi)\\n\",\n    \"                    self.actor_loss = tf.reduce_mean(-self.loss)\\n\",\n    \"       \\n\",\n    \"                #select the action based on the normal distribution\\n\",\n    \"                with tf.name_scope('select_action'):\\n\",\n    \"                    self.action = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=0), \\n\",\n    \"                                                   action_bound[0], action_bound[1])\\n\",\n    \"     \\n\",\n    \"        \\n\",\n    \"                #compute the gradients of actor and critic network loss of the worker agent (local agent)\\n\",\n    \"                with tf.name_scope('local_grad'):\\n\",\n    \"\\n\",\n    \"                    self.actor_grads = tf.gradients(self.actor_loss, self.actor_params)\\n\",\n    \"                    self.critic_grads = tf.gradients(self.critic_loss, self.critic_params)\\n\",\n    \" \\n\",\n    \"            #now, let's perform the sync operation\\n\",\n    \"            with tf.name_scope('sync'):\\n\",\n    \"                \\n\",\n    \"                #after computing the gradients of the loss of the actor and critic network, worker agent\\n\",\n    \"                #sends (push) those gradients to the global 
agent\\n\",\n    \"                with tf.name_scope('push'):\\n\",\n    \"                    self.update_actor_params = self.actor_optimizer.apply_gradients(zip(self.actor_grads,\\n\",\n    \"                                                                                        globalAC.actor_params))\\n\",\n    \"                    self.update_critic_params = self.critic_optimizer.apply_gradients(zip(self.critic_grads, \\n\",\n    \"                                                                                          globalAC.critic_params))\\n\",\n    \"\\n\",\n    \"                #the global agent updates its parameters with the gradients received from the worker agents\\n\",\n    \"                #(local agents). Then the worker agents pull the updated parameters from the global agent\\n\",\n    \"                with tf.name_scope('pull'):\\n\",\n    \"                    self.pull_actor_params = [l_p.assign(g_p) for l_p, g_p in zip(self.actor_params, \\n\",\n    \"                                                                                  globalAC.actor_params)]\\n\",\n    \"                    self.pull_critic_params = [l_p.assign(g_p) for l_p, g_p in zip(self.critic_params, \\n\",\n    \"                                                                                   globalAC.critic_params)]\\n\",\n    \"                \\n\",\n    \"\\n\",\n    \"     #let's define the function for building the actor critic network\\n\",\n    \"     def build_network(self, scope):\\n\",\n    \"            \\n\",\n    \"        #initialize the weights:\\n\",\n    \"        w_init = tf.random_normal_initializer(0., .1)\\n\",\n    \"        \\n\",\n    \"        #define the actor network which returns the mean and variance of the action\\n\",\n    \"        with tf.variable_scope('actor'):\\n\",\n    \"            l_a = tf.layers.dense(self.state, 200, tf.nn.relu, kernel_initializer=w_init, name='la')\\n\",\n    \"            mean = tf.layers.dense(l_a, 
action_shape, tf.nn.tanh,kernel_initializer=w_init, name='mean')\\n\",\n    \"            variance = tf.layers.dense(l_a, action_shape, tf.nn.softplus, kernel_initializer=w_init, name='variance')\\n\",\n    \"            \\n\",\n    \"        #define the critic network which returns the value of the state\\n\",\n    \"        with tf.variable_scope('critic'):\\n\",\n    \"            l_c = tf.layers.dense(self.state, 100, tf.nn.relu, kernel_initializer=w_init, name='lc')\\n\",\n    \"            value = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='value')\\n\",\n    \"        \\n\",\n    \"        actor_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\\n\",\n    \"        critic_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\\n\",\n    \"        \\n\",\n    \"        #Return the mean and variance of the action produced by the actor network, value of the\\n\",\n    \"        #state computed by the critic network and the parameters of the actor and critic network\\n\",\n    \"        \\n\",\n    \"        return mean, variance, value, actor_params, critic_params\\n\",\n    \"    \\n\",\n    \"     #let's define a function called update_global for updating the parameters of the global\\n\",\n    \"     #network with the gradients of loss computed by the worker networks, that is, the push operation\\n\",\n    \"     def update_global(self, feed_dict):\\n\",\n    \"        self.sess.run([self.update_actor_params, self.update_critic_params], feed_dict)\\n\",\n    \"     \\n\",\n    \"     #we also define a function called pull_from_global for updating the parameters of the\\n\",\n    \"     #worker networks by pulling from the global network, that is, the pull operation\\n\",\n    \"     def pull_from_global(self):\\n\",\n    \"        self.sess.run([self.pull_actor_params, self.pull_critic_params])\\n\",\n    \"     \\n\",\n    \"     #define a function called select_action for 
selecting the action\\n\",\n    \"     def select_action(self, state):   \\n\",\n    \"        state = state[np.newaxis, :]\\n\",\n    \"        return self.sess.run(self.action, {self.state: state})[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the worker class\\n\",\n    \"\\n\",\n    \"Let's define the class called Worker where we will implement the worker agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class Worker(object):\\n\",\n    \"    \\n\",\n    \"    #first, let's define the init method:\\n\",\n    \"    def __init__(self, name, globalAC, sess):\\n\",\n    \"\\n\",\n    \"        #we learned that each worker agent works with their own copies of the environment. So,\\n\",\n    \"        #let's create a mountain car environment\\n\",\n    \"        self.env = gym.make('MountainCarContinuous-v0').unwrapped\\n\",\n    \"        \\n\",\n    \"        #define the name of the worker\\n\",\n    \"        self.name = name\\n\",\n    \"    \\n\",\n    \"        #create an object to our ActorCritic class\\n\",\n    \"        self.AC = ActorCritic(name, sess, globalAC)\\n\",\n    \"        \\n\",\n    \"        #initialize a TensorFlow session\\n\",\n    \"        self.sess=sess\\n\",\n    \"        \\n\",\n    \"    #define a function called work for the worker to learn:\\n\",\n    \"    def work(self):\\n\",\n    \"        global global_rewards, global_episodes\\n\",\n    \"        \\n\",\n    \"        #initialize the time step\\n\",\n    \"        total_step = 1\\n\",\n    \"     \\n\",\n    \"        #initialize a list for storing the states, actions, and rewards\\n\",\n    \"        batch_states, batch_actions, batch_rewards = [], [], []\\n\",\n    \"        \\n\",\n    \"        #when the global episodes are less than the number of episodes and coordinator is active\\n\",\n    \"        
while not coord.should_stop() and global_episodes < num_episodes:\\n\",\n    \"            \\n\",\n    \"            #initialize the state by resetting the environment\\n\",\n    \"            state = self.env.reset()\\n\",\n    \"            \\n\",\n    \"            #initialize the return\\n\",\n    \"            Return = 0\\n\",\n    \"            \\n\",\n    \"            #for each step in the environment\\n\",\n    \"            for t in range(num_timesteps):\\n\",\n    \"                \\n\",\n    \"                #render the environment only for worker 0:\\n\",\n    \"                if self.name == 'W_0':\\n\",\n    \"                    self.env.render()\\n\",\n    \"                    \\n\",\n    \"                #select the action\\n\",\n    \"                action = self.AC.select_action(state)\\n\",\n    \"                \\n\",\n    \"                #perform the selected action\\n\",\n    \"                next_state, reward, done, _ = self.env.step(action)\\n\",\n    \"                \\n\",\n    \"                #set done to true if we have reached the final step of the episode, else set it to false\\n\",\n    \"                done = True if t == num_timesteps - 1 else False\\n\",\n    \"                \\n\",\n    \"                #update the return\\n\",\n    \"                Return += reward\\n\",\n    \"                \\n\",\n    \"                #store the state, action, and reward into the lists\\n\",\n    \"                batch_states.append(state)\\n\",\n    \"                batch_actions.append(action)\\n\",\n    \"                batch_rewards.append((reward+8)/8)\\n\",\n    \"    \\n\",\n    \"                #now, let's update the global network. 
If done is true, then set the value of the next state to 0; else\\n\",\n    \"                #compute the value of the next state\\n\",\n    \"                if total_step % update_global == 0 or done:\\n\",\n    \"                    if done:\\n\",\n    \"                        v_s_ = 0\\n\",\n    \"                    else:\\n\",\n    \"                        v_s_ = self.sess.run(self.AC.value, {self.AC.state: next_state[np.newaxis, :]})[0, 0]\\n\",\n    \" \\n\",\n    \"                    batch_target_value = []\\n\",\n    \"                    \\n\",\n    \"                    #compute the target value, which is the sum of the reward and the discounted value of the next state\\n\",\n    \"                    for reward in batch_rewards[::-1]:\\n\",\n    \"                        v_s_ = reward + gamma * v_s_\\n\",\n    \"                        batch_target_value.append(v_s_)\\n\",\n    \"\\n\",\n    \"                    #reverse the target values\\n\",\n    \"                    batch_target_value.reverse()\\n\",\n    \"                    \\n\",\n    \"                    #stack the states, actions, and target values\\n\",\n    \"                    batch_states, batch_actions, batch_target_value = np.vstack(batch_states), np.vstack(batch_actions), np.vstack(batch_target_value)\\n\",\n    \"                    \\n\",\n    \"                    #define the feed dictionary\\n\",\n    \"                    feed_dict = {\\n\",\n    \"                                 self.AC.state: batch_states,\\n\",\n    \"                                 self.AC.action_dist: batch_actions,\\n\",\n    \"                                 self.AC.target_value: batch_target_value,\\n\",\n    \"                                 }\\n\",\n    \"                    \\n\",\n    \"                    #update the global network\\n\",\n    \"                    self.AC.update_global(feed_dict)\\n\",\n    \"                    \\n\",\n    \"                    #empty the lists:\\n\",\n    \"                    
batch_states, batch_actions, batch_rewards = [], [], []\\n\",\n    \"                    \\n\",\n    \"                    #update the worker network by pulling the parameters from the global network:\\n\",\n    \"                    self.AC.pull_from_global()\\n\",\n    \"                    \\n\",\n    \"                #update the state to the next state and increment the total step:\\n\",\n    \"                state = next_state\\n\",\n    \"                total_step += 1\\n\",\n    \"                \\n\",\n    \"                #update the global rewards:\\n\",\n    \"                if done:\\n\",\n    \"                    if len(global_rewards) < 5:\\n\",\n    \"                        global_rewards.append(Return)\\n\",\n    \"                    else:\\n\",\n    \"                        global_rewards.append(Return)\\n\",\n    \"                        global_rewards[-1] = np.mean(global_rewards[-5:])\\n\",\n    \"                    \\n\",\n    \"                    global_episodes += 1\\n\",\n    \"                    break\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. 
Initialize the global rewards list and also initialize the\\n\",\n    \"global episodes counter:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"global_rewards = []\\n\",\n    \"global_episodes = 0\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start the TensorFlow session:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sess = tf.Session()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From <ipython-input-14-836e004bd1a0>:115: dense (from tensorflow.python.keras.legacy_tf_layers.core) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"Use keras.layers.Dense instead.\\n\",\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/keras/legacy_tf_layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"Please use `layer.__call__` method instead.\\n\",\n      \"WARNING:tensorflow:From <ipython-input-14-836e004bd1a0>:59: Normal.__init__ (from tensorflow.python.ops.distributions.normal) is deprecated and will be removed after 2019-01-01.\\n\",\n      \"Instructions for updating:\\n\",\n      \"The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). 
You should update all references to use `tfp.distributions` instead of `tf.distributions`.\\n\",\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/ops/distributions/normal.py:160: Distribution.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.\\n\",\n      \"Instructions for updating:\\n\",\n      \"The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.\\n\",\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/training/rmsprop.py:123: calling Ones.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"Call initializer instance with the dtype argument instead of passing it to the constructor\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"with tf.device(\\\"/cpu:0\\\"):\\n\",\n    \"    \\n\",\n    \"    #create a global agent\\n\",\n    \"    global_agent = ActorCritic(global_net_scope,sess)\\n\",\n    \"    worker_agents = []\\n\",\n    \"    \\n\",\n    \"    #create n number of worker agent:\\n\",\n    \"    for i in range(num_workers):\\n\",\n    \"        i_name = 'W_%i' % i\\n\",\n    \"        worker_agents.append(Worker(i_name, global_agent,sess))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the TensorFlow coordinator:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"coord = tf.train.Coordinator()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize all the TensorFlow 
variables:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sess.run(tf.global_variables_initializer())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Store the TensorFlow computational graph to the log directory:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"if os.path.exists(log_dir):\\n\",\n    \"    shutil.rmtree(log_dir)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<tensorflow.python.summary.writer.writer.FileWriter at 0x7f3dd036ada0>\"\n      ]\n     },\n     \"execution_count\": 22,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.summary.FileWriter(log_dir, sess.graph)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, run the worker threads:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"worker_threads = []\\n\",\n    \"for worker in worker_agents:\\n\",\n    \"\\n\",\n    \"    job = lambda: worker.work()\\n\",\n    \"    thread = threading.Thread(target=job)\\n\",\n    \"    thread.start()\\n\",\n    \"    worker_threads.append(thread)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"coord.join(worker_threads)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For a better understanding of A3C architecture, let's take a look at the computational graph\\n\",\n    \"of A3C in the next section. 
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/11.05. Mountain Car Climbing using A3C.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Mountain car climbing using A3C\\n\",\n    \"\\n\",\n    \"Let's implement the A3C algorithm for the mountain car climbing task. In the mountain car\\n\",\n    \"climbing environment, a car is placed between two mountains, and the goal of the agent\\n\",\n    \"is to drive up the mountain on the right. But the problem is, the agent can't drive up the\\n\",\n    \"mountain in one pass. So, the agent has to drive back and forth to build momentum to\\n\",\n    \"drive up the mountain on the right. A high reward is assigned if our agent spends less\\n\",\n    \"energy driving up. The mountain car environment is shown in the figure below:\\n\",\n    \"\\n\",\n    \"![title](Images/2.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The code used in this section is adapted from the open-source implementation of A3C\\n\",\n    \"(https://github.com/stefanbo92/A3C-Continuous) provided by Stefan Boschenriedter.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"import multiprocessing\\n\",\n    \"import threading\\n\",\n    \"import numpy as np\\n\",\n    \"import os\\n\",\n    \"import shutil\\n\",\n    \"import matplotlib.pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   
\"metadata\": {},\n   \"source\": [\n    \"For a clear understanding of how the A3C method works, we use\\n\",\n    \"TensorFlow in non-eager mode by disabling TensorFlow 2 behavior.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the mountain car environment\\n\",\n    \"\\n\",\n    \"Let's create a mountain car environment using Gym. Note that our mountain car\\n\",\n    \"environment is a continuous environment, meaning that our action space is continuous:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('MountainCarContinuous-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that we created the continuous mountain car environment and thus our action space\\n\",\n    \"consists of continuous values. 
So, we get the bound of our action space: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\\n\",\n    \"\\n\",\n    \"Define the number of workers as the number of CPUs:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_workers = multiprocessing.cpu_count() \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 2000 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 200 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the global network (global agent) scope:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"global_net_scope = 'Global_Net'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the time step at which we want to update the global network:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"update_global = 10\"\n   ]\n  },\n  {\n 
  \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.90 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the beta value:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"beta = 0.01 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the directory where we want to store the logs:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"log_dir = 'logs'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the actor critic class\\n\",\n    \"\\n\",\n    \"We learned that in A3C both the global and worker agents follow the actor critic\\n\",\n    \"architecture. So, let's define the class called ActorCritic where we will implement the\\n\",\n    \"actor critic algorithm. 
For a clear understanding, you can check the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class ActorCritic(object):\\n\",\n    \"    \\n\",\n    \"     #first, let's define the init method\\n\",\n    \"     def __init__(self, scope, sess, globalAC=None):\\n\",\n    \"            \\n\",\n    \"        #initialize the TensorFlow session\\n\",\n    \"        self.sess=sess\\n\",\n    \"        \\n\",\n    \"        #define the actor network optimizer as RMS prop\\n\",\n    \"        self.actor_optimizer = tf.train.RMSPropOptimizer(0.0001, name='RMSPropA')\\n\",\n    \"        \\n\",\n    \"        #define the critic network optimizer as RMS prop\\n\",\n    \"        self.critic_optimizer = tf.train.RMSPropOptimizer(0.001, name='RMSPropC')\\n\",\n    \" \\n\",\n    \"        #if the scope is the global network (global agent)\\n\",\n    \"        if scope == global_net_scope:\\n\",\n    \"            with tf.variable_scope(scope):\\n\",\n    \"                    \\n\",\n    \"                #define the placeholder for the state\\n\",\n    \"                self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"                \\n\",\n    \"                #build the global network (global agent) and get the actor and critic parameters\\n\",\n    \"                self.actor_params, self.critic_params = self.build_network(scope)[-2:]\\n\",\n    \"      \\n\",\n    \"        #if the network is not the global network then\\n\",\n    \"        else:\\n\",\n    \"            with tf.variable_scope(scope):\\n\",\n    \"                \\n\",\n    \"                #define the placeholder for the state\\n\",\n    \"                self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"                \\n\",\n    \"                #we learned that our environment is the continuous 
environment, so the actor network\\n\",\n    \"                #(policy network) returns the mean and variance of the action and then we build the action\\n\",\n    \"                #distribution out of this mean and variance and select the action based on this action \\n\",\n    \"                #distribution. \\n\",\n    \"                \\n\",\n    \"                #define the placeholder for obtaining the action distribution\\n\",\n    \"                self.action_dist = tf.placeholder(tf.float32, [None, action_shape], 'action')\\n\",\n    \"                \\n\",\n    \"                #define the placeholder for the target value\\n\",\n    \"                self.target_value = tf.placeholder(tf.float32, [None, 1], 'Vtarget')\\n\",\n    \"                \\n\",\n    \"                #build the worker network (worker agent) and get the mean and variance of the action, the\\n\",\n    \"                #value of the state, and actor and critic network parameters:\\n\",\n    \"                mean, variance, self.value, self.actor_params, self.critic_params = self.build_network(scope)\\n\",\n    \"\\n\",\n    \"                #Compute the TD error which is the difference between the target value of the state and the\\n\",\n    \"                #predicted value of the state\\n\",\n    \"                td_error = tf.subtract(self.target_value, self.value, name='TD_error')\\n\",\n    \"    \\n\",\n    \"                #now, let's define the critic network loss\\n\",\n    \"                with tf.name_scope('critic_loss'):\\n\",\n    \"                    self.critic_loss = tf.reduce_mean(tf.square(td_error))\\n\",\n    \"                    \\n\",\n    \"                with tf.name_scope('wrap_action'):\\n\",\n    \"                    mean, variance = mean * action_bound[1], variance + 1e-4\\n\",\n    \"                    \\n\",\n    \"                #create a normal distribution based on the mean and variance of the action\\n\",\n    \"                
normal_dist = tf.distributions.Normal(mean, variance)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"                #now, let's define the actor network loss\\n\",\n    \"                with tf.name_scope('actor_loss'):\\n\",\n    \"                    \\n\",\n    \"                    #compute the log probability of the action\\n\",\n    \"                    log_prob = normal_dist.log_prob(self.action_dist)\\n\",\n    \"         \\n\",\n    \"                    #define the entropy of the policy\\n\",\n    \"                    entropy_pi = normal_dist.entropy()\\n\",\n    \"                    \\n\",\n    \"                    #compute the actor network loss\\n\",\n    \"                    self.loss = log_prob * td_error + (beta * entropy_pi)\\n\",\n    \"                    self.actor_loss = tf.reduce_mean(-self.loss)\\n\",\n    \"       \\n\",\n    \"                #select the action based on the normal distribution\\n\",\n    \"                with tf.name_scope('select_action'):\\n\",\n    \"                    self.action = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=0), \\n\",\n    \"                                                   action_bound[0], action_bound[1])\\n\",\n    \"     \\n\",\n    \"        \\n\",\n    \"                #compute the gradients of actor and critic network loss of the worker agent (local agent)\\n\",\n    \"                with tf.name_scope('local_grad'):\\n\",\n    \"\\n\",\n    \"                    self.actor_grads = tf.gradients(self.actor_loss, self.actor_params)\\n\",\n    \"                    self.critic_grads = tf.gradients(self.critic_loss, self.critic_params)\\n\",\n    \" \\n\",\n    \"            #now, let's perform the sync operation\\n\",\n    \"            with tf.name_scope('sync'):\\n\",\n    \"                \\n\",\n    \"                #after computing the gradients of the loss of the actor and critic network, worker agent\\n\",\n    \"                #sends (push) those gradients to the global 
agent\\n\",\n    \"                with tf.name_scope('push'):\\n\",\n    \"                    self.update_actor_params = self.actor_optimizer.apply_gradients(zip(self.actor_grads,\\n\",\n    \"                                                                                        globalAC.actor_params))\\n\",\n    \"                    self.update_critic_params = self.critic_optimizer.apply_gradients(zip(self.critic_grads, \\n\",\n    \"                                                                                          globalAC.critic_params))\\n\",\n    \"\\n\",\n    \"                #the global agent updates its parameters with the gradients received from the worker agents\\n\",\n    \"                #(local agents). Then the worker agents pull the updated parameters from the global agent\\n\",\n    \"                with tf.name_scope('pull'):\\n\",\n    \"                    self.pull_actor_params = [l_p.assign(g_p) for l_p, g_p in zip(self.actor_params, \\n\",\n    \"                                                                                  globalAC.actor_params)]\\n\",\n    \"                    self.pull_critic_params = [l_p.assign(g_p) for l_p, g_p in zip(self.critic_params, \\n\",\n    \"                                                                                   globalAC.critic_params)]\\n\",\n    \"                \\n\",\n    \"\\n\",\n    \"     #let's define the function for building the actor critic network\\n\",\n    \"     def build_network(self, scope):\\n\",\n    \"            \\n\",\n    \"        #initialize the weights:\\n\",\n    \"        w_init = tf.random_normal_initializer(0., .1)\\n\",\n    \"        \\n\",\n    \"        #define the actor network which returns the mean and variance of the action\\n\",\n    \"        with tf.variable_scope('actor'):\\n\",\n    \"            l_a = tf.layers.dense(self.state, 200, tf.nn.relu, kernel_initializer=w_init, name='la')\\n\",\n    \"            mean = tf.layers.dense(l_a, 
action_shape, tf.nn.tanh,kernel_initializer=w_init, name='mean')\\n\",\n    \"            variance = tf.layers.dense(l_a, action_shape, tf.nn.softplus, kernel_initializer=w_init, name='variance')\\n\",\n    \"            \\n\",\n    \"        #define the critic network which returns the value of the state\\n\",\n    \"        with tf.variable_scope('critic'):\\n\",\n    \"            l_c = tf.layers.dense(self.state, 100, tf.nn.relu, kernel_initializer=w_init, name='lc')\\n\",\n    \"            value = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='value')\\n\",\n    \"        \\n\",\n    \"        actor_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')\\n\",\n    \"        critic_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')\\n\",\n    \"        \\n\",\n    \"        #Return the mean and variance of the action produced by the actor network, value of the\\n\",\n    \"        #state computed by the critic network and the parameters of the actor and critic network\\n\",\n    \"        \\n\",\n    \"        return mean, variance, value, actor_params, critic_params\\n\",\n    \"    \\n\",\n    \"     #let's define a function called update_global for updating the parameters of the global\\n\",\n    \"     #network with the gradients of loss computed by the worker networks, that is, the push operation\\n\",\n    \"     def update_global(self, feed_dict):\\n\",\n    \"        self.sess.run([self.update_actor_params, self.update_critic_params], feed_dict)\\n\",\n    \"     \\n\",\n    \"     #we also define a function called pull_from_global for updating the parameters of the\\n\",\n    \"     #worker networks by pulling from the global network, that is, the pull operation\\n\",\n    \"     def pull_from_global(self):\\n\",\n    \"        self.sess.run([self.pull_actor_params, self.pull_critic_params])\\n\",\n    \"     \\n\",\n    \"     #define a function called select_action for 
selecting the action\\n\",\n    \"     def select_action(self, state):   \\n\",\n    \"        state = state[np.newaxis, :]\\n\",\n    \"        return self.sess.run(self.action, {self.state: state})[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the worker class\\n\",\n    \"\\n\",\n    \"Let's define the class called Worker where we will implement the worker agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class Worker(object):\\n\",\n    \"    \\n\",\n    \"    #first, let's define the init method:\\n\",\n    \"    def __init__(self, name, globalAC, sess):\\n\",\n    \"\\n\",\n    \"        #we learned that each worker agent works with their own copies of the environment. So,\\n\",\n    \"        #let's create a mountain car environment\\n\",\n    \"        self.env = gym.make('MountainCarContinuous-v0').unwrapped\\n\",\n    \"        \\n\",\n    \"        #define the name of the worker\\n\",\n    \"        self.name = name\\n\",\n    \"    \\n\",\n    \"        #create an object to our ActorCritic class\\n\",\n    \"        self.AC = ActorCritic(name, sess, globalAC)\\n\",\n    \"        \\n\",\n    \"        #initialize a TensorFlow session\\n\",\n    \"        self.sess=sess\\n\",\n    \"        \\n\",\n    \"    #define a function called work for the worker to learn:\\n\",\n    \"    def work(self):\\n\",\n    \"        global global_rewards, global_episodes\\n\",\n    \"        \\n\",\n    \"        #initialize the time step\\n\",\n    \"        total_step = 1\\n\",\n    \"     \\n\",\n    \"        #initialize a list for storing the states, actions, and rewards\\n\",\n    \"        batch_states, batch_actions, batch_rewards = [], [], []\\n\",\n    \"        \\n\",\n    \"        #when the global episodes are less than the number of episodes and coordinator is active\\n\",\n    \"        
while not coord.should_stop() and global_episodes < num_episodes:\\n\",\n    \"            \\n\",\n    \"            #initialize the state by resetting the environment\\n\",\n    \"            state = self.env.reset()\\n\",\n    \"            \\n\",\n    \"            #initialize the return\\n\",\n    \"            Return = 0\\n\",\n    \"            \\n\",\n    \"            #for each step in the environment\\n\",\n    \"            for t in range(num_timesteps):\\n\",\n    \"                \\n\",\n    \"                #render the environment only for worker 0:\\n\",\n    \"                if self.name == 'W_0':\\n\",\n    \"                    self.env.render()\\n\",\n    \"                    \\n\",\n    \"                #select the action\\n\",\n    \"                action = self.AC.select_action(state)\\n\",\n    \"                \\n\",\n    \"                #perform the selected action\\n\",\n    \"                next_state, reward, done, _ = self.env.step(action)\\n\",\n    \"                \\n\",\n    \"                #set done to true if we have reached the final step of the episode, else set it to false\\n\",\n    \"                done = True if t == num_timesteps - 1 else False\\n\",\n    \"                \\n\",\n    \"                #update the return\\n\",\n    \"                Return += reward\\n\",\n    \"                \\n\",\n    \"                #store the state, action, and reward in the lists\\n\",\n    \"                batch_states.append(state)\\n\",\n    \"                batch_actions.append(action)\\n\",\n    \"                batch_rewards.append((reward+8)/8)\\n\",\n    \"    \\n\",\n    \"                #now, let's update the global network. 
If done is true then set the value of the next state to 0 else\\n\",\n    \"                #compute the value of the next state\\n\",\n    \"                if total_step % update_global == 0 or done:\\n\",\n    \"                    if done:\\n\",\n    \"                        v_s_ = 0\\n\",\n    \"                    else:\\n\",\n    \"                        v_s_ = self.sess.run(self.AC.value, {self.AC.state: next_state[np.newaxis, :]})[0, 0]\\n\",\n    \" \\n\",\n    \"                    batch_target_value = []\\n\",\n    \"                    \\n\",\n    \"                    #compute the target value, which is the sum of the reward and the discounted value of the next state\\n\",\n    \"                    for reward in batch_rewards[::-1]:\\n\",\n    \"                        v_s_ = reward + gamma * v_s_\\n\",\n    \"                        batch_target_value.append(v_s_)\\n\",\n    \"\\n\",\n    \"                    #reverse the target values\\n\",\n    \"                    batch_target_value.reverse()\\n\",\n    \"                    \\n\",\n    \"                    #stack the states, actions and target values\\n\",\n    \"                    batch_states, batch_actions, batch_target_value = np.vstack(batch_states), np.vstack(batch_actions), np.vstack(batch_target_value)\\n\",\n    \"                    \\n\",\n    \"                    #define the feed dictionary\\n\",\n    \"                    feed_dict = {\\n\",\n    \"                                 self.AC.state: batch_states,\\n\",\n    \"                                 self.AC.action_dist: batch_actions,\\n\",\n    \"                                 self.AC.target_value: batch_target_value,\\n\",\n    \"                                 }\\n\",\n    \"                    \\n\",\n    \"                    #update the global network\\n\",\n    \"                    self.AC.update_global(feed_dict)\\n\",\n    \"                    \\n\",\n    \"                    #empty the lists:\\n\",\n    \"                    
batch_states, batch_actions, batch_rewards = [], [], []\\n\",\n    \"                    \\n\",\n    \"                    #update the worker network by pulling the parameters from the global network:\\n\",\n    \"                    self.AC.pull_from_global()\\n\",\n    \"                    \\n\",\n    \"                #update the state to the next state and increment the total step:\\n\",\n    \"                state = next_state\\n\",\n    \"                total_step += 1\\n\",\n    \"                \\n\",\n    \"                #update the global rewards:\\n\",\n    \"                if done:\\n\",\n    \"                    if len(global_rewards) < 5:\\n\",\n    \"                        global_rewards.append(Return)\\n\",\n    \"                    else:\\n\",\n    \"                        global_rewards.append(Return)\\n\",\n    \"                        global_rewards[-1] = np.mean(global_rewards[-5:])\\n\",\n    \"                    \\n\",\n    \"                    global_episodes += 1\\n\",\n    \"                    break\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. 
Initialize the global rewards list and also initialize the\\n\",\n    \"global episodes counter:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"global_rewards = []\\n\",\n    \"global_episodes = 0\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start the TensorFlow session:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sess = tf.Session()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"with tf.device(\\\"/cpu:0\\\"):\\n\",\n    \"    \\n\",\n    \"    #create a global agent\\n\",\n    \"    global_agent = ActorCritic(global_net_scope,sess)\\n\",\n    \"    worker_agents = []\\n\",\n    \"    \\n\",\n    \"    #create n number of worker agent:\\n\",\n    \"    for i in range(num_workers):\\n\",\n    \"        i_name = 'W_%i' % i\\n\",\n    \"        worker_agents.append(Worker(i_name, global_agent,sess))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the TensorFlow coordinator:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"coord = tf.train.Coordinator()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize all the TensorFlow variables:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sess.run(tf.global_variables_initializer())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Store the TensorFlow computational graph to the log directory:\"\n   ]\n  },\n  {\n   
\"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"if os.path.exists(log_dir):\\n\",\n    \"    shutil.rmtree(log_dir)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 24,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<tensorflow.python.summary.writer.writer.FileWriter at 0x7ff8025eea20>\"\n      ]\n     },\n     \"execution_count\": 24,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"tf.summary.FileWriter(log_dir, sess.graph)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, run the worker threads:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"worker_threads = []\\n\",\n    \"for worker in worker_agents:\\n\",\n    \"\\n\",\n    \"    job = lambda: worker.work()\\n\",\n    \"    thread = threading.Thread(target=job)\\n\",\n    \"    thread.start()\\n\",\n    \"    worker_threads.append(thread)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"coord.join(worker_threads)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For a better understanding of A3C architecture, let's take a look at the computational graph\\n\",\n    \"of A3C in the next section. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "11. Actor Critic Methods - A2C and A3C/README.md",
    "content": "# 11. Actor Critic Methods - A2C and A3C\n* 11.1. Overview of Actor Critic Method\n* 11.2. Understanding the Actor Critic Method\n   * 11.2.1. Algorithm - Actor Critic\n* 11.3. Advantage Actor Critic\n* 11.4. Asynchronous Advantage Actor Critic\n   * 11.4.1. The Three As\n   * 11.4.2. The Architecture of A3C\n* 11.5. Mountain Car Climbing using A3C\n* 11.6. A2C Revisited"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/10.02. Swinging Up the Pendulum using DDPG -checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DDPG\\n\",\n    \"\\n\",\n    \"In this section, let's implement the DDPG algorithm to train the agent for swinging up the\\n\",\n    \"pendulum. That is, we will have a pendulum which starts swinging from a random\\n\",\n    \"position and the goal of our agent is to swing the pendulum up so it stays upright.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"First, let's import the required libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"non-resource variables are not supported in the long term\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import tensorflow as tf\\n\",\n    \"import numpy as np\\n\",\n    \"import gym\\n\",\n    \"\\n\",\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the gym environment\\n\",\n    \"\\n\",\n    \"Let's create a pendulum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"Pendulum-v0\\\").unwrapped\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape of the environment:\"\n   
]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that the pendulum is a continuous environment and thus our action space consists of\\n\",\n    \"continuous values. So, we get the bound of our action space:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.9 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the value of $\\\\tau$ which is used for soft replacement:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"tau = 0.01 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the size of our replay buffer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"replay_buffer = 10000 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 32  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Defining the DDPG class\\n\",\n    \"\\n\",\n    \"Let's define the class called DDPG where we will implement the DDPG algorithm. For a clear understanding, you can also check the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class DDPG(object):\\n\",\n    \"\\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self, state_shape, action_shape, high_action_value):\\n\",\n    \"        \\n\",\n    \"        #define the replay buffer for storing the transitions\\n\",\n    \"        self.replay_buffer = np.zeros((replay_buffer, state_shape * 2 + action_shape + 1), dtype=np.float32)\\n\",\n    \"    \\n\",\n    \"        #initialize num_transitions to 0, which implies that the number of transitions in our\\n\",\n    \"        #replay buffer is zero\\n\",\n    \"        self.num_transitions = 0\\n\",\n    \"            \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.Session()\\n\",\n    \"        \\n\",\n    \"        #we learned that in DDPG, instead of selecting the action directly, to ensure exploration,\\n\",\n    \"        #we add some noise; the DDPG paper uses the Ornstein-Uhlenbeck process, but here we simply use Gaussian noise. 
So, we first initialize the noise\\n\",\n    \"        self.noise = 3.0\\n\",\n    \"        \\n\",\n    \"        #initialize the state shape, action shape, and high action value\\n\",\n    \"        self.state_shape, self.action_shape, self.high_action_value = state_shape, action_shape, high_action_value\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the next state\\n\",\n    \"        self.next_state = tf.placeholder(tf.float32, [None, state_shape], 'next_state')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for reward\\n\",\n    \"        self.reward = tf.placeholder(tf.float32, [None, 1], 'reward')\\n\",\n    \"        \\n\",\n    \"        #with the actor variable scope\\n\",\n    \"        with tf.variable_scope('Actor'):\\n\",\n    \"\\n\",\n    \"            #define the main actor network which is parameterized by phi. Actor network takes the state\\n\",\n    \"            #as an input and returns the action to be performed in that state\\n\",\n    \"            self.actor = self.build_actor_network(self.state, scope='main', trainable=True)\\n\",\n    \"            \\n\",\n    \"            #Define the target actor network which is parameterized by phi dash. Target actor network takes\\n\",\n    \"            #the next state as an input and returns the action to be performed in that state\\n\",\n    \"            target_actor = self.build_actor_network(self.next_state, scope='target', trainable=False)\\n\",\n    \"            \\n\",\n    \"        #with the critic variable scope\\n\",\n    \"        with tf.variable_scope('Critic'):\\n\",\n    \"            \\n\",\n    \"            #define the main critic network which is parameterized by theta. 
Critic network takes the state\\n\",\n    \"            #and also the action produced by the actor in that state as an input and returns the Q value\\n\",\n    \"            critic = self.build_critic_network(self.state, self.actor, scope='main', trainable=True)\\n\",\n    \"            \\n\",\n    \"            #Define the target critic network which is parameterized by theta dash. Target critic network takes\\n\",\n    \"            #the next state and also the action produced by the target actor network in the next state as\\n\",\n    \"            #an input and returns the Q value\\n\",\n    \"            target_critic = self.build_critic_network(self.next_state, target_actor, scope='target', \\n\",\n    \"                                                      trainable=False)\\n\",\n    \"            \\n\",\n    \"        \\n\",\n    \"        #get the parameter of the main actor network, phi\\n\",\n    \"        self.main_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/main')\\n\",\n    \"        \\n\",\n    \"        #get the parameter of the target actor network, phi dash\\n\",\n    \"        self.target_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')\\n\",\n    \"    \\n\",\n    \"        #get the parameter of the main critic network, theta\\n\",\n    \"        self.main_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/main')\\n\",\n    \"        \\n\",\n    \"        #get the parameter of the target critic network, theta dash\\n\",\n    \"        self.target_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')\\n\",\n    \"\\n\",\n    \"        #perform the soft replacement and update the parameter of the target actor network and\\n\",\n    \"        #the parameter of the target critic network\\n\",\n    \"        self.soft_replacement = [\\n\",\n    \"\\n\",\n    \"            [tf.assign(phi_, tau*phi + (1-tau)*phi_), 
tf.assign(theta_, tau*theta + (1-tau)*theta_)]\\n\",\n    \"            for phi, phi_, theta, theta_ in zip(self.main_actor_params, self.target_actor_params, self.main_critic_params, self.target_critic_params)\\n\",\n    \"\\n\",\n    \"            ]\\n\",\n    \"        \\n\",\n    \"        #compute the target Q value, we learned that the target Q value can be computed as the\\n\",\n    \"        #sum of reward and discounted Q value of next state-action pair\\n\",\n    \"        y = self.reward + gamma * target_critic\\n\",\n    \"        \\n\",\n    \"        #now, let's compute the loss of the critic network. The loss of the critic network is the mean\\n\",\n    \"        #squared error between the target Q value and the predicted Q value\\n\",\n    \"        MSE = tf.losses.mean_squared_error(labels=y, predictions=critic)\\n\",\n    \"        \\n\",\n    \"        #train the critic network by minimizing the mean squared error using Adam optimizer\\n\",\n    \"        self.train_critic = tf.train.AdamOptimizer(0.01).minimize(MSE, name=\\\"adam-ink\\\", var_list = self.main_critic_params)\\n\",\n    \"        \\n\",\n    \"        \\n\",\n    \"        #We learned that the objective function of the actor is to generate an action that maximizes\\n\",\n    \"        #the Q value produced by the critic network. We can maximize the above objective by computing gradients \\n\",\n    \"        #and by performing gradient ascent. However, it is a standard convention to perform minimization rather \\n\",\n    \"        #than maximization. 
So, we can convert the above maximization objective into the minimization\\n\",\n    \"        #objective by just adding a negative sign.\\n\",\n    \"        \\n\",\n    \"        \\n\",\n    \"        #now we can minimize the actor network objective by computing gradients and by performing gradient descent\\n\",\n    \"        actor_loss = -tf.reduce_mean(critic)    \\n\",\n    \"           \\n\",\n    \"        #train the actor network by minimizing the loss using Adam optimizer\\n\",\n    \"        self.train_actor = tf.train.AdamOptimizer(0.001).minimize(actor_loss, var_list=self.main_actor_params)\\n\",\n    \"            \\n\",\n    \"        #initialize all the TensorFlow variables:\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"       \\n\",\n    \"    #let's define a function called select_action for selecting the action with the noise to ensure exploration\\n\",\n    \"    def select_action(self, state):\\n\",\n    \"        \\n\",\n    \"        #run the actor network and get the action\\n\",\n    \"        action = self.sess.run(self.actor, {self.state: state[np.newaxis, :]})[0]\\n\",\n    \"        \\n\",\n    \"        #now, we generate a normal distribution with mean as action and standard deviation as the\\n\",\n    \"        #noise and we randomly select an action from this normal distribution\\n\",\n    \"        action = np.random.normal(action, self.noise)\\n\",\n    \"        \\n\",\n    \"        #we need to make sure that our action should not fall away from the action bound. 
So, we\\n\",\n    \"        #clip the action so that it lies within the action bound and then we return the action\\n\",\n    \"        action = np.clip(action, action_bound[0], action_bound[1])\\n\",\n    \"        \\n\",\n    \"        return action\\n\",\n    \"        \\n\",\n    \"    #now, let's define the train function\\n\",\n    \"    def train(self):\\n\",\n    \"        \\n\",\n    \"        #perform the soft replacement\\n\",\n    \"        self.sess.run(self.soft_replacement)\\n\",\n    \"        \\n\",\n    \"        #randomly select indices from the replay buffer with the given batch size\\n\",\n    \"        indices = np.random.choice(replay_buffer, size=batch_size)\\n\",\n    \"        \\n\",\n    \"        #select the batch of transitions from the replay buffer with the selected indices\\n\",\n    \"        batch_transition = self.replay_buffer[indices, :]\\n\",\n    \"\\n\",\n    \"        #get the batch of states, actions, rewards, and next states\\n\",\n    \"        batch_states = batch_transition[:, :self.state_shape]\\n\",\n    \"        batch_actions = batch_transition[:, self.state_shape: self.state_shape + self.action_shape]\\n\",\n    \"        batch_rewards = batch_transition[:, -self.state_shape - 1: -self.state_shape]\\n\",\n    \"        batch_next_state = batch_transition[:, -self.state_shape:]\\n\",\n    \"\\n\",\n    \"        #train the actor network\\n\",\n    \"        self.sess.run(self.train_actor, {self.state: batch_states})\\n\",\n    \"        \\n\",\n    \"        #train the critic network\\n\",\n    \"        self.sess.run(self.train_critic, {self.state: batch_states, self.actor: batch_actions,\\n\",\n    \"                                          self.reward: batch_rewards, self.next_state: batch_next_state})\\n\",\n    \"        \\n\",\n    \"\\n\",\n    \"    #now, let's store the transitions in the replay buffer\\n\",\n    \"    def store_transition(self, state, actor, reward, next_state):\\n\",\n    \"\\n\",\n    
\"        #first stack the state, action, reward, and next state\\n\",\n    \"        trans = np.hstack((state,actor,[reward],next_state))\\n\",\n    \"        \\n\",\n    \"        #get the index\\n\",\n    \"        index = self.num_transitions % replay_buffer\\n\",\n    \"        \\n\",\n    \"        #store the transition\\n\",\n    \"        self.replay_buffer[index, :] = trans\\n\",\n    \"        \\n\",\n    \"        #update the number of transitions\\n\",\n    \"        self.num_transitions += 1\\n\",\n    \"        \\n\",\n    \"        #if the number of transitions is greater than the replay buffer then train the network\\n\",\n    \"        if self.num_transitions > replay_buffer:\\n\",\n    \"            self.noise *= 0.99995\\n\",\n    \"            self.train()\\n\",\n    \"            \\n\",\n    \"\\n\",\n    \"    def build_actor_network(self, state, scope, trainable):\\n\",\n    \"        \\n\",\n    \"        #we define a function called build_actor_network for building the actor network. The\\n\",\n    \"        #actor network takes the state and returns the action to be performed in that state\\n\",\n    \"        with tf.variable_scope(scope):\\n\",\n    \"            layer_1 = tf.layers.dense(state, 30, activation = tf.nn.tanh, name = 'layer_1', trainable = trainable)\\n\",\n    \"            actor = tf.layers.dense(layer_1, self.action_shape, activation = tf.nn.tanh, name = 'actor', trainable = trainable)     \\n\",\n    \"            return tf.multiply(actor, self.high_action_value, name = \\\"scaled_a\\\")  \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"    def build_critic_network(self, state, actor, scope, trainable):\\n\",\n    \"        \\n\",\n    \"        #we define a function called build_critic_network for building the critic network. 
The\\n\",\n    \"        #critic network takes the state and the action produced by the actor in that state and returns the Q value\\n\",\n    \"        with tf.variable_scope(scope):\\n\",\n    \"            w1_s = tf.get_variable('w1_s', [self.state_shape, 30], trainable = trainable)\\n\",\n    \"            w1_a = tf.get_variable('w1_a', [self.action_shape, 30], trainable = trainable)\\n\",\n    \"            b1 = tf.get_variable('b1', [1, 30], trainable = trainable)\\n\",\n    \"            net = tf.nn.tanh( tf.matmul(state, w1_s) + tf.matmul(actor, w1_a) + b1 )\\n\",\n    \"\\n\",\n    \"            critic = tf.layers.dense(net, 1, trainable = trainable)\\n\",\n    \"            return critic\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, let's create an object to our DDPG class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From <ipython-input-10-3711dc106c60>:175: dense (from tensorflow.python.keras.legacy_tf_layers.core) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"Use keras.layers.Dense instead.\\n\",\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/keras/legacy_tf_layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"Please use `layer.__call__` method instead.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"ddpg = DDPG(state_shape, action_shape, action_bound[1])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   
\"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 300\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps in each episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 500 \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #for every step\\n\",\n    \"    for j in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"\\n\",\n    \"        #select the action\\n\",\n    \"        action = ddpg.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the transition in the replay buffer\\n\",\n    \"        ddpg.store_transition(state, action, reward, next_state)\\n\",\n    \"      \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"    \\n\",\n    \"        #if the state is the terminal state then break\\n\",\n    \"        if done:\\n\",\n    \"            break\\n\",\n    \"    \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"\\n\",\n    \"    #print the return for every 10 
episodes\\n\",\n    \"    if i % 10 == 0:\\n\",\n    \"        print(\\\"Episode:{}, Return: {}\\\".format(i, Return))  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have learned how DDPG works and how to implement it, in the next section,\\n\",\n    \"we will learn about another interesting algorithm called twin delayed DDPG. \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/12.01. DDPG-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Deep deterministic policy gradient\\n\",\n    \"\\n\",\n    \"DDPG is an off-policy, model-free algorithm, designed for environments where the\\n\",\n    \"action space is continuous. In the previous chapter, we learned how the actor-critic\\n\",\n    \"method works. DDPG is an actor-critic method where the actor estimates the policy\\n\",\n    \"using the policy gradient, and the critic evaluates the policy produced by the actor\\n\",\n    \"using the Q function.\\n\",\n    \"\\n\",\n    \"DDPG uses the policy network as an actor and deep Q network as a critic. One\\n\",\n    \"important difference between the DDPG and actor-critic algorithms we learned in\\n\",\n    \"the previous chapter is that DDPG tries to learn a deterministic policy instead of\\n\",\n    \"a stochastic policy.\\n\",\n    \"\\n\",\n    \"First, we will get an intuitive understanding of how DDPG works and then we will\\n\",\n    \"look into the algorithm in detail.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## An overview of DDPG\\n\",\n    \"\\n\",\n    \"DDPG is an actor-critic method that takes advantage of both the policy-based\\n\",\n    \"method and the value-based method. 
It uses a deterministic policy $\\\\mu$ instead of\\n\",\n    \"a stochastic policy $\\\\pi$.\\n\",\n    \"\\n\",\n    \"We learned that a deterministic policy tells the agent to perform one particular\\n\",\n    \"action in a given state, meaning a deterministic policy maps the state to one\\n\",\n    \"particular action:\\n\",\n    \"\\n\",\n    \"$$a = \\\\mu(s)$$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Whereas a stochastic policy maps the state to the probability distribution over the\\n\",\n    \"action space:\\n\",\n    \"\\n\",\n    \"$$ a \\\\sim \\\\pi(s)$$\\n\",\n    \"\\n\",\n    \"In a deterministic policy, whenever the agent visits the state, it always performs the\\n\",\n    \"same particular action. But with a stochastic policy, instead of performing the same\\n\",\n    \"action every time the agent visits the state, the agent performs a different action\\n\",\n    \"each time based on a probability distribution over the action space.\\n\",\n    \"\\n\",\n    \"Now, we will look into an overview of the actor and critic networks in the DDPG\\n\",\n    \"algorithm.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Actor\\n\",\n    \"\\n\",\n    \"The actor in DDPG is basically the policy network. The goal of the actor is to learn\\n\",\n    \"the mapping between the state and action. That is, the role of the actor is to learn the\\n\",\n    \"optimal policy that gives the maximum return. So, the actor uses the policy gradient\\n\",\n    \"method to learn the optimal policy.\\n\",\n    \"\\n\",\n    \"## Critic\\n\",\n    \"The critic is basically the value network. The goal of the critic is to evaluate the action\\n\",\n    \"produced by the actor network. How does the critic network evaluate the action\\n\",\n    \"produced by the actor network? Let's suppose we have a Q function; can we evaluate\\n\",\n    \"an action using the Q function? Yes! 
First, let's take a little detour and recap the use\\n\",\n    \"of the Q function.\\n\",\n    \"\\n\",\n    \"We know that the Q function gives the expected return that an agent would obtain\\n\",\n    \"starting from state s and performing an action a following a particular policy. The\\n\",\n    \"expected return produced by the Q function is often called the Q value. Thus, given a\\n\",\n    \"state and action, we obtain a Q value:\\n\",\n    \"\\n\",\n    \"* If the Q value is high, then we can say that the action performed in that state\\n\",\n    \"is a good action. That is, if the Q value is high, meaning the expected return\\n\",\n    \"is high when we perform an action a in state s, we can say that the action a is\\n\",\n    \"a good action.\\n\",\n    \"* If the Q value is low, then we can say that the action performed in that state\\n\",\n    \"is not a good action. That is, if the Q value is low, meaning the expected\\n\",\n    \"return is low when we perform an action a in state s, we can say that the\\n\",\n    \"action a is not a good action.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Okay, now how can the critic network evaluate an action produced by the actor\\n\",\n    \"network based on the Q function (Q value)? Let's suppose the actor network\\n\",\n    \"performs a down action in state A. So, now, the critic computes the Q value of\\n\",\n    \"moving down in state A. If the Q value is high, then the critic network gives feedback\\n\",\n    \"to the actor network that the action down is a good action in state A. If the Q value is\\n\",\n    \"low, then the critic network gives feedback to the actor network that the down action\\n\",\n    \"is not a good action in state A, and so the actor network tries to perform a different\\n\",\n    \"action in state A.\\n\",\n    \"\\n\",\n    \"Thus, with the Q function, the critic network can evaluate the action performed by\\n\",\n    \"the actor network. 
But wait, how can the critic network learn the Q function? Because\\n\",\n    \"only if it knows the Q function can it evaluate the action performed by the actor. So,\\n\",\n    \"how does the critic network learn the Q function? Here is where we use the deep\\n\",\n    \"Q network (DQN). We learned that with the DQN, we can use the neural network\\n\",\n    \"to approximate the Q function. So, now, we use the DQN as the critic network to\\n\",\n    \"compute the Q function.\\n\",\n    \"\\n\",\n    \"Thus, in a nutshell, DDPG is an actor-critic method and so it takes advantage of\\n\",\n    \"policy-based and value-based methods. DDPG consists of an actor, which is a policy\\n\",\n    \"network that uses the policy gradient method to learn the optimal policy, and a\\n\",\n    \"critic, which is a deep Q network that evaluates the action produced by the actor.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"__Now that we have a basic understanding of how the DDPG algorithm works, let's\\n\",\n    \"go into further detail in the next section. We will understand how exactly the actor and critic networks\\n\",\n    \"work by looking at them separately with detailed math.__\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/12.02. Swinging Up the Pendulum using DDPG -checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DDPG\\n\",\n    \"\\n\",\n    \"In this section, let's implement the DDPG algorithm to train the agent for swinging up the\\n\",\n    \"pendulum. That is, we will have a pendulum which starts swinging from a random\\n\",\n    \"position and the goal of our agent is to swing the pendulum up so it stays upright.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"First, let's import the required libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"import gym\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For a clear understanding of how the DDPG method works, we use\\n\",\n    \"TensorFlow in the non-eager mode by disabling TensorFlow 2 behavior.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the gym environment\\n\",\n    \"\\n\",\n    \"Let's create a pendulum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = 
gym.make(\\\"Pendulum-v0\\\").unwrapped\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that the pendulum is a continuous environment and thus our action space consists of\\n\",\n    \"continuous values. So, we get the bound of our action space:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.9 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the value of $\\\\tau$ which is used for soft replacement:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"tau = 0.01 \"\n   ]\n  },\n  
{\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the size of our replay buffer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"replay_buffer = 10000 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 32  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Defining the DDPG class\\n\",\n    \"\\n\",\n    \"Let's define the class called DDPG where we will implement the DDPG algorithm. For a clear understanding, you can also check the detailed explanation of code on the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class DDPG(object):\\n\",\n    \"\\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self, state_shape, action_shape, high_action_value,):\\n\",\n    \"        \\n\",\n    \"        #define the replay buffer for storing the transitions\\n\",\n    \"        self.replay_buffer = np.zeros((replay_buffer, state_shape * 2 + action_shape + 1), dtype=np.float32)\\n\",\n    \"    \\n\",\n    \"        #initialize the num_transitionsto 0 which implies that the number of transitions in our\\n\",\n    \"        #replay buffer is zero\\n\",\n    \"        self.num_transitions = 0\\n\",\n    \"            \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.Session()\\n\",\n    \"        \\n\",\n    \"        #we learned that in DDPG, instead of selecting the action directly, to ensure exploration,\\n\",\n    \"        #we add some noise using the Ornstein-Uhlenbeck process. 
So, we first initialize the noise\\n\",\n    \"        self.noise = 3.0\\n\",\n    \"        \\n\",\n    \"        #initialize the state shape, action shape, and high action value\\n\",\n    \"        self.state_shape, self.action_shape, self.high_action_value = state_shape, action_shape, high_action_value\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the next state\\n\",\n    \"        self.next_state = tf.placeholder(tf.float32, [None, state_shape], 'next_state')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for reward\\n\",\n    \"        self.reward = tf.placeholder(tf.float32, [None, 1], 'reward')\\n\",\n    \"        \\n\",\n    \"        #with the actor variable scope\\n\",\n    \"        with tf.variable_scope('Actor'):\\n\",\n    \"\\n\",\n    \"            #define the main actor network which is parameterized by phi. Actor network takes the state\\n\",\n    \"            #as an input and returns the action to be performed in that state\\n\",\n    \"            self.actor = self.build_actor_network(self.state, scope='main', trainable=True)\\n\",\n    \"            \\n\",\n    \"            #Define the target actor network which is parameterized by phi dash. Target actor network takes\\n\",\n    \"            #the next state as an input and returns the action to be performed in that state\\n\",\n    \"            target_actor = self.build_actor_network(self.next_state, scope='target', trainable=False)\\n\",\n    \"            \\n\",\n    \"        #with the critic variable scope\\n\",\n    \"        with tf.variable_scope('Critic'):\\n\",\n    \"            \\n\",\n    \"            #define the main critic network which is parameterized by theta. 
Critic network takes the state\\n\",\n    \"            #and also the action produced by the actor in that state as an input and returns the Q value\\n\",\n    \"            critic = self.build_critic_network(self.state, self.actor, scope='main', trainable=True)\\n\",\n    \"            \\n\",\n    \"            #Define the target critic network which is parameterized by theta dash. Target critic network takes\\n\",\n    \"            #the next state and also the action produced by the target actor network in the next state as\\n\",\n    \"            #an input and returns the Q value\\n\",\n    \"            target_critic = self.build_critic_network(self.next_state, target_actor, scope='target', \\n\",\n    \"                                                      trainable=False)\\n\",\n    \"            \\n\",\n    \"        \\n\",\n    \"        #get the parameters of the main actor network, phi\\n\",\n    \"        self.main_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/main')\\n\",\n    \"        \\n\",\n    \"        #get the parameters of the target actor network, phi dash\\n\",\n    \"        self.target_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')\\n\",\n    \"    \\n\",\n    \"        #get the parameters of the main critic network, theta\\n\",\n    \"        self.main_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/main')\\n\",\n    \"        \\n\",\n    \"        #get the parameters of the target critic network, theta dash\\n\",\n    \"        self.target_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')\\n\",\n    \"\\n\",\n    \"        #perform the soft replacement and update the parameter of the target actor network and\\n\",\n    \"        #the parameter of the target critic network\\n\",\n    \"        self.soft_replacement = [\\n\",\n    \"\\n\",\n    \"            [tf.assign(phi_, tau*phi + (1-tau)*phi_), 
tf.assign(theta_, tau*theta + (1-tau)*theta_)]\\n\",\n    \"            for phi, phi_, theta, theta_ in zip(self.main_actor_params, self.target_actor_params, self.main_critic_params, self.target_critic_params)\\n\",\n    \"\\n\",\n    \"            ]\\n\",\n    \"        \\n\",\n    \"        #compute the target Q value, we learned that the target Q value can be computed as the\\n\",\n    \"        #sum of reward and discounted Q value of next state-action pair\\n\",\n    \"        y = self.reward + gamma * target_critic\\n\",\n    \"        \\n\",\n    \"        #now, let's compute the loss of the critic network. The loss of the critic network is the mean\\n\",\n    \"        #squared error between the target Q value and the predicted Q value\\n\",\n    \"        MSE = tf.losses.mean_squared_error(labels=y, predictions=critic)\\n\",\n    \"        \\n\",\n    \"        #train the critic network by minimizing the mean squared error using Adam optimizer\\n\",\n    \"        self.train_critic = tf.train.AdamOptimizer(0.01).minimize(MSE, name=\\\"adam-ink\\\", var_list = self.main_critic_params)\\n\",\n    \"        \\n\",\n    \"        \\n\",\n    \"        #We learned that the objective function of the actor is to generate an action that maximizes\\n\",\n    \"        #the Q value produced by the critic network. We can maximize the above objective by computing gradients \\n\",\n    \"        #and by performing gradient ascent. However, it is a standard convention to perform minimization rather \\n\",\n    \"        #than maximization. 
So, we can convert the above maximization objective into the minimization\\n\",\n    \"        #objective by just adding a negative sign.\\n\",\n    \"        \\n\",\n    \"        \\n\",\n    \"        #now we can minimize the actor network objective by computing gradients and by performing gradient descent\\n\",\n    \"        actor_loss = -tf.reduce_mean(critic)    \\n\",\n    \"           \\n\",\n    \"        #train the actor network by minimizing the loss using Adam optimizer\\n\",\n    \"        self.train_actor = tf.train.AdamOptimizer(0.001).minimize(actor_loss, var_list=self.main_actor_params)\\n\",\n    \"            \\n\",\n    \"        #initialize all the TensorFlow variables:\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"       \\n\",\n    \"    #let's define a function called select_action for selecting the action with the noise to ensure exploration\\n\",\n    \"    def select_action(self, state):\\n\",\n    \"        \\n\",\n    \"        #run the actor network and get the action\\n\",\n    \"        action = self.sess.run(self.actor, {self.state: state[np.newaxis, :]})[0]\\n\",\n    \"        \\n\",\n    \"        #now, we generate a normal distribution with mean as action and standard deviation as the\\n\",\n    \"        #noise and we randomly select an action from this normal distribution\\n\",\n    \"        action = np.random.normal(action, self.noise)\\n\",\n    \"        \\n\",\n    \"        #we need to make sure that our action should not fall away from the action bound. 
So, we\\n\",\n    \"        #clip the action so that they lie within the action bound and then we return the action\\n\",\n    \"        action =  np.clip(action, action_bound[0],action_bound[1])\\n\",\n    \"        \\n\",\n    \"        return action\\n\",\n    \"        \\n\",\n    \"    #now, let's define the train function\\n\",\n    \"    def train(self):\\n\",\n    \"        \\n\",\n    \"        #perform the soft replacement\\n\",\n    \"        self.sess.run(self.soft_replacement)\\n\",\n    \"        \\n\",\n    \"        #randomly select indices from the replay buffer with the given batch size\\n\",\n    \"        indices = np.random.choice(replay_buffer, size=batch_size)\\n\",\n    \"        \\n\",\n    \"        #select the batch of transitions from the replay buffer with the selected indices\\n\",\n    \"        batch_transition = self.replay_buffer[indices, :]\\n\",\n    \"\\n\",\n    \"        #get the batch of states, actions, rewards, and next states\\n\",\n    \"        batch_states = batch_transition[:, :self.state_shape]\\n\",\n    \"        batch_actions = batch_transition[:, self.state_shape: self.state_shape + self.action_shape]\\n\",\n    \"        batch_rewards = batch_transition[:, -self.state_shape - 1: -self.state_shape]\\n\",\n    \"        batch_next_state = batch_transition[:, -self.state_shape:]\\n\",\n    \"\\n\",\n    \"        #train the actor network\\n\",\n    \"        self.sess.run(self.train_actor, {self.state: batch_states})\\n\",\n    \"        \\n\",\n    \"        #train the critic network\\n\",\n    \"        self.sess.run(self.train_critic, {self.state: batch_states, self.actor: batch_actions,\\n\",\n    \"                                          self.reward: batch_rewards, self.next_state: batch_next_state})\\n\",\n    \"        \\n\",\n    \"\\n\",\n    \"    #now, let's store the transitions in the replay buffer\\n\",\n    \"    def store_transition(self, state, actor, reward, next_state):\\n\",\n    \"\\n\",\n    
\"        #first stack the state, action, reward, and next state\\n\",\n    \"        trans = np.hstack((state,actor,[reward],next_state))\\n\",\n    \"        \\n\",\n    \"        #get the index\\n\",\n    \"        index = self.num_transitions % replay_buffer\\n\",\n    \"        \\n\",\n    \"        #store the transition\\n\",\n    \"        self.replay_buffer[index, :] = trans\\n\",\n    \"        \\n\",\n    \"        #update the number of transitions\\n\",\n    \"        self.num_transitions += 1\\n\",\n    \"        \\n\",\n    \"        #if the number of transitions is greater than the replay buffer then train the network\\n\",\n    \"        if self.num_transitions > replay_buffer:\\n\",\n    \"            self.noise *= 0.99995\\n\",\n    \"            self.train()\\n\",\n    \"            \\n\",\n    \"\\n\",\n    \"    def build_actor_network(self, state, scope, trainable):\\n\",\n    \"        \\n\",\n    \"        #we define a function called build_actor_network for building the actor network. The\\n\",\n    \"        #actor network takes the state and returns the action to be performed in that state\\n\",\n    \"        with tf.variable_scope(scope):\\n\",\n    \"            layer_1 = tf.layers.dense(state, 30, activation = tf.nn.tanh, name = 'layer_1', trainable = trainable)\\n\",\n    \"            actor = tf.layers.dense(layer_1, self.action_shape, activation = tf.nn.tanh, name = 'actor', trainable = trainable)     \\n\",\n    \"            return tf.multiply(actor, self.high_action_value, name = \\\"scaled_a\\\")  \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"    def build_critic_network(self, state, actor, scope, trainable):\\n\",\n    \"        \\n\",\n    \"        #we define a function called build_critic_network for building the critic network. 
The\\n\",\n    \"        #critic network takes the state and the action produced by the actor in that state and returns the Q value\\n\",\n    \"        with tf.variable_scope(scope):\\n\",\n    \"            w1_s = tf.get_variable('w1_s', [self.state_shape, 30], trainable = trainable)\\n\",\n    \"            w1_a = tf.get_variable('w1_a', [self.action_shape, 30], trainable = trainable)\\n\",\n    \"            b1 = tf.get_variable('b1', [1, 30], trainable = trainable)\\n\",\n    \"            net = tf.nn.tanh( tf.matmul(state, w1_s) + tf.matmul(actor, w1_a) + b1 )\\n\",\n    \"\\n\",\n    \"            critic = tf.layers.dense(net, 1, trainable = trainable)\\n\",\n    \"            return critic\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, let's create an object to our DDPG class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ddpg = DDPG(state_shape, action_shape, action_bound[1])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 300\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps in each episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 500 \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    
#initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #for every step\\n\",\n    \"    for j in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"\\n\",\n    \"        #select the action\\n\",\n    \"        action = ddpg.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the transition in the replay buffer\\n\",\n    \"        ddpg.store_transition(state, action, reward, next_state)\\n\",\n    \"      \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"    \\n\",\n    \"        #if the state is the terminal state then break\\n\",\n    \"        if done:\\n\",\n    \"            break\\n\",\n    \"    \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"\\n\",\n    \"    #print the return for every 10 episodes\\n\",\n    \"    if i % 10 == 0:\\n\",\n    \"        print(\\\"Episode:{}, Return: {}\\\".format(i, Return))  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have learned how DDPG works and how to implement it, in the next section,\\n\",\n    \"we will learn about another interesting algorithm called twin delayed DDPG. 
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
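The soft replacement rule and the flat replay-buffer row layout used in the notebook above can be illustrated without TensorFlow. Below is a minimal NumPy sketch (not the book's code): `soft_update` mirrors the `tf.assign(phi_, tau*phi + (1-tau)*phi_)` update, and `unpack_transition` mirrors the slicing done in `train()`; both function names are hypothetical helpers introduced here for illustration.

```python
import numpy as np

# Soft target update used by DDPG: theta' <- tau*theta + (1 - tau)*theta'
# (a NumPy stand-in for the notebook's tf.assign-based soft_replacement)
def soft_update(main_params, target_params, tau=0.01):
    return [tau * m + (1.0 - tau) * t for m, t in zip(main_params, target_params)]

# Replay-buffer row layout used in the notebook:
# [state | action | reward | next_state]
def unpack_transition(row, state_shape, action_shape):
    state = row[:state_shape]
    action = row[state_shape:state_shape + action_shape]
    reward = row[-state_shape - 1:-state_shape]
    next_state = row[-state_shape:]
    return state, action, reward, next_state
```

With Pendulum's shapes (`state_shape=3`, `action_shape=1`), a row built with `np.hstack((state, action, [reward], next_state))` round-trips through `unpack_transition`, which is why the reward slice in `train()` is `[:, -state_shape - 1: -state_shape]`.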
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/12.03. Twin delayed DDPG-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Twin delayed DDPG\\n\",\n    \"\\n\",\n    \"Now, we will look into another interesting actor-critic algorithm, known as TD3.\\n\",\n    \"TD3 is an improvement (and basically a successor) to the DDPG algorithm we just\\n\",\n    \"covered.\\n\",\n    \"\\n\",\n    \"In the previous section, we learned how DDPG uses a deterministic policy to\\n\",\n    \"work on the continuous action space. DDPG has several advantages and has been\\n\",\n    \"successfully used in a variety of continuous action space environments.\\n\",\n    \"\\n\",\n    \"We understood that DDPG is an actor-critic method where the actor is a policy\\n\",\n    \"network that finds the optimal policy, while the critic evaluates the policy\\n\",\n    \"produced by the actor by estimating the Q function using a DQN.\\n\",\n    \"One of the problems with DDPG is that the critic overestimates the target Q value.\\n\",\n    \"This overestimation causes several issues. We learned that the policy is improved\\n\",\n    \"based on the Q value given by the critic, but when the Q value has an approximation\\n\",\n    \"error, it causes stability issues in our policy, and the policy may converge to a local\\n\",\n    \"optimum.\\n\",\n    \"\\n\",\n    \"Thus, to combat this, TD3 proposes three important features, which are as follows:\\n\",\n    \"1. Clipped double Q learning\\n\",\n    \"2. Delayed policy updates\\n\",\n    \"3. Target policy smoothing\\n\",\n    \"\\n\",\n    \"First, we will understand how TD3 works intuitively, and then we will look at the\\n\",\n    \"algorithm in detail.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Key features of TD3\\n\",\n    \"\\n\",\n    \"TD3 is essentially the same as DDPG, except that it proposes three important features\\n\",\n    \"to mitigate the problems in DDPG. 
In this section, let's first get a basic understanding\\n\",\n    \"of the key features of TD3. The three key features of TD3 are:\\n\",\n    \"\\n\",\n    \"__Clipped double Q learning:__ Instead of using one critic network, we use two\\n\",\n    \"main critic networks to compute the Q value and also use two target critic\\n\",\n    \"networks to compute the target value.\\n\",\n    \"We compute two target Q values using two target critic networks and use the\\n\",\n    \"minimum value of these two while computing the loss. This helps to prevent\\n\",\n    \"overestimation of the target Q value. We will learn more about this in detail\\n\",\n    \"in the next section.\\n\",\n    \"\\n\",\n    \"__Delayed policy updates:__ In DDPG, we learned that we update the parameter\\n\",\n    \"of both the actor (policy network) and critic (DQN) network at every step\\n\",\n    \"of the episode. Unlike DDPG, here we delay updating the parameter of the\\n\",\n    \"actor network.\\n\",\n    \"That is, the critic network parameter is updated at every step of the episode,\\n\",\n    \"but the actor network (policy network) parameter is delayed and updated\\n\",\n    \"only after every two steps of the episode.\\n\",\n    \"\\n\",\n    \"__Target policy smoothing:__ The DDPG method produces different target\\n\",\n    \"values even for the same action. 
Hence, the variance of the target value will\\n\",\n    \"be high even for the same action, so we reduce this variance by adding some\\n\",\n    \"noise to the target action.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"__Now that we have a basic idea of the key features of TD3, in the next section we will get into\\n\",\n    \"more detail and learn how exactly these three key features work and how\\n\",\n    \"they solve the problems associated with DDPG.__\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
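The clipped double Q target and target policy smoothing described in the notebook above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the book: `td3_target` and `smooth_target_action` are hypothetical names, and the noise parameters (`noise_std=0.2`, `noise_clip=0.5`) are assumed defaults from the TD3 paper.

```python
import numpy as np

# Clipped double Q learning: the target uses the MINIMUM of the two
# target critics' estimates, which counters overestimation.
def td3_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    q_min = np.minimum(q1_next, q2_next)
    return reward + gamma * (1.0 - float(done)) * q_min

# Target policy smoothing: add clipped Gaussian noise to the target
# action so the critic's target varies smoothly around each action.
def smooth_target_action(action, noise_std=0.2, noise_clip=0.5,
                         action_bound=2.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    noise = np.clip(rng.normal(0.0, noise_std, size=np.shape(action)),
                    -noise_clip, noise_clip)
    return np.clip(action + noise, -action_bound, action_bound)
```

The third feature, delayed policy updates, is simply a counter in the training loop: update the critics every step, but update the actor (and the target networks) only every second critic update.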
  {
    "path": "12. Learning DDPG, TD3 and SAC/.ipynb_checkpoints/Swinging up the pendulum using DDPG -checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DDPG\\n\",\n    \"\\n\",\n    \"In this section, let's implement the DDPG algorithm to train the agent for swinging up the\\n\",\n    \"pendulum. That is, we will have a pendulum which starts swinging from a random\\n\",\n    \"position and the goal of our agent is to swing the pendulum up so it stays upright.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"First, let's import the required libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import tensorflow as tf\\n\",\n    \"import numpy as np\\n\",\n    \"import gym\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the gym environment\\n\",\n    \"\\n\",\n    \"Let's create a pendulum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"Pendulum-v0\\\").unwrapped\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": 
[\n    \"Note that the pendulum is a continuous environment and thus our action space consists of\\n\",\n    \"continuous values. So, we get the bound of our action space:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.9 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the value of $\\\\tau$ which is used for soft replacement:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"tau = 0.01 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the size of our replay buffer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"replay_buffer = 10000 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 32  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Defining the DDPG class\\n\",\n    \"\\n\",\n    \"Let's define the class called DDPG where we will implement 
the DDPG algorithm. For a clear understanding, you can also check the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class DDPG(object):\\n\",\n    \"\\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self, state_shape, action_shape, high_action_value):\\n\",\n    \"        \\n\",\n    \"        #define the replay buffer for storing the transitions\\n\",\n    \"        self.replay_buffer = np.zeros((replay_buffer, state_shape * 2 + action_shape + 1), dtype=np.float32)\\n\",\n    \"    \\n\",\n    \"        #initialize num_transitions to 0, which implies that the number of transitions in our\\n\",\n    \"        #replay buffer is zero\\n\",\n    \"        self.num_transitions = 0\\n\",\n    \"            \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.Session()\\n\",\n    \"        \\n\",\n    \"        #we learned that in DDPG, instead of selecting the action directly, to ensure exploration,\\n\",\n    \"        #we add some noise using the Ornstein-Uhlenbeck process. 
So, we first initialize the noise\\n\",\n    \"        self.noise = 3.0\\n\",\n    \"        \\n\",\n    \"        #initialize the state shape, action shape, and high action value\\n\",\n    \"        self.state_shape, self.action_shape, self.high_action_value = state_shape, action_shape, high_action_value\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the next state\\n\",\n    \"        self.next_state = tf.placeholder(tf.float32, [None, state_shape], 'next_state')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for reward\\n\",\n    \"        self.reward = tf.placeholder(tf.float32, [None, 1], 'reward')\\n\",\n    \"        \\n\",\n    \"        #with the actor variable scope\\n\",\n    \"        with tf.variable_scope('Actor'):\\n\",\n    \"\\n\",\n    \"            #define the main actor network which is parameterized by phi. Actor network takes the state\\n\",\n    \"            #as an input and returns the action to be performed in that state\\n\",\n    \"            self.actor = self.build_actor_network(self.state, scope='main', trainable=True)\\n\",\n    \"            \\n\",\n    \"            #Define the target actor network which is parameterized by phi dash. Target actor network takes\\n\",\n    \"            #the next state as an input and returns the action to be performed in that state\\n\",\n    \"            target_actor = self.build_actor_network(self.next_state, scope='target', trainable=False)\\n\",\n    \"            \\n\",\n    \"        #with the critic variable scope\\n\",\n    \"        with tf.variable_scope('Critic'):\\n\",\n    \"            \\n\",\n    \"            #define the main critic network which is parameterized by theta. 
Critic network takes the state\\n\",\n    \"            #and also the action produced by the actor in that state as an input and returns the Q value\\n\",\n    \"            critic = self.build_critic_network(self.state, self.actor, scope='main', trainable=True)\\n\",\n    \"            \\n\",\n    \"            #Define the target critic network which is parameterized by theta dash. Target critic network takes\\n\",\n    \"            #the next state and also the action produced by the target actor network in the next state as\\n\",\n    \"            #an input and returns the Q value\\n\",\n    \"            target_critic = self.build_critic_network(self.next_state, target_actor, scope='target', \\n\",\n    \"                                                      trainable=False)\\n\",\n    \"            \\n\",\n    \"        \\n\",\n    \"        #get the parameters of the main actor network, phi\\n\",\n    \"        self.main_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/main')\\n\",\n    \"        \\n\",\n    \"        #get the parameters of the target actor network, phi dash\\n\",\n    \"        self.target_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')\\n\",\n    \"    \\n\",\n    \"        #get the parameters of the main critic network, theta\\n\",\n    \"        self.main_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/main')\\n\",\n    \"        \\n\",\n    \"        #get the parameters of the target critic network, theta dash\\n\",\n    \"        self.target_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')\\n\",\n    \"\\n\",\n    \"        #perform the soft replacement and update the parameter of the target actor network and\\n\",\n    \"        #the parameter of the target critic network\\n\",\n    \"        self.soft_replacement = [\\n\",\n    \"\\n\",\n    \"            [tf.assign(phi_, tau*phi + (1-tau)*phi_), 
tf.assign(theta_, tau*theta + (1-tau)*theta_)]\\n\",\n    \"            for phi, phi_, theta, theta_ in zip(self.main_actor_params, self.target_actor_params, self.main_critic_params, self.target_critic_params)\\n\",\n    \"\\n\",\n    \"            ]\\n\",\n    \"        \\n\",\n    \"        #compute the target Q value, we learned that the target Q value can be computed as the\\n\",\n    \"        #sum of reward and discounted Q value of next state-action pair\\n\",\n    \"        y = self.reward + gamma * target_critic\\n\",\n    \"        \\n\",\n    \"        #now, let's compute the loss of the critic network. The loss of the critic network is the mean\\n\",\n    \"        #squared error between the target Q value and the predicted Q value\\n\",\n    \"        MSE = tf.losses.mean_squared_error(labels=y, predictions=critic)\\n\",\n    \"        \\n\",\n    \"        #train the critic network by minimizing the mean squared error using Adam optimizer\\n\",\n    \"        self.train_critic = tf.train.AdamOptimizer(0.01).minimize(MSE, name=\\\"adam-ink\\\", var_list = self.main_critic_params)\\n\",\n    \"        \\n\",\n    \"        \\n\",\n    \"        #We learned that the objective function of the actor is to generate an action that maximizes\\n\",\n    \"        #the Q value produced by the critic network. We can maximize the above objective by computing gradients \\n\",\n    \"        #and by performing gradient ascent. However, it is a standard convention to perform minimization rather \\n\",\n    \"        #than maximization. 
So, we can convert the above maximization objective into the minimization\\n\",\n    \"        #objective by just adding a negative sign.\\n\",\n    \"        \\n\",\n    \"        \\n\",\n    \"        #now we can minimize the actor network objective by computing gradients and by performing gradient descent\\n\",\n    \"        actor_loss = -tf.reduce_mean(critic)    \\n\",\n    \"           \\n\",\n    \"        #train the actor network by minimizing the loss using Adam optimizer\\n\",\n    \"        self.train_actor = tf.train.AdamOptimizer(0.001).minimize(actor_loss, var_list=self.main_actor_params)\\n\",\n    \"            \\n\",\n    \"        #initialize all the TensorFlow variables:\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"       \\n\",\n    \"    #let's define a function called select_action for selecting the action with the noise to ensure exploration\\n\",\n    \"    def select_action(self, state):\\n\",\n    \"        \\n\",\n    \"        #run the actor network and get the action\\n\",\n    \"        action = self.sess.run(self.actor, {self.state: state[np.newaxis, :]})[0]\\n\",\n    \"        \\n\",\n    \"        #now, we generate a normal distribution with mean as action and standard deviation as the\\n\",\n    \"        #noise and we randomly select an action from this normal distribution\\n\",\n    \"        action = np.random.normal(action, self.noise)\\n\",\n    \"        \\n\",\n    \"        #we need to make sure that our action should not fall away from the action bound. 
So, we\\n\",\n    \"        #clip the action so that they lie within the action bound and then we return the action\\n\",\n    \"        action =  np.clip(action, action_bound[0],action_bound[1])\\n\",\n    \"        \\n\",\n    \"        return action\\n\",\n    \"        \\n\",\n    \"    #now, let's define the train function\\n\",\n    \"    def train(self):\\n\",\n    \"        \\n\",\n    \"        #perform the soft replacement\\n\",\n    \"        self.sess.run(self.soft_replacement)\\n\",\n    \"        \\n\",\n    \"        #randomly select indices from the replay buffer with the given batch size\\n\",\n    \"        indices = np.random.choice(replay_buffer, size=batch_size)\\n\",\n    \"        \\n\",\n    \"        #select the batch of transitions from the replay buffer with the selected indices\\n\",\n    \"        batch_transition = self.replay_buffer[indices, :]\\n\",\n    \"\\n\",\n    \"        #get the batch of states, actions, rewards, and next states\\n\",\n    \"        batch_states = batch_transition[:, :self.state_shape]\\n\",\n    \"        batch_actions = batch_transition[:, self.state_shape: self.state_shape + self.action_shape]\\n\",\n    \"        batch_rewards = batch_transition[:, -self.state_shape - 1: -self.state_shape]\\n\",\n    \"        batch_next_state = batch_transition[:, -self.state_shape:]\\n\",\n    \"\\n\",\n    \"        #train the actor network\\n\",\n    \"        self.sess.run(self.train_actor, {self.state: batch_states})\\n\",\n    \"        \\n\",\n    \"        #train the critic network\\n\",\n    \"        self.sess.run(self.train_critic, {self.state: batch_states, self.actor: batch_actions,\\n\",\n    \"                                          self.reward: batch_rewards, self.next_state: batch_next_state})\\n\",\n    \"        \\n\",\n    \"\\n\",\n    \"    #now, let's store the transitions in the replay buffer\\n\",\n    \"    def store_transition(self, state, actor, reward, next_state):\\n\",\n    \"\\n\",\n    
\"        #first stack the state, action, reward, and next state\\n\",\n    \"        trans = np.hstack((state,actor,[reward],next_state))\\n\",\n    \"        \\n\",\n    \"        #get the index\\n\",\n    \"        index = self.num_transitions % replay_buffer\\n\",\n    \"        \\n\",\n    \"        #store the transition\\n\",\n    \"        self.replay_buffer[index, :] = trans\\n\",\n    \"        \\n\",\n    \"        #update the number of transitions\\n\",\n    \"        self.num_transitions += 1\\n\",\n    \"        \\n\",\n    \"        #if the number of transitions is greater than the replay buffer then train the network\\n\",\n    \"        if self.num_transitions > replay_buffer:\\n\",\n    \"            self.noise *= 0.99995\\n\",\n    \"            self.train()\\n\",\n    \"            \\n\",\n    \"\\n\",\n    \"    def build_actor_network(self, state, scope, trainable):\\n\",\n    \"        \\n\",\n    \"        #we define a function called build_actor_network for building the actor network. The\\n\",\n    \"        #actor network takes the state and returns the action to be performed in that state\\n\",\n    \"        with tf.variable_scope(scope):\\n\",\n    \"            layer_1 = tf.layers.dense(state, 30, activation = tf.nn.tanh, name = 'layer_1', trainable = trainable)\\n\",\n    \"            actor = tf.layers.dense(layer_1, self.action_shape, activation = tf.nn.tanh, name = 'actor', trainable = trainable)     \\n\",\n    \"            return tf.multiply(actor, self.high_action_value, name = \\\"scaled_a\\\")  \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"    def build_critic_network(self, state, actor, scope, trainable):\\n\",\n    \"        \\n\",\n    \"        #we define a function called build_critic_network for building the critic network. 
The\\n\",\n    \"        #critic network takes the state and the action produced by the actor in that state and returns the Q value\\n\",\n    \"        with tf.variable_scope(scope):\\n\",\n    \"            w1_s = tf.get_variable('w1_s', [self.state_shape, 30], trainable = trainable)\\n\",\n    \"            w1_a = tf.get_variable('w1_a', [self.action_shape, 30], trainable = trainable)\\n\",\n    \"            b1 = tf.get_variable('b1', [1, 30], trainable = trainable)\\n\",\n    \"            net = tf.nn.tanh( tf.matmul(state, w1_s) + tf.matmul(actor, w1_a) + b1 )\\n\",\n    \"\\n\",\n    \"            critic = tf.layers.dense(net, 1, trainable = trainable)\\n\",\n    \"            return critic\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, let's create an object to our DDPG class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ddpg = DDPG(state_shape, action_shape, action_bound[1])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 300\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps in each episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 500 \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    
#initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #for every step\\n\",\n    \"    for j in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"\\n\",\n    \"        #select the action\\n\",\n    \"        action = ddpg.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the transition in the replay buffer\\n\",\n    \"        ddpg.store_transition(state, action, reward, next_state)\\n\",\n    \"      \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"    \\n\",\n    \"        #if the state is the terminal state then break\\n\",\n    \"        if done:\\n\",\n    \"            break\\n\",\n    \"    \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"\\n\",\n    \"    #print the return for every 10 episodes\\n\",\n    \"    if i %10 ==0:\\n\",\n    \"         print(\\\"Episode:{}, Return: {}\\\".format(i,Return))  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we learned how DDPG works and how to implement DDPG, in the next section,\\n\",\n    \"we will learn another interesting algorithm called twin delayed DDPG. 
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/12.05. Swinging Up the Pendulum using DDPG .ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Swinging up the pendulum using DDPG\\n\",\n    \"\\n\",\n    \"In this section, let's implement the DDPG algorithm to train the agent for swinging up the\\n\",\n    \"pendulum. That is, we will have a pendulum which starts swinging from a random\\n\",\n    \"position and the goal of our agent is to swing the pendulum up so it stays upright.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"First, let's import the required libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"import gym\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For a clear understanding of how the DDPG method works, we use\\n\",\n    \"TensorFlow in the non-eager mode by disabling TensorFlow 2 behavior.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the gym environment\\n\",\n    \"\\n\",\n    \"Let's create a pendulum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = 
gym.make(\\\"Pendulum-v0\\\").unwrapped\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that the pendulum is a continuous environment and thus our action space consists of\\n\",\n    \"continuous values. So, we get the bound of our action space:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.9 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the value of $\\\\tau$ which is used for soft replacement:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"tau = 0.01 \"\n   ]\n  },\n  
{\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the size of our replay buffer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"replay_buffer = 10000 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 32  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Defining the DDPG class\\n\",\n    \"\\n\",\n    \"Let's define the class called DDPG where we will implement the DDPG algorithm. For a clear understanding, you can also check the detailed explanation of code on the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class DDPG(object):\\n\",\n    \"\\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self, state_shape, action_shape, high_action_value,):\\n\",\n    \"        \\n\",\n    \"        #define the replay buffer for storing the transitions\\n\",\n    \"        self.replay_buffer = np.zeros((replay_buffer, state_shape * 2 + action_shape + 1), dtype=np.float32)\\n\",\n    \"    \\n\",\n    \"        #initialize the num_transitionsto 0 which implies that the number of transitions in our\\n\",\n    \"        #replay buffer is zero\\n\",\n    \"        self.num_transitions = 0\\n\",\n    \"            \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.Session()\\n\",\n    \"        \\n\",\n    \"        #we learned that in DDPG, instead of selecting the action directly, to ensure exploration,\\n\",\n    \"        #we add some noise using the Ornstein-Uhlenbeck process. 
So, we first initialize the noise\\n\",\n    \"        self.noise = 3.0\\n\",\n    \"        \\n\",\n    \"        #initialize the state shape, action shape, and high action value\\n\",\n    \"        self.state_shape, self.action_shape, self.high_action_value = state_shape, action_shape, high_action_value\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the next state\\n\",\n    \"        self.next_state = tf.placeholder(tf.float32, [None, state_shape], 'next_state')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for reward\\n\",\n    \"        self.reward = tf.placeholder(tf.float32, [None, 1], 'reward')\\n\",\n    \"        \\n\",\n    \"        #with the actor variable scope\\n\",\n    \"        with tf.variable_scope('Actor'):\\n\",\n    \"\\n\",\n    \"            #define the main actor network which is parameterized by phi. Actor network takes the state\\n\",\n    \"            #as an input and returns the action to be performed in that state\\n\",\n    \"            self.actor = self.build_actor_network(self.state, scope='main', trainable=True)\\n\",\n    \"            \\n\",\n    \"            #Define the target actor network which is parameterized by phi dash. Target actor network takes\\n\",\n    \"            #the next state as an input and returns the action to be performed in that state\\n\",\n    \"            target_actor = self.build_actor_network(self.next_state, scope='target', trainable=False)\\n\",\n    \"            \\n\",\n    \"        #with the critic variable scope\\n\",\n    \"        with tf.variable_scope('Critic'):\\n\",\n    \"            \\n\",\n    \"            #define the main critic network which is parameterized by theta. 
Critic network takes the state\n",
    "            #and also the action produced by the actor in that state as an input and returns the Q value\n",
    "            critic = self.build_critic_network(self.state, self.actor, scope='main', trainable=True)\n",
    "            \n",
    "            #Define the target critic network which is parameterized by theta dash. Target critic network takes\n",
    "            #the next state and also the action produced by the target actor network in the next state as\n",
    "            #an input and returns the Q value\n",
    "            target_critic = self.build_critic_network(self.next_state, target_actor, scope='target', \n",
    "                                                      trainable=False)\n",
    "            \n",
    "        \n",
    "        #get the parameters of the main actor network, phi\n",
    "        self.main_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/main')\n",
    "        \n",
    "        #get the parameters of the target actor network, phi dash\n",
    "        self.target_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')\n",
    "    \n",
    "        #get the parameters of the main critic network, theta\n",
    "        self.main_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/main')\n",
    "        \n",
    "        #get the parameters of the target critic network, theta dash\n",
    "        self.target_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')\n",
    "\n",
    "        #perform the soft replacement and update the parameters of the target actor network and\n",
    "        #the parameters of the target critic network\n",
    "        self.soft_replacement = [\n",
    "\n",
    "            [tf.assign(phi_, tau*phi + (1-tau)*phi_), 
tf.assign(theta_, tau*theta + (1-tau)*theta_)]\\n\",\n    \"            for phi, phi_, theta, theta_ in zip(self.main_actor_params, self.target_actor_params, self.main_critic_params, self.target_critic_params)\\n\",\n    \"\\n\",\n    \"            ]\\n\",\n    \"        \\n\",\n    \"        #compute the target Q value, we learned that the target Q value can be computed as the\\n\",\n    \"        #sum of reward and discounted Q value of next state-action pair\\n\",\n    \"        y = self.reward + gamma * target_critic\\n\",\n    \"        \\n\",\n    \"        #now, let's compute the loss of the critic network. The loss of the critic network is the mean\\n\",\n    \"        #squared error between the target Q value and the predicted Q value\\n\",\n    \"        MSE = tf.losses.mean_squared_error(labels=y, predictions=critic)\\n\",\n    \"        \\n\",\n    \"        #train the critic network by minimizing the mean squared error using Adam optimizer\\n\",\n    \"        self.train_critic = tf.train.AdamOptimizer(0.01).minimize(MSE, name=\\\"adam-ink\\\", var_list = self.main_critic_params)\\n\",\n    \"        \\n\",\n    \"        \\n\",\n    \"        #We learned that the objective function of the actor is to generate an action that maximizes\\n\",\n    \"        #the Q value produced by the critic network. We can maximize the above objective by computing gradients \\n\",\n    \"        #and by performing gradient ascent. However, it is a standard convention to perform minimization rather \\n\",\n    \"        #than maximization. 
So, we can convert the above maximization objective into the minimization\\n\",\n    \"        #objective by just adding a negative sign.\\n\",\n    \"        \\n\",\n    \"        \\n\",\n    \"        #now we can minimize the actor network objective by computing gradients and by performing gradient descent\\n\",\n    \"        actor_loss = -tf.reduce_mean(critic)    \\n\",\n    \"           \\n\",\n    \"        #train the actor network by minimizing the loss using Adam optimizer\\n\",\n    \"        self.train_actor = tf.train.AdamOptimizer(0.001).minimize(actor_loss, var_list=self.main_actor_params)\\n\",\n    \"            \\n\",\n    \"        #initialize all the TensorFlow variables:\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"       \\n\",\n    \"    #let's define a function called select_action for selecting the action with the noise to ensure exploration\\n\",\n    \"    def select_action(self, state):\\n\",\n    \"        \\n\",\n    \"        #run the actor network and get the action\\n\",\n    \"        action = self.sess.run(self.actor, {self.state: state[np.newaxis, :]})[0]\\n\",\n    \"        \\n\",\n    \"        #now, we generate a normal distribution with mean as action and standard deviation as the\\n\",\n    \"        #noise and we randomly select an action from this normal distribution\\n\",\n    \"        action = np.random.normal(action, self.noise)\\n\",\n    \"        \\n\",\n    \"        #we need to make sure that our action should not fall away from the action bound. 
So, we\\n\",\n    \"        #clip the action so that they lie within the action bound and then we return the action\\n\",\n    \"        action =  np.clip(action, action_bound[0],action_bound[1])\\n\",\n    \"        \\n\",\n    \"        return action\\n\",\n    \"        \\n\",\n    \"    #now, let's define the train function\\n\",\n    \"    def train(self):\\n\",\n    \"        \\n\",\n    \"        #perform the soft replacement\\n\",\n    \"        self.sess.run(self.soft_replacement)\\n\",\n    \"        \\n\",\n    \"        #randomly select indices from the replay buffer with the given batch size\\n\",\n    \"        indices = np.random.choice(replay_buffer, size=batch_size)\\n\",\n    \"        \\n\",\n    \"        #select the batch of transitions from the replay buffer with the selected indices\\n\",\n    \"        batch_transition = self.replay_buffer[indices, :]\\n\",\n    \"\\n\",\n    \"        #get the batch of states, actions, rewards, and next states\\n\",\n    \"        batch_states = batch_transition[:, :self.state_shape]\\n\",\n    \"        batch_actions = batch_transition[:, self.state_shape: self.state_shape + self.action_shape]\\n\",\n    \"        batch_rewards = batch_transition[:, -self.state_shape - 1: -self.state_shape]\\n\",\n    \"        batch_next_state = batch_transition[:, -self.state_shape:]\\n\",\n    \"\\n\",\n    \"        #train the actor network\\n\",\n    \"        self.sess.run(self.train_actor, {self.state: batch_states})\\n\",\n    \"        \\n\",\n    \"        #train the critic network\\n\",\n    \"        self.sess.run(self.train_critic, {self.state: batch_states, self.actor: batch_actions,\\n\",\n    \"                                          self.reward: batch_rewards, self.next_state: batch_next_state})\\n\",\n    \"        \\n\",\n    \"\\n\",\n    \"    #now, let's store the transitions in the replay buffer\\n\",\n    \"    def store_transition(self, state, actor, reward, next_state):\\n\",\n    \"\\n\",\n    
\"        #first stack the state, action, reward, and next state\\n\",\n    \"        trans = np.hstack((state,actor,[reward],next_state))\\n\",\n    \"        \\n\",\n    \"        #get the index\\n\",\n    \"        index = self.num_transitions % replay_buffer\\n\",\n    \"        \\n\",\n    \"        #store the transition\\n\",\n    \"        self.replay_buffer[index, :] = trans\\n\",\n    \"        \\n\",\n    \"        #update the number of transitions\\n\",\n    \"        self.num_transitions += 1\\n\",\n    \"        \\n\",\n    \"        #if the number of transitions is greater than the replay buffer then train the network\\n\",\n    \"        if self.num_transitions > replay_buffer:\\n\",\n    \"            self.noise *= 0.99995\\n\",\n    \"            self.train()\\n\",\n    \"            \\n\",\n    \"\\n\",\n    \"    def build_actor_network(self, state, scope, trainable):\\n\",\n    \"        \\n\",\n    \"        #we define a function called build_actor_network for building the actor network. The\\n\",\n    \"        #actor network takes the state and returns the action to be performed in that state\\n\",\n    \"        with tf.variable_scope(scope):\\n\",\n    \"            layer_1 = tf.layers.dense(state, 30, activation = tf.nn.tanh, name = 'layer_1', trainable = trainable)\\n\",\n    \"            actor = tf.layers.dense(layer_1, self.action_shape, activation = tf.nn.tanh, name = 'actor', trainable = trainable)     \\n\",\n    \"            return tf.multiply(actor, self.high_action_value, name = \\\"scaled_a\\\")  \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"    def build_critic_network(self, state, actor, scope, trainable):\\n\",\n    \"        \\n\",\n    \"        #we define a function called build_critic_network for building the critic network. 
The\\n\",\n    \"        #critic network takes the state and the action produced by the actor in that state and returns the Q value\\n\",\n    \"        with tf.variable_scope(scope):\\n\",\n    \"            w1_s = tf.get_variable('w1_s', [self.state_shape, 30], trainable = trainable)\\n\",\n    \"            w1_a = tf.get_variable('w1_a', [self.action_shape, 30], trainable = trainable)\\n\",\n    \"            b1 = tf.get_variable('b1', [1, 30], trainable = trainable)\\n\",\n    \"            net = tf.nn.tanh( tf.matmul(state, w1_s) + tf.matmul(actor, w1_a) + b1 )\\n\",\n    \"\\n\",\n    \"            critic = tf.layers.dense(net, 1, trainable = trainable)\\n\",\n    \"            return critic\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, let's create an object to our DDPG class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ddpg = DDPG(state_shape, action_shape, action_bound[1])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 300\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps in each episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 500 \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    
#initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #for every step\\n\",\n    \"    for j in range(num_timesteps):\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"\\n\",\n    \"        #select the action\\n\",\n    \"        action = ddpg.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the transition in the replay buffer\\n\",\n    \"        ddpg.store_transition(state, action, reward, next_state)\\n\",\n    \"      \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"    \\n\",\n    \"        #if the state is the terminal state then break\\n\",\n    \"        if done:\\n\",\n    \"            break\\n\",\n    \"    \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"\\n\",\n    \"    #print the return for every 10 episodes\\n\",\n    \"    if i %10 ==0:\\n\",\n    \"         print(\\\"Episode:{}, Return: {}\\\".format(i,Return))  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we learned how DDPG works and how to implement DDPG, in the next section,\\n\",\n    \"we will learn another interesting algorithm called twin delayed DDPG. 
\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "12. Learning DDPG, TD3 and SAC/README.md",
    "content": "# 12. Learning DDPG, TD3 and SAC\n* 12.1. Deep Deterministic Policy Gradient\n   * 12.1.1. An Overview of DDPG\n* 12.2. Components of DDPG\n  * 12.2.1. Critic Network\n  * 12.2.2. Actor Network\n* 12.3. Putting it all Together\n* 12.4. Algorithm - DDPG\n* 12.5. Swinging Up the Pendulum using DDPG\n* 12.6. Twin Delayed DDPG\n* 12.7. Components of TD3\n   * 12.7.1. Key Features of TD3\n   * 12.7.2. Clipped Double Q Learning\n   * 12.7.3. Delayed Policy Updates\n   * 12.7.4. Target Policy Smoothing\n* 12.8. Putting it all Together\n* 12.9. Algorithm - TD3\n* 12.10. Soft Actor Critic\n* 12.11. Components of SAC\n   * 12.11.1. Understanding Soft Actor Critic\n   * 12.11.2. V and Q Function with the Entropy Term\n   * 12.11.3. Critic Network\n      * 12.11.3.1. Value Network\n      * 12.11.3.2. Q Network\n   * 12.11.4. Actor Network\n* 12.12. Putting it all Together\n* 12.13. Algorithm - SAC\n"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/ Implementing PPO-clipped method-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Implementing PPO-clipped method\\n\",\n    \"\\n\",\n    \"Let's implement the PPO-clipped method for swinging up the pendulum task. The code\\n\",\n    \"used in this section is adapted from one of the very good PPO implementations (https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/12_Proximal_Policy_Optimization) by Morvan. \\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import tensorflow as tf\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import gym\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the gym environment\\n\",\n    \"\\n\",\n    \"Let's create a pendulum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Pendulum-v0').unwrapped\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   
]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that the pendulum is a continuous environment and thus our action space consists of\\n\",\n    \"continuous values. So, we get the bound of our action space:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the epsilon value which is used in the clipped objective:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"epsilon = 0.2 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the PPO class\\n\",\n    \"\\n\",\n    \"Let's define the class called PPO where we will implement the PPO algorithm.  
For a clear understanding, you can also refer to the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class PPO(object):\\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self):\\n\",\n    \"        \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.Session()\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state_ph = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"\\n\",\n    \"        #now, let's build the value network which returns the value of a state\\n\",\n    \"        with tf.variable_scope('value'):\\n\",\n    \"            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu)\\n\",\n    \"            self.v = tf.layers.dense(layer1, 1)\\n\",\n    \"            \\n\",\n    \"            #define the placeholder for the Q value\\n\",\n    \"            self.Q = tf.placeholder(tf.float32, [None, 1], 'discounted_r')\\n\",\n    \"            \\n\",\n    \"            #define the advantage value as the difference between the Q value and state value\\n\",\n    \"            self.advantage = self.Q - self.v\\n\",\n    \"\\n\",\n    \"            #compute the loss of the value network\\n\",\n    \"            self.value_loss = tf.reduce_mean(tf.square(self.advantage))\\n\",\n    \"            \\n\",\n    \"            #train the value network by minimizing the loss using Adam optimizer\\n\",\n    \"            self.train_value_nw = tf.train.AdamOptimizer(0.002).minimize(self.value_loss)\\n\",\n    \"\\n\",\n    \"        #now, we obtain the policy and its parameters from the policy network\\n\",\n    \"        pi, pi_params = self.build_policy_network('pi', trainable=True)\\n\",\n    \"\\n\",\n    \"        #obtain the old policy and its parameters from the policy network\\n\",\n    
\"        oldpi, oldpi_params = self.build_policy_network('oldpi', trainable=False)\\n\",\n    \"        \\n\",\n    \"        #sample an action from the new policy\\n\",\n    \"        with tf.variable_scope('sample_action'):\\n\",\n    \"            self.sample_op = tf.squeeze(pi.sample(1), axis=0)       \\n\",\n    \"\\n\",\n    \"        #update the parameters of the old policy\\n\",\n    \"        with tf.variable_scope('update_oldpi'):\\n\",\n    \"            self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]\\n\",\n    \"\\n\",\n    \"        #define the placeholder for the action\\n\",\n    \"        self.action_ph = tf.placeholder(tf.float32, [None, action_shape], 'action')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the advantage\\n\",\n    \"        self.advantage_ph = tf.placeholder(tf.float32, [None, 1], 'advantage')\\n\",\n    \"\\n\",\n    \"        #now, let's define our surrogate objective function of the policy network\\n\",\n    \"        with tf.variable_scope('loss'):\\n\",\n    \"            with tf.variable_scope('surrogate'):\\n\",\n    \"                \\n\",\n    \"                #first, let's define the ratio \\n\",\n    \"                ratio = pi.prob(self.action_ph) / oldpi.prob(self.action_ph)\\n\",\n    \"    \\n\",\n    \"                #define the objective by multiplying ratio and the advantage value\\n\",\n    \"                objective = ratio * self.advantage_ph\\n\",\n    \"                \\n\",\n    \"                #define the objective function with the clipped and unclipped objective:\\n\",\n    \"                L = tf.reduce_mean(tf.minimum(objective, \\n\",\n    \"                                   tf.clip_by_value(ratio, 1.-epsilon, 1.+ epsilon)*self.advantage_ph))\\n\",\n    \"                \\n\",\n    \"            \\n\",\n    \"            #now, we can compute the gradient and maximize the objective function using gradient\\n\",\n    \"            
#ascent. However, instead of doing that, we can convert the above maximization objective\\n\",\n    \"            #into the minimization objective by just adding a negative sign. So, we can denote the loss of\\n\",\n    \"            #the policy network as:\\n\",\n    \"            \\n\",\n    \"            self.policy_loss = -L\\n\",\n    \"    \\n\",\n    \"        #train the policy network by minimizing the loss using Adam optimizer:\\n\",\n    \"        with tf.variable_scope('train_policy'):\\n\",\n    \"            self.train_policy_nw = tf.train.AdamOptimizer(0.001).minimize(self.policy_loss)\\n\",\n    \"        \\n\",\n    \"        #initialize all the TensorFlow variables\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"    #now, let's define the train function\\n\",\n    \"    def train(self, state, action, reward):\\n\",\n    \"        \\n\",\n    \"        #update the old policy\\n\",\n    \"        self.sess.run(self.update_oldpi_op)\\n\",\n    \"        \\n\",\n    \"        #compute the advantage value\\n\",\n    \"        adv = self.sess.run(self.advantage, {self.state_ph: state, self.Q: reward})\\n\",\n    \"            \\n\",\n    \"        #train the policy network\\n\",\n    \"        [self.sess.run(self.train_policy_nw, {self.state_ph: state, self.action_ph: action, self.advantage_ph: adv}) for _ in range(10)]\\n\",\n    \"        \\n\",\n    \"        #train the value network\\n\",\n    \"        [self.sess.run(self.train_value_nw, {self.state_ph: state, self.Q: reward}) for _ in range(10)]\\n\",\n    \"\\n\",\n    \"    \\n\",\n    \"    #we define a function called build_policy_network for building the policy network. 
Note\\n\",\n    \"    #that our action space is continuous here, so our policy network returns the mean and\\n\",\n    \"    #variance of the action as an output and then we generate a normal distribution using this\\n\",\n    \"    #mean and variance and we select an action by sampling from this normal distribution\\n\",\n    \"\\n\",\n    \"    def build_policy_network(self, name, trainable):\\n\",\n    \"        with tf.variable_scope(name):\\n\",\n    \"            \\n\",\n    \"            #define the layer of the network\\n\",\n    \"            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute mean\\n\",\n    \"            mu = 2 * tf.layers.dense(layer1, action_shape, tf.nn.tanh, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute standard deviation\\n\",\n    \"            sigma = tf.layers.dense(layer1, action_shape, tf.nn.softplus, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute the normal distribution\\n\",\n    \"            norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)\\n\",\n    \"            \\n\",\n    \"        #get the parameters of the policy network\\n\",\n    \"        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)\\n\",\n    \"        return norm_dist, params\\n\",\n    \"\\n\",\n    \"    #let's define a function called select_action for selecting the action\\n\",\n    \"    def select_action(self, state):\\n\",\n    \"        state = state[np.newaxis, :]\\n\",\n    \"        \\n\",\n    \"        #sample an action from the normal distribution generated by the policy network\\n\",\n    \"        action = self.sess.run(self.sample_op, {self.state_ph: state})[0]\\n\",\n    \"        \\n\",\n    \"        #we clip the action so that they lie within the action bound and then we return the action\\n\",\n    \"        action =  np.clip(action, action_bound[0], 
action_bound[1])\\n\",\n    \"\\n\",\n    \"        return action\\n\",\n    \"\\n\",\n    \"    #we define a function called get_state_value to obtain the value of the state computed by the value network\\n\",\n    \"    def get_state_value(self, state):\\n\",\n    \"        if state.ndim < 2: state = state[np.newaxis, :]\\n\",\n    \"        return self.sess.run(self.v, {self.state_ph: state})[0, 0]\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, let's create an object to our PPO class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ppo = PPO()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps in each episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 200\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.9\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 32\"\n   ]\n  
},\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's train\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Episode:0, Return: -1597.0752474266517\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the lists for holding the states, actions, and rewards obtained in the episode\\n\",\n    \"    episode_states, episode_actions, episode_rewards = [], [], []\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #for every step\\n\",\n    \"    for t in range(num_timesteps):   \\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #select the action\\n\",\n    \"        action = ppo.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, _ = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the state, action, and reward in the list\\n\",\n    \"        episode_states.append(state)\\n\",\n    \"        episode_actions.append(action)\\n\",\n    \"        episode_rewards.append((reward+8)/8)    \\n\",\n    \"        \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"        \\n\",\n    \"        #if we reached the batch size or if we reached the final step of the episode\\n\",\n    \"        if (t+1) % batch_size == 0 or t == 
num_timesteps-1:\\n\",\n    \"            \\n\",\n    \"            #compute the value of the next state\\n\",\n    \"            v_s_ = ppo.get_state_value(next_state)\\n\",\n    \"            \\n\",\n    \"            #compute Q value as sum of reward and discounted value of next state\\n\",\n    \"            discounted_r = []\\n\",\n    \"            for reward in episode_rewards[::-1]:\\n\",\n    \"                v_s_ = reward + gamma * v_s_\\n\",\n    \"                discounted_r.append(v_s_)\\n\",\n    \"            discounted_r.reverse()\\n\",\n    \"    \\n\",\n    \"            #stack the episode states, actions, and rewards:\\n\",\n    \"            es, ea, er = np.vstack(episode_states), np.vstack(episode_actions), np.array(discounted_r)[:, np.newaxis]\\n\",\n    \"            \\n\",\n    \"            #empty the lists\\n\",\n    \"            episode_states, episode_actions, episode_rewards = [], [], []\\n\",\n    \"            \\n\",\n    \"            #train the network\\n\",\n    \"            ppo.train(es, ea, er)\\n\",\n    \"        \\n\",\n    \"    #print the return for every 10 episodes\\n\",\n    \"    if i % 10 == 0:\\n\",\n    \"        print(\\\"Episode:{}, Return: {}\\\".format(i, Return))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have learned how PPO with the clipped objective works and how to implement it, in the next section we will learn about another variant of PPO called PPO with\\n\",\n    \"the penalized objective.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   
\"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/11.09. Implementing PPO-Clipped Method-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Implementing PPO-clipped method\\n\",\n    \"\\n\",\n    \"Let's implement the PPO-clipped method for swinging up the pendulum task. The code\\n\",\n    \"used in this section is adapted from one of the very good PPO implementations (https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/12_Proximal_Policy_Optimization) by Morvan. \\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"non-resource variables are not supported in the long term\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import gym\\n\",\n    \"\\n\",\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the gym environment\\n\",\n    \"\\n\",\n    \"Let's create a pendulum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Pendulum-v0').unwrapped\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    
\"Get the state shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that the pendulum is a continuous environment and thus our action space consists of\\n\",\n    \"continuous values. So, we get the bound of our action space:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the epsilon value which is used in the clipped objective:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"epsilon = 0.2 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the PPO class\\n\",\n    \"\\n\",\n    \"Let's define the class called PPO where we will implement the PPO algorithm.  
For a clear understanding, you can also refer to the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class PPO(object):\\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self):\\n\",\n    \"        \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.Session()\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state_ph = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"\\n\",\n    \"        #now, let's build the value network which returns the value of a state\\n\",\n    \"        with tf.variable_scope('value'):\\n\",\n    \"            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu)\\n\",\n    \"            self.v = tf.layers.dense(layer1, 1)\\n\",\n    \"            \\n\",\n    \"            #define the placeholder for the Q value\\n\",\n    \"            self.Q = tf.placeholder(tf.float32, [None, 1], 'discounted_r')\\n\",\n    \"            \\n\",\n    \"            #define the advantage value as the difference between the Q value and state value\\n\",\n    \"            self.advantage = self.Q - self.v\\n\",\n    \"\\n\",\n    \"            #compute the loss of the value network\\n\",\n    \"            self.value_loss = tf.reduce_mean(tf.square(self.advantage))\\n\",\n    \"            \\n\",\n    \"            #train the value network by minimizing the loss using Adam optimizer\\n\",\n    \"            self.train_value_nw = tf.train.AdamOptimizer(0.002).minimize(self.value_loss)\\n\",\n    \"\\n\",\n    \"        #now, we obtain the policy and its parameters from the policy network\\n\",\n    \"        pi, pi_params = self.build_policy_network('pi', trainable=True)\\n\",\n    \"\\n\",\n    \"        #obtain the old policy and its parameters from the policy network\\n\",\n    
\"        oldpi, oldpi_params = self.build_policy_network('oldpi', trainable=False)\\n\",\n    \"        \\n\",\n    \"        #sample an action from the new policy\\n\",\n    \"        with tf.variable_scope('sample_action'):\\n\",\n    \"            self.sample_op = tf.squeeze(pi.sample(1), axis=0)       \\n\",\n    \"\\n\",\n    \"        #update the parameters of the old policy\\n\",\n    \"        with tf.variable_scope('update_oldpi'):\\n\",\n    \"            self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]\\n\",\n    \"\\n\",\n    \"        #define the placeholder for the action\\n\",\n    \"        self.action_ph = tf.placeholder(tf.float32, [None, action_shape], 'action')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the advantage\\n\",\n    \"        self.advantage_ph = tf.placeholder(tf.float32, [None, 1], 'advantage')\\n\",\n    \"\\n\",\n    \"        #now, let's define our surrogate objective function of the policy network\\n\",\n    \"        with tf.variable_scope('loss'):\\n\",\n    \"            with tf.variable_scope('surrogate'):\\n\",\n    \"                \\n\",\n    \"                #first, let's define the ratio \\n\",\n    \"                ratio = pi.prob(self.action_ph) / oldpi.prob(self.action_ph)\\n\",\n    \"    \\n\",\n    \"                #define the objective by multiplying ratio and the advantage value\\n\",\n    \"                objective = ratio * self.advantage_ph\\n\",\n    \"                \\n\",\n    \"                #define the objective function with the clipped and unclipped objective:\\n\",\n    \"                L = tf.reduce_mean(tf.minimum(objective, \\n\",\n    \"                                   tf.clip_by_value(ratio, 1.-epsilon, 1.+ epsilon)*self.advantage_ph))\\n\",\n    \"                \\n\",\n    \"            \\n\",\n    \"            #now, we can compute the gradient and maximize the objective function using gradient\\n\",\n    \"            
#ascent. However, instead of doing that, we can convert the above maximization objective\\n\",\n    \"            #into the minimization objective by just adding a negative sign. So, we can denote the loss of\\n\",\n    \"            #the policy network as:\\n\",\n    \"            \\n\",\n    \"            self.policy_loss = -L\\n\",\n    \"    \\n\",\n    \"        #train the policy network by minimizing the loss using Adam optimizer:\\n\",\n    \"        with tf.variable_scope('train_policy'):\\n\",\n    \"            self.train_policy_nw = tf.train.AdamOptimizer(0.001).minimize(self.policy_loss)\\n\",\n    \"        \\n\",\n    \"        #initialize all the TensorFlow variables\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"    #now, let's define the train function\\n\",\n    \"    def train(self, state, action, reward):\\n\",\n    \"        \\n\",\n    \"        #update the old policy\\n\",\n    \"        self.sess.run(self.update_oldpi_op)\\n\",\n    \"        \\n\",\n    \"        #compute the advantage value\\n\",\n    \"        adv = self.sess.run(self.advantage, {self.state_ph: state, self.Q: reward})\\n\",\n    \"            \\n\",\n    \"        #train the policy network\\n\",\n    \"        [self.sess.run(self.train_policy_nw, {self.state_ph: state, self.action_ph: action, self.advantage_ph: adv}) for _ in range(10)]\\n\",\n    \"        \\n\",\n    \"        #train the value network\\n\",\n    \"        [self.sess.run(self.train_value_nw, {self.state_ph: state, self.Q: reward}) for _ in range(10)]\\n\",\n    \"\\n\",\n    \"    \\n\",\n    \"    #we define a function called build_policy_network for building the policy network. 
Note\\n\",\n    \"    #that our action space is continuous here, so our policy network returns the mean and\\n\",\n    \"    #variance of the action as an output and then we generate a normal distribution using this\\n\",\n    \"    #mean and variance and we select an action by sampling from this normal distribution\\n\",\n    \"\\n\",\n    \"    def build_policy_network(self, name, trainable):\\n\",\n    \"        with tf.variable_scope(name):\\n\",\n    \"            \\n\",\n    \"            #define the layer of the network\\n\",\n    \"            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute mean\\n\",\n    \"            mu = 2 * tf.layers.dense(layer1, action_shape, tf.nn.tanh, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute standard deviation\\n\",\n    \"            sigma = tf.layers.dense(layer1, action_shape, tf.nn.softplus, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute the normal distribution\\n\",\n    \"            norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)\\n\",\n    \"            \\n\",\n    \"        #get the parameters of the policy network\\n\",\n    \"        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)\\n\",\n    \"        return norm_dist, params\\n\",\n    \"\\n\",\n    \"    #let's define a function called select_action for selecting the action\\n\",\n    \"    def select_action(self, state):\\n\",\n    \"        state = state[np.newaxis, :]\\n\",\n    \"        \\n\",\n    \"        #sample an action from the normal distribution generated by the policy network\\n\",\n    \"        action = self.sess.run(self.sample_op, {self.state_ph: state})[0]\\n\",\n    \"        \\n\",\n    \"        #we clip the action so that they lie within the action bound and then we return the action\\n\",\n    \"        action =  np.clip(action, action_bound[0], 
action_bound[1])\\n\",\n    \"\\n\",\n    \"        return action\\n\",\n    \"\\n\",\n    \"    #we define a function called get_state_value to obtain the value of the state computed by the value network\\n\",\n    \"    def get_state_value(self, state):\\n\",\n    \"        if state.ndim < 2: state = state[np.newaxis, :]\\n\",\n    \"        return self.sess.run(self.v, {self.state_ph: state})[0, 0]\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, let's create an object to our PPO class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From <ipython-input-7-e0c4c9e17d62>:13: dense (from tensorflow.python.keras.legacy_tf_layers.core) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"Use keras.layers.Dense instead.\\n\",\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/keras/legacy_tf_layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"Please use `layer.__call__` method instead.\\n\",\n      \"WARNING:tensorflow:From <ipython-input-7-e0c4c9e17d62>:111: Normal.__init__ (from tensorflow.python.ops.distributions.normal) is deprecated and will be removed after 2019-01-01.\\n\",\n      \"Instructions for updating:\\n\",\n      \"The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). 
You should update all references to use `tfp.distributions` instead of `tf.distributions`.\\n\",\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/ops/distributions/normal.py:160: Distribution.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.\\n\",\n      \"Instructions for updating:\\n\",\n      \"The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"ppo = PPO()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps in each episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 200\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.9\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 32\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, 
let's train\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Episode:0, Return: -1478.0535012255646\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the lists for holding the states, actions, and rewards obtained in the episode\\n\",\n    \"    episode_states, episode_actions, episode_rewards = [], [], []\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #for every step\\n\",\n    \"    for t in range(num_timesteps):   \\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #select the action\\n\",\n    \"        action = ppo.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, _ = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the state, action, and reward in the list\\n\",\n    \"        episode_states.append(state)\\n\",\n    \"        episode_actions.append(action)\\n\",\n    \"        episode_rewards.append((reward+8)/8)    \\n\",\n    \"        \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"        \\n\",\n    \"        #if we reached the batch size or if we reached the final step of the episode\\n\",\n    \"        if (t+1) % batch_size == 0 or t == num_timesteps-1:\\n\",\n    \"            \\n\",\n    \"            #compute the value of the 
next state\\n\",\n    \"            v_s_ = ppo.get_state_value(next_state)\\n\",\n    \"            \\n\",\n    \"            #compute Q value as sum of reward and discounted value of next state\\n\",\n    \"            discounted_r = []\\n\",\n    \"            for reward in episode_rewards[::-1]:\\n\",\n    \"                v_s_ = reward + gamma * v_s_\\n\",\n    \"                discounted_r.append(v_s_)\\n\",\n    \"            discounted_r.reverse()\\n\",\n    \"    \\n\",\n    \"            #stack the episode states, actions, and rewards:\\n\",\n    \"            es, ea, er = np.vstack(episode_states), np.vstack(episode_actions), np.array(discounted_r)[:, np.newaxis]\\n\",\n    \"            \\n\",\n    \"            #empty the lists\\n\",\n    \"            episode_states, episode_actions, episode_rewards = [], [], []\\n\",\n    \"            \\n\",\n    \"            #train the network\\n\",\n    \"            ppo.train(es, ea, er)\\n\",\n    \"        \\n\",\n    \"    #print the return for every 10 episodes\\n\",\n    \"    if i %10 ==0:\\n\",\n    \"         print(\\\"Episode:{}, Return: {}\\\".format(i,Return))  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we learned how PPO with clipped objective works and how to implement them, in the next section we will learn another interesting type of PPO algorithm called PPO with\\n\",\n    \"the penalized objective.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/13.01. Trust Region Policy Optimization-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Trust Region Policy Optimization\\n\",\n    \"\\n\",\n    \"Trust region policy optimization, known as TRPO for short, is one of the most popular algorithms in deep reinforcement learning. TRPO is a policy gradient algorithm, and it acts as an improvement on the policy gradient with baseline method we learned in Chapter 8. We learned that policy gradient is an on-policy algorithm, meaning that on every iteration we improve the same policy with which we are generating trajectories. On every iteration, we update the parameter of our network and try to find the improved policy. The update rule for updating the parameter $\\\\theta$ of our network is given as follows:\\n\",\n    \"\\n\",\n    \"$$\\\\theta = \\\\theta + \\\\alpha \\\\nabla_{\\\\theta} J({\\\\theta}) $$\\n\",\n    \"\\n\",\n    \"where $\\\\nabla_{\\\\theta} J({\\\\theta})$ is the gradient and $\\\\alpha$ is known as the step size or learning rate. If the step size is large, then there will be a large policy update, and if it is small, then there will be only a small update to the policy. How can we find an optimal step size? In the policy gradient method, we keep the step size small, and so on every iteration there will be a small improvement in the policy.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"But what happens if we take a large step on every iteration? Let's suppose we have a policy $\\\\pi$ parameterized by $\\\\theta$. So, on every iteration, updating $\\\\theta$ implies that we are improving our policy. If the step size is large, then the policy varies greatly on every iteration, that is, the old policy (the policy used in the previous iteration) and the new policy (the policy used in the current iteration) vary greatly. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"We learned that if the step size is large then the new policy and old policy will vary greatly. 
Since we are using a parameterized policy, making a large update (a large step size) means that the parameters of the old policy and the new policy vary heavily, and this leads to a problem called model collapse.\\n\",\n    \"\\n\",\n    \"This is the reason why, in the policy gradient method, instead of taking a large step to update the parameters of our network, we take a small step and update the parameters so as to keep the old policy and new policy close. But how can we improve on this? Can we take a larger step while still keeping the old policy and new policy close, so that it won't affect our model's performance and will also help us learn quickly? Yes, this is exactly the problem TRPO solves.\\n\",\n    \"\\n\",\n    \"TRPO tries to make a large policy update while imposing a constraint that the old policy and new policy should not vary too much. Okay, what is this constraint? First, how can we measure whether the old policy and new policy differ greatly? This is where we use a measure called the KL divergence. The KL divergence is ubiquitous in reinforcement learning. It tells us how different two probability distributions are from each other. So, we can use the KL divergence to understand whether our old policy and new policy vary greatly. TRPO adds a constraint that the KL divergence between the old policy and the new policy should be less than or equal to some constant $\\\\delta$. That is, when we make a policy update, the old policy and the new policy should not vary by more than some constant. This constraint is called the trust region constraint. \\n\",\n    \"\\n\",\n    \"Thus, TRPO tries to make a large policy update while imposing the constraint that the parameters of the old policy and the new policy should be within the trust region. Note that in the policy gradient method, we use a parameterized policy. 
Thus, keeping the parameters of the old policy and the new policy within the trust region implies that the old policy and the new policy are within the trust region.\\n\",\n    \"\\n\",\n    \"TRPO guarantees monotonic policy improvement, that is, it guarantees that there will always be a policy improvement on every iteration. This is the fundamental idea behind the TRPO algorithm. \\n\",\n    \"\\n\",\n    \"To understand how exactly TRPO works, we should understand the math behind it. The math behind TRPO is fairly heavy, but worry not! It becomes simple once we understand the fundamental math concepts it builds on. So, before diving into the TRPO algorithm, we will first cover several essential math concepts. Then we will learn how to design the TRPO objective function with the trust region constraint, and in the end, we will see how to solve it. \\n\",\n    \"\\n\",\n    \"\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/.ipynb_checkpoints/13.09. Implementing PPO-Clipped Method-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Implementing PPO-clipped method\\n\",\n    \"\\n\",\n    \"Let's implement the PPO-clipped method for swinging up the pendulum task. The code\\n\",\n    \"used in this section is adapted from one of the very good PPO implementations (https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/12_Proximal_Policy_Optimization) by Morvan. \\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import gym\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For a clear understanding of how the PPO works, we use\\n\",\n    \"TensorFlow in the non-eager mode by disabling TensorFlow 2 behavior.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the gym environment\\n\",\n    \"\\n\",\n    \"Let's create a pendulum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": 
{},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Pendulum-v0').unwrapped\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that the pendulum is a continuous environment and thus our action space consists of\\n\",\n    \"continuous values. So, we get the bound of our action space:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the epsilon value which is used in the clipped objective:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"epsilon = 0.2 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the PPO class\\n\",\n    \"\\n\",\n    \"Let's define the class called PPO where we will implement the PPO algorithm.  
For a clear understanding, you can also check the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class PPO(object):\\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self):\\n\",\n    \"        \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.Session()\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state_ph = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"\\n\",\n    \"        #now, let's build the value network which returns the value of a state\\n\",\n    \"        with tf.variable_scope('value'):\\n\",\n    \"            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu)\\n\",\n    \"            self.v = tf.layers.dense(layer1, 1)\\n\",\n    \"            \\n\",\n    \"            #define the placeholder for the Q value\\n\",\n    \"            self.Q = tf.placeholder(tf.float32, [None, 1], 'discounted_r')\\n\",\n    \"            \\n\",\n    \"            #define the advantage value as the difference between the Q value and state value\\n\",\n    \"            self.advantage = self.Q - self.v\\n\",\n    \"\\n\",\n    \"            #compute the loss of the value network\\n\",\n    \"            self.value_loss = tf.reduce_mean(tf.square(self.advantage))\\n\",\n    \"            \\n\",\n    \"            #train the value network by minimizing the loss using Adam optimizer\\n\",\n    \"            self.train_value_nw = tf.train.AdamOptimizer(0.002).minimize(self.value_loss)\\n\",\n    \"\\n\",\n    \"        #now, we obtain the policy and its parameter from the policy network\\n\",\n    \"        pi, pi_params = self.build_policy_network('pi', trainable=True)\\n\",\n    \"\\n\",\n    \"        #obtain the old policy and its parameter from the policy network\\n\",\n    
\"        oldpi, oldpi_params = self.build_policy_network('oldpi', trainable=False)\\n\",\n    \"        \\n\",\n    \"        #sample an action from the new policy\\n\",\n    \"        with tf.variable_scope('sample_action'):\\n\",\n    \"            self.sample_op = tf.squeeze(pi.sample(1), axis=0)       \\n\",\n    \"\\n\",\n    \"        #update the parameters of the old policy\\n\",\n    \"        with tf.variable_scope('update_oldpi'):\\n\",\n    \"            self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]\\n\",\n    \"\\n\",\n    \"        #define the placeholder for the action\\n\",\n    \"        self.action_ph = tf.placeholder(tf.float32, [None, action_shape], 'action')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the advantage\\n\",\n    \"        self.advantage_ph = tf.placeholder(tf.float32, [None, 1], 'advantage')\\n\",\n    \"\\n\",\n    \"        #now, let's define our surrogate objective function of the policy network\\n\",\n    \"        with tf.variable_scope('loss'):\\n\",\n    \"            with tf.variable_scope('surrogate'):\\n\",\n    \"                \\n\",\n    \"                #first, let's define the ratio \\n\",\n    \"                ratio = pi.prob(self.action_ph) / oldpi.prob(self.action_ph)\\n\",\n    \"    \\n\",\n    \"                #define the objective by multiplying ratio and the advantage value\\n\",\n    \"                objective = ratio * self.advantage_ph\\n\",\n    \"                \\n\",\n    \"                #define the objective function with the clipped and unclipped objective:\\n\",\n    \"                L = tf.reduce_mean(tf.minimum(objective, \\n\",\n    \"                                   tf.clip_by_value(ratio, 1.-epsilon, 1.+ epsilon)*self.advantage_ph))\\n\",\n    \"                \\n\",\n    \"            \\n\",\n    \"            #now, we can compute the gradient and maximize the objective function using gradient\\n\",\n    \"            
#ascent. However, instead of doing that, we can convert the above maximization objective\\n\",\n    \"            #into the minimization objective by just adding a negative sign. So, we can denote the loss of\\n\",\n    \"            #the policy network as:\\n\",\n    \"            \\n\",\n    \"            self.policy_loss = -L\\n\",\n    \"    \\n\",\n    \"        #train the policy network by minimizing the loss using Adam optimizer:\\n\",\n    \"        with tf.variable_scope('train_policy'):\\n\",\n    \"            self.train_policy_nw = tf.train.AdamOptimizer(0.001).minimize(self.policy_loss)\\n\",\n    \"        \\n\",\n    \"        #initialize all the TensorFlow variables\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"    #now, let's define the train function\\n\",\n    \"    def train(self, state, action, reward):\\n\",\n    \"        \\n\",\n    \"        #update the old policy\\n\",\n    \"        self.sess.run(self.update_oldpi_op)\\n\",\n    \"        \\n\",\n    \"        #compute the advantage value\\n\",\n    \"        adv = self.sess.run(self.advantage, {self.state_ph: state, self.Q: reward})\\n\",\n    \"            \\n\",\n    \"        #train the policy network\\n\",\n    \"        [self.sess.run(self.train_policy_nw, {self.state_ph: state, self.action_ph: action, self.advantage_ph: adv}) for _ in range(10)]\\n\",\n    \"        \\n\",\n    \"        #train the value network\\n\",\n    \"        [self.sess.run(self.train_value_nw, {self.state_ph: state, self.Q: reward}) for _ in range(10)]\\n\",\n    \"\\n\",\n    \"    \\n\",\n    \"    #we define a function called build_policy_network for building the policy network. 
Note\\n\",\n    \"    #that our action space is continuous here, so our policy network returns the mean and\\n\",\n    \"    #variance of the action as an output and then we generate a normal distribution using this\\n\",\n    \"    #mean and variance and we select an action by sampling from this normal distribution\\n\",\n    \"\\n\",\n    \"    def build_policy_network(self, name, trainable):\\n\",\n    \"        with tf.variable_scope(name):\\n\",\n    \"            \\n\",\n    \"            #define the layer of the network\\n\",\n    \"            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute mean\\n\",\n    \"            mu = 2 * tf.layers.dense(layer1, action_shape, tf.nn.tanh, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute standard deviation\\n\",\n    \"            sigma = tf.layers.dense(layer1, action_shape, tf.nn.softplus, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute the normal distribution\\n\",\n    \"            norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)\\n\",\n    \"            \\n\",\n    \"        #get the parameters of the policy network\\n\",\n    \"        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)\\n\",\n    \"        return norm_dist, params\\n\",\n    \"\\n\",\n    \"    #let's define a function called select_action for selecting the action\\n\",\n    \"    def select_action(self, state):\\n\",\n    \"        state = state[np.newaxis, :]\\n\",\n    \"        \\n\",\n    \"        #sample an action from the normal distribution generated by the policy network\\n\",\n    \"        action = self.sess.run(self.sample_op, {self.state_ph: state})[0]\\n\",\n    \"        \\n\",\n    \"        #we clip the action so that they lie within the action bound and then we return the action\\n\",\n    \"        action =  np.clip(action, action_bound[0], 
action_bound[1])\\n\",\n    \"\\n\",\n    \"        return action\\n\",\n    \"\\n\",\n    \"    #we define a function called get_state_value to obtain the value of the state computed by the value network\\n\",\n    \"    def get_state_value(self, state):\\n\",\n    \"        if state.ndim < 2: state = state[np.newaxis, :]\\n\",\n    \"        return self.sess.run(self.v, {self.state_ph: state})[0, 0]\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, let's create an object to our PPO class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ppo = PPO()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps in each episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 200\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.9\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 32\"\n   ]\n  
},\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's train\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Episode:0, Return: -1513.1276787297722\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the lists for holding the states, actions, and rewards obtained in the episode\\n\",\n    \"    episode_states, episode_actions, episode_rewards = [], [], []\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #for every step\\n\",\n    \"    for t in range(num_timesteps):   \\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #select the action\\n\",\n    \"        action = ppo.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, _ = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the state, action, and reward in the list\\n\",\n    \"        episode_states.append(state)\\n\",\n    \"        episode_actions.append(action)\\n\",\n    \"        episode_rewards.append((reward+8)/8)    \\n\",\n    \"        \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"        \\n\",\n    \"        #if we reached the batch size or if we reached the final step of the episode\\n\",\n    \"        if (t+1) % batch_size == 0 or t == 
num_timesteps-1:\\n\",\n    \"            \\n\",\n    \"            #compute the value of the next state\\n\",\n    \"            v_s_ = ppo.get_state_value(next_state)\\n\",\n    \"            \\n\",\n    \"            #compute Q value as sum of reward and discounted value of next state\\n\",\n    \"            discounted_r = []\\n\",\n    \"            for reward in episode_rewards[::-1]:\\n\",\n    \"                v_s_ = reward + gamma * v_s_\\n\",\n    \"                discounted_r.append(v_s_)\\n\",\n    \"            discounted_r.reverse()\\n\",\n    \"    \\n\",\n    \"            #stack the episode states, actions, and rewards:\\n\",\n    \"            es, ea, er = np.vstack(episode_states), np.vstack(episode_actions), np.array(discounted_r)[:, np.newaxis]\\n\",\n    \"            \\n\",\n    \"            #empty the lists\\n\",\n    \"            episode_states, episode_actions, episode_rewards = [], [], []\\n\",\n    \"            \\n\",\n    \"            #train the network\\n\",\n    \"            ppo.train(es, ea, er)\\n\",\n    \"        \\n\",\n    \"    #print the return for every 10 episodes\\n\",\n    \"    if i %10 ==0:\\n\",\n    \"         print(\\\"Episode:{}, Return: {}\\\".format(i,Return))  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we learned how PPO with clipped objective works and how to implement them, in the next section we will learn another interesting type of PPO algorithm called PPO with\\n\",\n    \"the penalized objective.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   
\"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/13.09. Implementing PPO-Clipped Method.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"source\": [\n    \"# Implementing PPO-clipped method\\n\",\n    \"\\n\",\n    \"Let's implement the PPO-clipped method for swinging up the pendulum task. The code\\n\",\n    \"used in this section is adapted from one of the very good PPO implementations (https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/12_Proximal_Policy_Optimization) by Morvan. \\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import gym\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For a clear understanding of how the PPO works, we use\\n\",\n    \"TensorFlow in the non-eager mode by disabling TensorFlow 2 behavior.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating the gym environment\\n\",\n    \"\\n\",\n    \"Let's create a pendulum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": 
{},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Pendulum-v0').unwrapped\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the state shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state_shape = env.observation_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the action shape of the environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_shape = env.action_space.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that the pendulum is a continuous environment and thus our action space consists of\\n\",\n    \"continuous values. So, we get the bound of our action space:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_bound = [env.action_space.low, env.action_space.high]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the epsilon value which is used in the clipped objective:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"epsilon = 0.2 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the PPO class\\n\",\n    \"\\n\",\n    \"Let's define the class called PPO where we will implement the PPO algorithm.  
For a clear understanding, you can also check the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class PPO(object):\\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self):\\n\",\n    \"        \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.Session()\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state_ph = tf.placeholder(tf.float32, [None, state_shape], 'state')\\n\",\n    \"\\n\",\n    \"        #now, let's build the value network which returns the value of a state\\n\",\n    \"        with tf.variable_scope('value'):\\n\",\n    \"            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu)\\n\",\n    \"            self.v = tf.layers.dense(layer1, 1)\\n\",\n    \"            \\n\",\n    \"            #define the placeholder for the Q value\\n\",\n    \"            self.Q = tf.placeholder(tf.float32, [None, 1], 'discounted_r')\\n\",\n    \"            \\n\",\n    \"            #define the advantage value as the difference between the Q value and state value\\n\",\n    \"            self.advantage = self.Q - self.v\\n\",\n    \"\\n\",\n    \"            #compute the loss of the value network\\n\",\n    \"            self.value_loss = tf.reduce_mean(tf.square(self.advantage))\\n\",\n    \"            \\n\",\n    \"            #train the value network by minimizing the loss using Adam optimizer\\n\",\n    \"            self.train_value_nw = tf.train.AdamOptimizer(0.002).minimize(self.value_loss)\\n\",\n    \"\\n\",\n    \"        #now, we obtain the policy and its parameter from the policy network\\n\",\n    \"        pi, pi_params = self.build_policy_network('pi', trainable=True)\\n\",\n    \"\\n\",\n    \"        #obtain the old policy and its parameter from the policy network\\n\",\n    
\"        oldpi, oldpi_params = self.build_policy_network('oldpi', trainable=False)\\n\",\n    \"        \\n\",\n    \"        #sample an action from the new policy\\n\",\n    \"        with tf.variable_scope('sample_action'):\\n\",\n    \"            self.sample_op = tf.squeeze(pi.sample(1), axis=0)       \\n\",\n    \"\\n\",\n    \"        #update the parameters of the old policy\\n\",\n    \"        with tf.variable_scope('update_oldpi'):\\n\",\n    \"            self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]\\n\",\n    \"\\n\",\n    \"        #define the placeholder for the action\\n\",\n    \"        self.action_ph = tf.placeholder(tf.float32, [None, action_shape], 'action')\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the advantage\\n\",\n    \"        self.advantage_ph = tf.placeholder(tf.float32, [None, 1], 'advantage')\\n\",\n    \"\\n\",\n    \"        #now, let's define our surrogate objective function of the policy network\\n\",\n    \"        with tf.variable_scope('loss'):\\n\",\n    \"            with tf.variable_scope('surrogate'):\\n\",\n    \"                \\n\",\n    \"                #first, let's define the ratio \\n\",\n    \"                ratio = pi.prob(self.action_ph) / oldpi.prob(self.action_ph)\\n\",\n    \"    \\n\",\n    \"                #define the objective by multiplying ratio and the advantage value\\n\",\n    \"                objective = ratio * self.advantage_ph\\n\",\n    \"                \\n\",\n    \"                #define the objective function with the clipped and unclipped objective:\\n\",\n    \"                L = tf.reduce_mean(tf.minimum(objective, \\n\",\n    \"                                   tf.clip_by_value(ratio, 1.-epsilon, 1.+ epsilon)*self.advantage_ph))\\n\",\n    \"                \\n\",\n    \"            \\n\",\n    \"            #now, we can compute the gradient and maximize the objective function using gradient\\n\",\n    \"            
#ascent. However, instead of doing that, we can convert the above maximization objective\\n\",\n    \"            #into the minimization objective by just adding a negative sign. So, we can denote the loss of\\n\",\n    \"            #the policy network as:\\n\",\n    \"            \\n\",\n    \"            self.policy_loss = -L\\n\",\n    \"    \\n\",\n    \"        #train the policy network by minimizing the loss using Adam optimizer:\\n\",\n    \"        with tf.variable_scope('train_policy'):\\n\",\n    \"            self.train_policy_nw = tf.train.AdamOptimizer(0.001).minimize(self.policy_loss)\\n\",\n    \"        \\n\",\n    \"        #initialize all the TensorFlow variables\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"    #now, let's define the train function\\n\",\n    \"    def train(self, state, action, reward):\\n\",\n    \"        \\n\",\n    \"        #update the old policy\\n\",\n    \"        self.sess.run(self.update_oldpi_op)\\n\",\n    \"        \\n\",\n    \"        #compute the advantage value\\n\",\n    \"        adv = self.sess.run(self.advantage, {self.state_ph: state, self.Q: reward})\\n\",\n    \"            \\n\",\n    \"        #train the policy network\\n\",\n    \"        for _ in range(10):\\n\",\n    \"            self.sess.run(self.train_policy_nw, {self.state_ph: state, self.action_ph: action, self.advantage_ph: adv})\\n\",\n    \"        \\n\",\n    \"        #train the value network\\n\",\n    \"        for _ in range(10):\\n\",\n    \"            self.sess.run(self.train_value_nw, {self.state_ph: state, self.Q: reward})\\n\",\n    \"\\n\",\n    \"    \\n\",\n    \"    #we define a function called build_policy_network for building the policy network. 
Note\\n\",\n    \"    #that our action space is continuous here, so our policy network returns the mean and\\n\",\n    \"    #variance of the action as an output and then we generate a normal distribution using this\\n\",\n    \"    #mean and variance and we select an action by sampling from this normal distribution\\n\",\n    \"\\n\",\n    \"    def build_policy_network(self, name, trainable):\\n\",\n    \"        with tf.variable_scope(name):\\n\",\n    \"            \\n\",\n    \"            #define the layer of the network\\n\",\n    \"            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute mean\\n\",\n    \"            mu = 2 * tf.layers.dense(layer1, action_shape, tf.nn.tanh, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute standard deviation\\n\",\n    \"            sigma = tf.layers.dense(layer1, action_shape, tf.nn.softplus, trainable=trainable)\\n\",\n    \"            \\n\",\n    \"            #compute the normal distribution\\n\",\n    \"            norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)\\n\",\n    \"            \\n\",\n    \"        #get the parameters of the policy network\\n\",\n    \"        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)\\n\",\n    \"        return norm_dist, params\\n\",\n    \"\\n\",\n    \"    #let's define a function called select_action for selecting the action\\n\",\n    \"    def select_action(self, state):\\n\",\n    \"        state = state[np.newaxis, :]\\n\",\n    \"        \\n\",\n    \"        #sample an action from the normal distribution generated by the policy network\\n\",\n    \"        action = self.sess.run(self.sample_op, {self.state_ph: state})[0]\\n\",\n    \"        \\n\",\n    \"        #we clip the action so that they lie within the action bound and then we return the action\\n\",\n    \"        action =  np.clip(action, action_bound[0], 
action_bound[1])\\n\",\n    \"\\n\",\n    \"        return action\\n\",\n    \"\\n\",\n    \"    #we define a function called get_state_value to obtain the value of the state computed by the value network\\n\",\n    \"    def get_state_value(self, state):\\n\",\n    \"        if state.ndim < 2: state = state[np.newaxis, :]\\n\",\n    \"        return self.sess.run(self.v, {self.state_ph: state})[0, 0]\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, let's create an object to our PPO class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ppo = PPO()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of time steps in each episode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_timesteps = 200\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.9\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Set the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 32\"\n   ]\n  
},\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's train\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Episode:0, Return: -1513.1276787297722\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #initialize the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the lists for holding the states, actions, and rewards obtained in the episode\\n\",\n    \"    episode_states, episode_actions, episode_rewards = [], [], []\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #for every step\\n\",\n    \"    for t in range(num_timesteps):   \\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #select the action\\n\",\n    \"        action = ppo.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, _ = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #store the state, action, and reward in the list\\n\",\n    \"        episode_states.append(state)\\n\",\n    \"        episode_actions.append(action)\\n\",\n    \"        episode_rewards.append((reward+8)/8)    \\n\",\n    \"        \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return += reward\\n\",\n    \"        \\n\",\n    \"        #if we reached the batch size or if we reached the final step of the episode\\n\",\n    \"        if (t+1) % batch_size == 0 or t == 
num_timesteps-1:\\n\",\n    \"            \\n\",\n    \"            #compute the value of the next state\\n\",\n    \"            v_s_ = ppo.get_state_value(next_state)\\n\",\n    \"            \\n\",\n    \"            #compute Q value as sum of reward and discounted value of next state\\n\",\n    \"            discounted_r = []\\n\",\n    \"            for reward in episode_rewards[::-1]:\\n\",\n    \"                v_s_ = reward + gamma * v_s_\\n\",\n    \"                discounted_r.append(v_s_)\\n\",\n    \"            discounted_r.reverse()\\n\",\n    \"    \\n\",\n    \"            #stack the episode states, actions, and rewards:\\n\",\n    \"            es, ea, er = np.vstack(episode_states), np.vstack(episode_actions), np.array(discounted_r)[:, np.newaxis]\\n\",\n    \"            \\n\",\n    \"            #empty the lists\\n\",\n    \"            episode_states, episode_actions, episode_rewards = [], [], []\\n\",\n    \"            \\n\",\n    \"            #train the network\\n\",\n    \"            ppo.train(es, ea, er)\\n\",\n    \"        \\n\",\n    \"    #print the return for every 10 episodes\\n\",\n    \"    if i % 10 == 0:\\n\",\n    \"         print(\\\"Episode:{}, Return: {}\\\".format(i, Return))  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we learned how PPO with the clipped objective works and how to implement it, in the next section we will learn another interesting type of PPO algorithm called PPO with\\n\",\n    \"the penalized objective.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   
\"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "13. TRPO, PPO and ACKTR Methods/README.md",
    "content": "# 13. TRPO, PPO and ACKTR Methods\n* 13.1. Trust Region Policy Optimization\n* 13.2. Math Essentials\n   * 13.2.1. Taylor series\n   * 13.2.2. Trust Region method\n   * 13.2.3. Conjugate Gradient Method\n   * 13.2.4. Lagrange Multiplier\n   * 13.2.5. Importance Sampling\n* 13.3. Designing the TRPO Objective Function\n   * 13.3.1. Parameterizing the Policy\n   * 13.3.2. Sample Based Estimation\n* 13.4. Solving the TRPO Objective Function\n   * 13.4.1. Computing the Search Direction\n   * 13.4.2. Perform Line Search in the Search Direction\n* 13.5. Algorithm - TRPO\n* 13.6. Proximal Policy Optimization\n* 13.7. PPO with Clipped Objective\n* 13.8. Algorithm - PPO-Clipped\n* 13.9. Implementing PPO-Clipped Method\n* 13.10. PPO with Penalized Objective\n   * 13.10.1. Algorithm - PPO-Penalty\n* 13.11. Actor Critic using Kronecker Factored Trust Region\n* 13.12. Math Essentials\n   * 13.12.1. Block Matrix\n   * 13.12.2. Block Diagonal Matrix\n   * 13.12.3. Kronecker Product\n   * 13.12.4. Vec Operator\n   * 13.12.5. Properties of Kronecker Product\n* 13.13. Kronecker-Factored Approximate Curvature (K-FAC)\n* 13.14. K-FAC in Actor Critic\n   * 13.14.1. Incorporating Trust Region"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/12.03. Playing Atari games using Categorical DQN-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categorical DQN\\n\",\n    \"\\n\",\n    \"Let's implement the categorical DQN algorithm for playing the Atari games. The code used\\n\",\n    \"in this section is adapted from open-source categorical DQN implementation - \\n\",\n    \"https://github.com/princewen/tensorflow_practice/tree/master/RL/Basic-DisRLDemo provided by Prince Wen. \\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"non-resource variables are not supported in the long term\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"import random\\n\",\n    \"from collections import deque\\n\",\n    \"import math\\n\",\n    \"\\n\",\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from tensorflow.python.framework import ops\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the convolutional layer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def conv(inputs, kernel_shape, 
bias_shape, strides, weights, bias=None, activation=tf.nn.relu):\\n\",\n    \"\\n\",\n    \"    weights = tf.get_variable('weights', shape=kernel_shape, initializer=weights)\\n\",\n    \"    conv = tf.nn.conv2d(inputs, weights, strides=strides, padding='SAME')\\n\",\n    \"    if bias_shape is not None:\\n\",\n    \"        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)\\n\",\n    \"        return activation(conv + biases) if activation is not None else conv + biases\\n\",\n    \"    \\n\",\n    \"    return activation(conv) if activation is not None else conv\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the dense layer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def dense(inputs, units, bias_shape, weights, bias=None, activation=tf.nn.relu):\\n\",\n    \"    \\n\",\n    \"    if not isinstance(inputs, ops.Tensor):\\n\",\n    \"        inputs = ops.convert_to_tensor(inputs, dtype='float')\\n\",\n    \"    if len(inputs.shape) > 2:\\n\",\n    \"        inputs = tf.layers.flatten(inputs)\\n\",\n    \"    flatten_shape = inputs.shape[1]\\n\",\n    \"    weights = tf.get_variable('weights', shape=[flatten_shape, units], initializer=weights)\\n\",\n    \"    dense = tf.matmul(inputs, weights)\\n\",\n    \"    if bias_shape is not None:\\n\",\n    \"        assert bias_shape[0] == units\\n\",\n    \"        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)\\n\",\n    \"        return activation(dense + biases) if activation is not None else dense + biases\\n\",\n    \"    \\n\",\n    \"    return activation(dense) if activation is not None else dense\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\"\n   ]\n  
},\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the $V_{min}$ and $V_{max}$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"v_min = 0\\n\",\n    \"v_max = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the number of atoms (supports) $N$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"atoms = 51\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.99 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 10\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the time step at which we want to update the target network\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"update_target_net = 50 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the epsilon value which is used in the epsilon-greedy policy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"epsilon = 0.5\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the replay buffer\\n\",\n    \"\\n\",\n  
  \"First, let's define the buffer length:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"buffer_length = 20000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the replay buffer as a deque structure:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"replay_buffer = deque(maxlen=buffer_length)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We define a function called sample_transitions which returns the randomly sampled\\n\",\n    \"minibatch of transitions from the replay buffer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def sample_transitions(batch_size):\\n\",\n    \"    batch = np.random.permutation(len(replay_buffer))[:batch_size]\\n\",\n    \"    trans = np.array(replay_buffer)[batch]\\n\",\n    \"    return trans\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the Categorical DQN class\\n\",\n    \"\\n\",\n    \"Let's define the class called Categorical_DQN where we will implement the categorical\\n\",\n    \"DQN algorithm. 
For a clear understanding, you can also check the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class Categorical_DQN():\\n\",\n    \"    \\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self,env):\\n\",\n    \"        \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.InteractiveSession()\\n\",\n    \"        \\n\",\n    \"        #initialize v_min and v_max\\n\",\n    \"        self.v_max = v_max\\n\",\n    \"        self.v_min = v_min\\n\",\n    \"        \\n\",\n    \"        #initialize the number of atoms\\n\",\n    \"        self.atoms = atoms \\n\",\n    \"        \\n\",\n    \"        #initialize the epsilon value\\n\",\n    \"        self.epsilon = epsilon\\n\",\n    \"        \\n\",\n    \"        #get the state shape of the environment\\n\",\n    \"        self.state_shape = env.observation_space.shape\\n\",\n    \"        \\n\",\n    \"        #get the action shape of the environment\\n\",\n    \"        self.action_shape = env.action_space.n\\n\",\n    \"\\n\",\n    \"        #initialize the time step:\\n\",\n    \"        self.time_step = 0\\n\",\n    \"        \\n\",\n    \"        #initialize the target state shape\\n\",\n    \"        target_state_shape = [1]\\n\",\n    \"        target_state_shape.extend(self.state_shape)\\n\",\n    \"\\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state_ph = tf.placeholder(tf.float32,target_state_shape)\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the action\\n\",\n    \"        self.action_ph = tf.placeholder(tf.int32,[1,1])\\n\",\n    \"                                       \\n\",\n    \"        #define the placeholder for the m value (distributed probability of target distribution)\\n\",\n    \"        self.m_ph = 
tf.placeholder(tf.float32,[self.atoms])\\n\",\n    \"    \\n\",\n    \"        #compute delta z\\n\",\n    \"        self.delta_z = (self.v_max - self.v_min) / (self.atoms - 1)\\n\",\n    \"                                       \\n\",\n    \"        #compute the support values\\n\",\n    \"        self.z = [self.v_min + i * self.delta_z for i in range(self.atoms)]\\n\",\n    \"\\n\",\n    \"        self.build_categorical_DQN()\\n\",\n    \"                                       \\n\",\n    \"        #initialize all the TensorFlow variables\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"        \\n\",\n    \"    #let's define a function called build_network for building a deep network. Since we are\\n\",\n    \"    #dealing with the Atari games, we use the convolutional neural network\\n\",\n    \"                                       \\n\",\n    \"    def build_network(self, state, action, name, units_1, units_2, weights, bias, reg=None):\\n\",\n    \"                                       \\n\",\n    \"        #define the first convolutional layer\\n\",\n    \"        with tf.variable_scope('conv1'):\\n\",\n    \"            conv1 = conv(state, [5, 5, 3, 6], [6], [1, 2, 2, 1], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the second convolutional layer\\n\",\n    \"        with tf.variable_scope('conv2'):\\n\",\n    \"            conv2 = conv(conv1, [3, 3, 6, 12], [12], [1, 2, 2, 1], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #flatten the feature maps obtained as a result of the second convolutional layer\\n\",\n    \"        with tf.variable_scope('flatten'):\\n\",\n    \"            flatten = tf.layers.flatten(conv2)\\n\",\n    \"    \\n\",\n    \"        #define the first dense layer\\n\",\n    \"        with tf.variable_scope('dense1'):\\n\",\n    \"            dense1 = dense(flatten, units_1, [units_1], weights, 
bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the second dense layer\\n\",\n    \"        with tf.variable_scope('dense2'):\\n\",\n    \"            dense2 = dense(dense1, units_2, [units_2], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #concatenate the second dense layer with the action\\n\",\n    \"        with tf.variable_scope('concat'):\\n\",\n    \"            concatenated = tf.concat([dense2, tf.cast(action, tf.float32)], 1)\\n\",\n    \"                                       \\n\",\n    \"        #define the third layer and apply the softmax function to the result of the third layer and\\n\",\n    \"        #obtain the probabilities for each of the atoms\\n\",\n    \"        with tf.variable_scope('dense3'):\\n\",\n    \"            dense3 = dense(concatenated, self.atoms, [self.atoms], weights, bias) \\n\",\n    \"        return tf.nn.softmax(dense3)\\n\",\n    \"\\n\",\n    \"    #now, let's define a function called build_categorical_DQNfor building the main and\\n\",\n    \"    #target categorical deep Q networks\\n\",\n    \"                                       \\n\",\n    \"    def build_categorical_DQN(self):      \\n\",\n    \"                                       \\n\",\n    \"        #define the main categorical DQN and obtain the probabilities\\n\",\n    \"        with tf.variable_scope('main_net'):\\n\",\n    \"            name = ['main_net_params',tf.GraphKeys.GLOBAL_VARIABLES]\\n\",\n    \"            weights = tf.random_uniform_initializer(-0.1,0.1)\\n\",\n    \"            bias = tf.constant_initializer(0.1)\\n\",\n    \"\\n\",\n    \"            self.main_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the target categorical DQN and obtain the probabilities\\n\",\n    \"        with tf.variable_scope('target_net'):\\n\",\n    \"            name 
= ['target_net_params',tf.GraphKeys.GLOBAL_VARIABLES]\\n\",\n    \"\\n\",\n    \"            weights = tf.random_uniform_initializer(-0.1,0.1)\\n\",\n    \"            bias = tf.constant_initializer(0.1)\\n\",\n    \"\\n\",\n    \"            self.target_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)\\n\",\n    \"\\n\",\n    \"        #compute the main Q value with probabilities obtained from the main categorical DQN\\n\",\n    \"        self.main_Q = tf.reduce_sum(self.main_p * self.z)\\n\",\n    \"                                    \\n\",\n    \"        #similarly, compute the target Q value with probabilities obtained from the target categorical DQN \\n\",\n    \"        self.target_Q = tf.reduce_sum(self.target_p * self.z)\\n\",\n    \"        \\n\",\n    \"        #define the cross entropy loss\\n\",\n    \"        self.cross_entropy_loss = -tf.reduce_sum(self.m_ph * tf.log(self.main_p))\\n\",\n    \"        \\n\",\n    \"        #define the optimizer and minimize the cross entropy loss using Adam optimizer\\n\",\n    \"        self.optimizer = tf.train.AdamOptimizer(0.01).minimize(self.cross_entropy_loss)\\n\",\n    \"    \\n\",\n    \"        #get the main network parameters\\n\",\n    \"        main_net_params = tf.get_collection(\\\"main_net_params\\\")\\n\",\n    \"        \\n\",\n    \"        #get the target network parameters\\n\",\n    \"        target_net_params = tf.get_collection('target_net_params')\\n\",\n    \"        \\n\",\n    \"        #define the update_target_net operation for updating the target network parameters by\\n\",\n    \"        #copying the parameters of the main network\\n\",\n    \"        self.update_target_net = [tf.assign(t, e) for t, e in zip(target_net_params, main_net_params)]\\n\",\n    \"\\n\",\n    \"    #let's define a function called train to train the network\\n\",\n    \"    def train(self,s,r,action,s_,gamma):\\n\",\n    \"        \\n\",\n    \"        #increment the time 
step\\n\",\n    \"        self.time_step += 1\\n\",\n    \"    \\n\",\n    \"        #get the target Q values\\n\",\n    \"        list_q_ = [self.sess.run(self.target_Q,feed_dict={self.state_ph:[s_],self.action_ph:[[a]]}) for a in range(self.action_shape)]\\n\",\n    \"        \\n\",\n    \"        #select the next state action a dash as the one which has the maximum Q value\\n\",\n    \"        a_ = tf.argmax(list_q_).eval()\\n\",\n    \"        \\n\",\n    \"        #initialize an array m with shape as the number of support with zero values. The denotes\\n\",\n    \"        #the distributed probability of the target distribution after the projection step\\n\",\n    \"\\n\",\n    \"        m = np.zeros(self.atoms)\\n\",\n    \"        \\n\",\n    \"        #get the probability for each atom using the target categorical DQN\\n\",\n    \"        p = self.sess.run(self.target_p,feed_dict = {self.state_ph:[s_],self.action_ph:[[a_]]})[0]\\n\",\n    \"        \\n\",\n    \"        #perform the projection step\\n\",\n    \"        for j in range(self.atoms):\\n\",\n    \"            Tz = min(self.v_max,max(self.v_min,r+gamma * self.z[j]))\\n\",\n    \"            bj = (Tz - self.v_min) / self.delta_z \\n\",\n    \"            l,u = math.floor(bj),math.ceil(bj) \\n\",\n    \"\\n\",\n    \"            pj = p[j]\\n\",\n    \"\\n\",\n    \"            m[int(l)] += pj * (u - bj)\\n\",\n    \"            m[int(u)] += pj * (bj - l)\\n\",\n    \"    \\n\",\n    \"        #train the network by minimizing the loss\\n\",\n    \"        self.sess.run(self.optimizer,feed_dict={self.state_ph:[s] , self.action_ph:[action], self.m_ph: m })\\n\",\n    \"        \\n\",\n    \"        #update the target network parameters by copying the main network parameters\\n\",\n    \"        if self.time_step % update_target_net == 0:\\n\",\n    \"            self.sess.run(self.update_target_net)\\n\",\n    \"    \\n\",\n    \"    #let's define a function called select_action for selecting the 
action. We generate a random number and if the number is less than epsilon we select a random\\n\",\n    \"    #action; else we select the action with the maximum Q value.\\n\",\n    \"    def select_action(self,s):\\n\",\n    \"        if random.random() <= self.epsilon:\\n\",\n    \"            return random.randint(0, self.action_shape - 1)\\n\",\n    \"        else: \\n\",\n    \"            return np.argmax([self.sess.run(self.main_Q,feed_dict={self.state_ph:[s],self.action_ph:[[a]]}) for a in range(self.action_shape)])\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, create the Atari game environment using\\n\",\n    \"Gym. Let's create a Tennis game environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"Tennis-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create an object of our Categorical_DQN class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From <ipython-input-13-51ba5d5455e2>:68: flatten (from tensorflow.python.keras.legacy_tf_layers.core) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n      \"Use keras.layers.Flatten instead.\\n\",\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/tensorflow/python/keras/legacy_tf_layers/core.py:332: Layer.apply (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.\\n\",\n      \"Instructions for updating:\\n\",\n 
     "Please use `layer.__call__` method instead.\n"\n     ]\n    }\n   ],\n   "source": [\n    "agent = Categorical_DQN(env)"\n   ]\n  },\n  {\n   "cell_type": "markdown",\n   "metadata": {},\n   "source": [\n    "Define the number of episodes:"\n   ]\n  },\n  {\n   "cell_type": "code",\n   "execution_count": 16,\n   "metadata": {},\n   "outputs": [],\n   "source": [\n    "num_episodes = 800"\n   ]\n  },\n  {\n   "cell_type": "code",\n   "execution_count": null,\n   "metadata": {},\n   "outputs": [],\n   "source": [\n    "#for each episode\n",\n    "for i in range(num_episodes):\n",\n    "    \n",\n    "    #set done to False\n",\n    "    done = False\n",\n    "    \n",\n    "    #initialize the state by resetting the environment\n",\n    "    state = env.reset()\n",\n    "    \n",\n    "    #initialize the return\n",\n    "    Return = 0\n",\n    "    \n",\n    "    #while the episode is not over\n",\n    "    while not done:\n",\n    "        \n",\n    "        #render the environment\n",\n    "        env.render()\n",\n    "        \n",\n    "        #select an action\n",\n    "        action = agent.select_action(state)\n",\n    "        \n",\n    "        #perform the selected action\n",\n    "        next_state, reward, done, info = env.step(action)\n",\n    "        \n",\n    "        #update the return\n",\n    "        Return = Return + reward\n",\n    "        \n",\n    "        #store the transition information into the replay buffer\n",\n    "        replay_buffer.append([state, reward, [action], next_state])\n",\n    "        \n",\n    "        #if the length of the replay buffer is greater than or equal to the batch size then start training the\n",\n    "        #network by sampling transitions from the replay buffer\n",\n    "        if len(replay_buffer) >= batch_size:\n",\n    "            trans = 
sample_transitions(batch_size)\n",\n    "            for item in trans:\n",\n    "                agent.train(item[0],item[1], item[2], item[3],gamma)\n",\n    "                \n",\n    "        #update the state to the next state\n",\n    "        state = next_state\n",\n    "    \n",\n    "    #print the return obtained in the episode\n",\n    "    print(\"Episode:{}, Return: {}\".format(i,Return))\n",\n    "    "\n   ]\n  },\n  {\n   "cell_type": "markdown",\n   "metadata": {},\n   "source": [\n    "Now that we have learned how the categorical DQN works and how to implement it, in the next\n",\n    "section, we will learn another interesting algorithm."\n   ]\n  }\n ],\n "metadata": {\n  "kernelspec": {\n   "display_name": "Python 3",\n   "language": "python",\n   "name": "python3"\n  },\n  "language_info": {\n   "codemirror_mode": {\n    "name": "ipython",\n    "version": 3\n   },\n   "file_extension": ".py",\n   "mimetype": "text/x-python",\n   "name": "python",\n   "nbconvert_exporter": "python",\n   "pygments_lexer": "ipython3",\n   "version": "3.6.9"\n  }\n },\n "nbformat": 4,\n "nbformat_minor": 2\n}\n"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/14.03. Playing Atari games using Categorical DQN-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categorical DQN\\n\",\n    \"\\n\",\n    \"Let's implement the categorical DQN algorithm for playing Atari games. The code used\\n\",\n    \"in this section is adapted from the open-source categorical DQN implementation at\\n\",\n    \"https://github.com/princewen/tensorflow_practice/tree/master/RL/Basic-DisRLDemo provided by Prince Wen.\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"import random\\n\",\n    \"from collections import deque\\n\",\n    \"import math\\n\",\n    \"\\n\",\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from tensorflow.python.framework import ops\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the convolutional layer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def conv(inputs, kernel_shape, bias_shape, strides, weights, bias=None, activation=tf.nn.relu):\\n\",\n    \"\\n\",\n    \"    weights = tf.get_variable('weights', shape=kernel_shape, 
initializer=weights)\\n\",\n    \"    conv = tf.nn.conv2d(inputs, weights, strides=strides, padding='SAME')\\n\",\n    \"    if bias_shape is not None:\\n\",\n    \"        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)\\n\",\n    \"        return activation(conv + biases) if activation is not None else conv + biases\\n\",\n    \"    \\n\",\n    \"    return activation(conv) if activation is not None else conv\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the dense layer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def dense(inputs, units, bias_shape, weights, bias=None, activation=tf.nn.relu):\\n\",\n    \"    \\n\",\n    \"    if not isinstance(inputs, ops.Tensor):\\n\",\n    \"        inputs = ops.convert_to_tensor(inputs, dtype='float')\\n\",\n    \"    if len(inputs.shape) > 2:\\n\",\n    \"        inputs = tf.layers.flatten(inputs)\\n\",\n    \"    flatten_shape = inputs.shape[1]\\n\",\n    \"    weights = tf.get_variable('weights', shape=[flatten_shape, units], initializer=weights)\\n\",\n    \"    dense = tf.matmul(inputs, weights)\\n\",\n    \"    if bias_shape is not None:\\n\",\n    \"        assert bias_shape[0] == units\\n\",\n    \"        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)\\n\",\n    \"        return activation(dense + biases) if activation is not None else dense + biases\\n\",\n    \"    \\n\",\n    \"    return activation(dense) if activation is not None else dense\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the $V_{min}$ and $V_{max}$\"\n   ]\n  },\n  {\n   
\"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"v_min = 0\\n\",\n    \"v_max = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the number of atoms (supports) $N$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"atoms = 51\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.99 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 10\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the time step at which we want to update the target network\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"update_target_net = 50 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the epsilon value which is used in the epsilon-greedy policy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"epsilon = 0.5\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the replay buffer\\n\",\n    \"\\n\",\n    \"First, let's define the buffer length:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   
\"outputs\": [],\n   \"source\": [\n    \"buffer_length = 20000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the replay buffer as a deque structure:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"replay_buffer = deque(maxlen=buffer_length)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We define a function called sample_transitions which returns the randomly sampled\\n\",\n    \"minibatch of transitions from the replay buffer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def sample_transitions(batch_size):\\n\",\n    \"    batch = np.random.permutation(len(replay_buffer))[:batch_size]\\n\",\n    \"    trans = np.array(replay_buffer)[batch]\\n\",\n    \"    return trans\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the Categorical DQN class\\n\",\n    \"\\n\",\n    \"Let's define the class called `Categorical_DQN` where we will implement the categorical\\n\",\n    \"DQN algorithm. 
For a clearer understanding, you can also check the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class Categorical_DQN():\\n\",\n    \"    \\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self,env):\\n\",\n    \"        \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.InteractiveSession()\\n\",\n    \"        \\n\",\n    \"        #initialize v_min and v_max\\n\",\n    \"        self.v_max = v_max\\n\",\n    \"        self.v_min = v_min\\n\",\n    \"        \\n\",\n    \"        #initialize the number of atoms\\n\",\n    \"        self.atoms = atoms \\n\",\n    \"        \\n\",\n    \"        #initialize the epsilon value\\n\",\n    \"        self.epsilon = epsilon\\n\",\n    \"        \\n\",\n    \"        #get the state shape of the environment\\n\",\n    \"        self.state_shape = env.observation_space.shape\\n\",\n    \"        \\n\",\n    \"        #get the action shape of the environment\\n\",\n    \"        self.action_shape = env.action_space.n\\n\",\n    \"\\n\",\n    \"        #initialize the time step:\\n\",\n    \"        self.time_step = 0\\n\",\n    \"        \\n\",\n    \"        #initialize the target state shape\\n\",\n    \"        target_state_shape = [1]\\n\",\n    \"        target_state_shape.extend(self.state_shape)\\n\",\n    \"\\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state_ph = tf.placeholder(tf.float32,target_state_shape)\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the action\\n\",\n    \"        self.action_ph = tf.placeholder(tf.int32,[1,1])\\n\",\n    \"                                       \\n\",\n    \"        #define the placeholder for the m value (distributed probability of target distribution)\\n\",\n    \"        self.m_ph = 
tf.placeholder(tf.float32,[self.atoms])\\n\",\n    \"    \\n\",\n    \"        #compute delta z\\n\",\n    \"        self.delta_z = (self.v_max - self.v_min) / (self.atoms - 1)\\n\",\n    \"                                       \\n\",\n    \"        #compute the support values\\n\",\n    \"        self.z = [self.v_min + i * self.delta_z for i in range(self.atoms)]\\n\",\n    \"\\n\",\n    \"        self.build_categorical_DQN()\\n\",\n    \"                                       \\n\",\n    \"        #initialize all the TensorFlow variables\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"        \\n\",\n    \"    #let's define a function called build_network for building a deep network. Since we are\\n\",\n    \"    #dealing with the Atari games, we use the convolutional neural network\\n\",\n    \"                                       \\n\",\n    \"    def build_network(self, state, action, name, units_1, units_2, weights, bias, reg=None):\\n\",\n    \"                                       \\n\",\n    \"        #define the first convolutional layer\\n\",\n    \"        with tf.variable_scope('conv1'):\\n\",\n    \"            conv1 = conv(state, [5, 5, 3, 6], [6], [1, 2, 2, 1], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the second convolutional layer\\n\",\n    \"        with tf.variable_scope('conv2'):\\n\",\n    \"            conv2 = conv(conv1, [3, 3, 6, 12], [12], [1, 2, 2, 1], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #flatten the feature maps obtained as a result of the second convolutional layer\\n\",\n    \"        with tf.variable_scope('flatten'):\\n\",\n    \"            flatten = tf.layers.flatten(conv2)\\n\",\n    \"    \\n\",\n    \"        #define the first dense layer\\n\",\n    \"        with tf.variable_scope('dense1'):\\n\",\n    \"            dense1 = dense(flatten, units_1, [units_1], weights, 
bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the second dense layer\\n\",\n    \"        with tf.variable_scope('dense2'):\\n\",\n    \"            dense2 = dense(dense1, units_2, [units_2], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #concatenate the second dense layer with the action\\n\",\n    \"        with tf.variable_scope('concat'):\\n\",\n    \"            concatenated = tf.concat([dense2, tf.cast(action, tf.float32)], 1)\\n\",\n    \"                                       \\n\",\n    \"        #define the third layer and apply the softmax function to the result of the third layer and\\n\",\n    \"        #obtain the probabilities for each of the atoms\\n\",\n    \"        with tf.variable_scope('dense3'):\\n\",\n    \"            dense3 = dense(concatenated, self.atoms, [self.atoms], weights, bias) \\n\",\n    \"        return tf.nn.softmax(dense3)\\n\",\n    \"\\n\",\n    \"    #now, let's define a function called build_categorical_DQN for building the main and\\n\",\n    \"    #target categorical deep Q networks\\n\",\n    \"                                       \\n\",\n    \"    def build_categorical_DQN(self):      \\n\",\n    \"                                       \\n\",\n    \"        #define the main categorical DQN and obtain the probabilities\\n\",\n    \"        with tf.variable_scope('main_net'):\\n\",\n    \"            name = ['main_net_params',tf.GraphKeys.GLOBAL_VARIABLES]\\n\",\n    \"            weights = tf.random_uniform_initializer(-0.1,0.1)\\n\",\n    \"            bias = tf.constant_initializer(0.1)\\n\",\n    \"\\n\",\n    \"            self.main_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the target categorical DQN and obtain the probabilities\\n\",\n    \"        with tf.variable_scope('target_net'):\\n\",\n    \"            name 
= ['target_net_params',tf.GraphKeys.GLOBAL_VARIABLES]\\n\",\n    \"\\n\",\n    \"            weights = tf.random_uniform_initializer(-0.1,0.1)\\n\",\n    \"            bias = tf.constant_initializer(0.1)\\n\",\n    \"\\n\",\n    \"            self.target_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)\\n\",\n    \"\\n\",\n    \"        #compute the main Q value with probabilities obtained from the main categorical DQN\\n\",\n    \"        self.main_Q = tf.reduce_sum(self.main_p * self.z)\\n\",\n    \"                                    \\n\",\n    \"        #similarly, compute the target Q value with probabilities obtained from the target categorical DQN \\n\",\n    \"        self.target_Q = tf.reduce_sum(self.target_p * self.z)\\n\",\n    \"        \\n\",\n    \"        #define the cross entropy loss\\n\",\n    \"        self.cross_entropy_loss = -tf.reduce_sum(self.m_ph * tf.log(self.main_p))\\n\",\n    \"        \\n\",\n    \"        #define the optimizer and minimize the cross entropy loss using Adam optimizer\\n\",\n    \"        self.optimizer = tf.train.AdamOptimizer(0.01).minimize(self.cross_entropy_loss)\\n\",\n    \"    \\n\",\n    \"        #get the main network parameters\\n\",\n    \"        main_net_params = tf.get_collection(\\\"main_net_params\\\")\\n\",\n    \"        \\n\",\n    \"        #get the target network parameters\\n\",\n    \"        target_net_params = tf.get_collection('target_net_params')\\n\",\n    \"        \\n\",\n    \"        #define the update_target_net operation for updating the target network parameters by\\n\",\n    \"        #copying the parameters of the main network\\n\",\n    \"        self.update_target_net = [tf.assign(t, e) for t, e in zip(target_net_params, main_net_params)]\\n\",\n    \"\\n\",\n    \"    #let's define a function called train to train the network\\n\",\n    \"    def train(self,s,r,action,s_,gamma):\\n\",\n    \"        \\n\",\n    \"        #increment the time 
step\\n\",\n    \"        self.time_step += 1\\n\",\n    \"    \\n\",\n    \"        #get the target Q values\\n\",\n    \"        list_q_ = [self.sess.run(self.target_Q,feed_dict={self.state_ph:[s_],self.action_ph:[[a]]}) for a in range(self.action_shape)]\\n\",\n    \"        \\n\",\n    \"        #select the next state action a' as the one that has the maximum Q value\\n\",\n    \"        a_ = tf.argmax(list_q_).eval()\\n\",\n    \"        \\n\",\n    \"        #initialize an array m of length equal to the number of atoms (supports) with zero values. This denotes\\n\",\n    \"        #the distributed probability of the target distribution after the projection step\\n\",\n    \"\\n\",\n    \"        m = np.zeros(self.atoms)\\n\",\n    \"        \\n\",\n    \"        #get the probability for each atom using the target categorical DQN\\n\",\n    \"        p = self.sess.run(self.target_p,feed_dict = {self.state_ph:[s_],self.action_ph:[[a_]]})[0]\\n\",\n    \"        \\n\",\n    \"        #perform the projection step\\n\",\n    \"        for j in range(self.atoms):\\n\",\n    \"            Tz = min(self.v_max,max(self.v_min,r+gamma * self.z[j]))\\n\",\n    \"            bj = (Tz - self.v_min) / self.delta_z \\n\",\n    \"            l,u = math.floor(bj),math.ceil(bj) \\n\",\n    \"\\n\",\n    \"            pj = p[j]\\n\",\n    \"\\n\",\n    \"            #when bj is a whole number, l equals u, so assign the entire probability mass to that atom;\\n\",\n    \"            #otherwise, distribute it between the neighboring atoms l and u\\n\",\n    \"            if l == u:\\n\",\n    \"                m[int(l)] += pj\\n\",\n    \"            else:\\n\",\n    \"                m[int(l)] += pj * (u - bj)\\n\",\n    \"                m[int(u)] += pj * (bj - l)\\n\",\n    \"    \\n\",\n    \"        #train the network by minimizing the loss\\n\",\n    \"        self.sess.run(self.optimizer,feed_dict={self.state_ph:[s] , self.action_ph:[action], self.m_ph: m })\\n\",\n    \"        \\n\",\n    \"        #update the target network parameters by copying the main network parameters\\n\",\n    \"        if self.time_step % update_target_net == 0:\\n\",\n    \"            self.sess.run(self.update_target_net)\\n\",\n    \"    \\n\",\n    \"    #let's define a function called select_action for selecting the 
action. We generate a random number and if the number is less than epsilon, we select a random\\n\",\n    \"    #action; otherwise, we select the action with the maximum Q value.\\n\",\n    \"    def select_action(self,s):\\n\",\n    \"        if random.random() <= self.epsilon:\\n\",\n    \"            return random.randint(0, self.action_shape - 1)\\n\",\n    \"        else: \\n\",\n    \"            return np.argmax([self.sess.run(self.main_Q,feed_dict={self.state_ph:[s],self.action_ph:[[a]]}) for a in range(self.action_shape)])\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, create the Atari game environment using\\n\",\n    \"Gym. Let's create a Tennis game environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"Tennis-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create an object of our `Categorical_DQN` class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"agent = Categorical_DQN(env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 800\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #set done to False\\n\",\n    \"    done = False\\n\",\n    \"    \\n\",\n    \"    #initialize 
the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #while the episode is not over\\n\",\n    \"    while not done:\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #select an action\\n\",\n    \"        action = agent.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return = Return + reward\\n\",\n    \"        \\n\",\n    \"        #store the transition information into the replay buffer\\n\",\n    \"        replay_buffer.append([state, reward, [action], next_state])\\n\",\n    \"        \\n\",\n    \"        #if the length of the replay buffer is greater than or equal to the batch size then start training the\\n\",\n    \"        #network by sampling transitions from the replay buffer\\n\",\n    \"        if len(replay_buffer) >= batch_size:\\n\",\n    \"            trans = sample_transitions(batch_size)\\n\",\n    \"            for item in trans:\\n\",\n    \"                agent.train(item[0],item[1], item[2], item[3],gamma)\\n\",\n    \"                \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"    \\n\",\n    \"    #print the return obtained in the episode\\n\",\n    \"    print(\\\"Episode:{}, Return: {}\\\".format(i,Return))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have learned how the categorical DQN works and how to implement it, in the next\\n\",\n    \"section, we will learn another interesting algorithm.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   
\"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "14. Distributional Reinforcement Learning/.ipynb_checkpoints/Playing Atari games using Categorical DQN-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categorical DQN\\n\",\n    \"\\n\",\n    \"Let's implement the categorical DQN algorithm for playing Atari games. The code used\\n\",\n    \"in this section is adapted from the open-source categorical DQN implementation at\\n\",\n    \"https://github.com/princewen/tensorflow_practice/tree/master/RL/Basic-DisRLDemo provided by Prince Wen.\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import tensorflow as tf\\n\",\n    \"import numpy as np\\n\",\n    \"import random\\n\",\n    \"from collections import deque\\n\",\n    \"import math\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from tensorflow.python.framework import ops\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the convolutional layer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def conv(inputs, kernel_shape, bias_shape, strides, weights, bias=None, activation=tf.nn.relu):\\n\",\n    \"\\n\",\n    \"    weights = tf.get_variable('weights', shape=kernel_shape, initializer=weights)\\n\",\n    \"    conv = tf.nn.conv2d(inputs, weights, strides=strides, padding='SAME')\\n\",\n    \"    if bias_shape is not None:\\n\",\n    \"        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)\\n\",\n    \"        return activation(conv + biases) if activation is not None else conv + biases\\n\",\n    \"    \\n\",\n    \"    return activation(conv) if 
activation is not None else conv\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the dense layer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def dense(inputs, units, bias_shape, weights, bias=None, activation=tf.nn.relu):\\n\",\n    \"    \\n\",\n    \"    if not isinstance(inputs, ops.Tensor):\\n\",\n    \"        inputs = ops.convert_to_tensor(inputs, dtype='float')\\n\",\n    \"    if len(inputs.shape) > 2:\\n\",\n    \"        inputs = tf.contrib.layers.flatten(inputs)\\n\",\n    \"    flatten_shape = inputs.shape[1]\\n\",\n    \"    weights = tf.get_variable('weights', shape=[flatten_shape, units], initializer=weights)\\n\",\n    \"    dense = tf.matmul(inputs, weights)\\n\",\n    \"    if bias_shape is not None:\\n\",\n    \"        assert bias_shape[0] == units\\n\",\n    \"        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)\\n\",\n    \"        return activation(dense + biases) if activation is not None else dense + biases\\n\",\n    \"    \\n\",\n    \"    return activation(dense) if activation is not None else dense\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the $V_{min}$ and $V_{max}$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"v_min = 0\\n\",\n    \"v_max = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the number of atoms (supports) $N$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": 
{},\n   \"outputs\": [],\n   \"source\": [\n    \"atoms = 51\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.99 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 10\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the time step at which we want to update the target network\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"update_target_net = 50 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the epsilon value which is used in the epsilon-greedy policy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"epsilon = 0.5\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the replay buffer\\n\",\n    \"\\n\",\n    \"First, let's define the buffer length:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"buffer_length = 20000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the replay buffer as a deque structure:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"replay_buffer = 
deque(maxlen=buffer_length)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We define a function called sample_transitions which returns the randomly sampled\\n\",\n    \"minibatch of transitions from the replay buffer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def sample_transitions(batch_size):\\n\",\n    \"    batch = np.random.permutation(len(replay_buffer))[:batch_size]\\n\",\n    \"    trans = np.array(replay_buffer)[batch]\\n\",\n    \"    return trans\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the Categorical DQN class\\n\",\n    \"\\n\",\n    \"Let's define the class called Categorical_DQN where we will implement the categorical\\n\",\n    \"DQN algorithm. For a clearer understanding, you can also check the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class Categorical_DQN():\\n\",\n    \"    \\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self,env):\\n\",\n    \"        \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.InteractiveSession()\\n\",\n    \"        \\n\",\n    \"        #initialize v_min and v_max\\n\",\n    \"        self.v_max = v_max\\n\",\n    \"        self.v_min = v_min\\n\",\n    \"        \\n\",\n    \"        #initialize the number of atoms\\n\",\n    \"        self.atoms = atoms \\n\",\n    \"        \\n\",\n    \"        #initialize the epsilon value\\n\",\n    \"        self.epsilon = epsilon\\n\",\n    \"        \\n\",\n    \"        #get the state shape of the environment\\n\",\n    \"        self.state_shape = env.observation_space.shape\\n\",\n    \"        \\n\",\n    \"        #get the action 
shape of the environment\\n\",\n    \"        self.action_shape = env.action_space.n\\n\",\n    \"\\n\",\n    \"        #initialize the time step:\\n\",\n    \"        self.time_step = 0\\n\",\n    \"        \\n\",\n    \"        #initialize the target state shape\\n\",\n    \"        target_state_shape = [1]\\n\",\n    \"        target_state_shape.extend(self.state_shape)\\n\",\n    \"\\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state_ph = tf.placeholder(tf.float32,target_state_shape)\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the action\\n\",\n    \"        self.action_ph = tf.placeholder(tf.int32,[1,1])\\n\",\n    \"                                       \\n\",\n    \"        #define the placeholder for the m value (distributed probability of target distribution)\\n\",\n    \"        self.m_ph = tf.placeholder(tf.float32,[self.atoms])\\n\",\n    \"    \\n\",\n    \"        #compute delta z\\n\",\n    \"        self.delta_z = (self.v_max - self.v_min) / (self.atoms - 1)\\n\",\n    \"                                       \\n\",\n    \"        #compute the support values\\n\",\n    \"        self.z = [self.v_min + i * self.delta_z for i in range(self.atoms)]\\n\",\n    \"\\n\",\n    \"        self.build_categorical_DQN()\\n\",\n    \"                                       \\n\",\n    \"        #initialize all the TensorFlow variables\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"        \\n\",\n    \"    #let's define a function called build_network for building a deep network. 
Since we are\\n\",\n    \"    #dealing with Atari games, we use a convolutional neural network\\n\",\n    \"                                       \\n\",\n    \"    def build_network(self, state, action, name, units_1, units_2, weights, bias, reg=None):\\n\",\n    \"                                       \\n\",\n    \"        #define the first convolutional layer\\n\",\n    \"        with tf.variable_scope('conv1'):\\n\",\n    \"            conv1 = conv(state, [5, 5, 3, 6], [6], [1, 2, 2, 1], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the second convolutional layer\\n\",\n    \"        with tf.variable_scope('conv2'):\\n\",\n    \"            conv2 = conv(conv1, [3, 3, 6, 12], [12], [1, 2, 2, 1], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #flatten the feature maps obtained as a result of the second convolutional layer\\n\",\n    \"        with tf.variable_scope('flatten'):\\n\",\n    \"            flatten = tf.contrib.layers.flatten(conv2)\\n\",\n    \"    \\n\",\n    \"        #define the first dense layer\\n\",\n    \"        with tf.variable_scope('dense1'):\\n\",\n    \"            dense1 = dense(flatten, units_1, [units_1], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the second dense layer\\n\",\n    \"        with tf.variable_scope('dense2'):\\n\",\n    \"            dense2 = dense(dense1, units_2, [units_2], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #concatenate the second dense layer with the action\\n\",\n    \"        with tf.variable_scope('concat'):\\n\",\n    \"            concatenated = tf.concat([dense2, tf.cast(action, tf.float32)], 1)\\n\",\n    \"                                       \\n\",\n    \"        #define the third layer and apply the softmax function to the result of the third layer and\\n\",\n    \"        #obtain the probabilities 
for each of the atoms\\n\",\n    \"        with tf.variable_scope('dense3'):\\n\",\n    \"            dense3 = dense(concatenated, self.atoms, [self.atoms], weights, bias) \\n\",\n    \"        return tf.nn.softmax(dense3)\\n\",\n    \"\\n\",\n    \"    #now, let's define a function called build_categorical_DQN for building the main and\\n\",\n    \"    #target categorical deep Q networks\\n\",\n    \"                                       \\n\",\n    \"    def build_categorical_DQN(self):      \\n\",\n    \"                                       \\n\",\n    \"        #define the main categorical DQN and obtain the probabilities\\n\",\n    \"        with tf.variable_scope('main_net'):\\n\",\n    \"            name = ['main_net_params',tf.GraphKeys.GLOBAL_VARIABLES]\\n\",\n    \"            weights = tf.random_uniform_initializer(-0.1,0.1)\\n\",\n    \"            bias = tf.constant_initializer(0.1)\\n\",\n    \"\\n\",\n    \"            self.main_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the target categorical DQN and obtain the probabilities\\n\",\n    \"        with tf.variable_scope('target_net'):\\n\",\n    \"            name = ['target_net_params',tf.GraphKeys.GLOBAL_VARIABLES]\\n\",\n    \"\\n\",\n    \"            weights = tf.random_uniform_initializer(-0.1,0.1)\\n\",\n    \"            bias = tf.constant_initializer(0.1)\\n\",\n    \"\\n\",\n    \"            self.target_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)\\n\",\n    \"\\n\",\n    \"        #compute the main Q value with probabilities obtained from the main categorical DQN\\n\",\n    \"        self.main_Q = tf.reduce_sum(self.main_p * self.z)\\n\",\n    \"                                    \\n\",\n    \"        #similarly, compute the target Q value with probabilities obtained from the target categorical DQN \\n\",\n    \"        self.target_Q = 
tf.reduce_sum(self.target_p * self.z)\\n\",\n    \"        \\n\",\n    \"        #define the cross entropy loss\\n\",\n    \"        self.cross_entropy_loss = -tf.reduce_sum(self.m_ph * tf.log(self.main_p))\\n\",\n    \"        \\n\",\n    \"        #define the optimizer and minimize the cross entropy loss using Adam optimizer\\n\",\n    \"        self.optimizer = tf.train.AdamOptimizer(0.01).minimize(self.cross_entropy_loss)\\n\",\n    \"    \\n\",\n    \"        #get the main network parameters\\n\",\n    \"        main_net_params = tf.get_collection(\\\"main_net_params\\\")\\n\",\n    \"        \\n\",\n    \"        #get the target network parameters\\n\",\n    \"        target_net_params = tf.get_collection('target_net_params')\\n\",\n    \"        \\n\",\n    \"        #define the update_target_net operation for updating the target network parameters by\\n\",\n    \"        #copying the parameters of the main network\\n\",\n    \"        self.update_target_net = [tf.assign(t, e) for t, e in zip(target_net_params, main_net_params)]\\n\",\n    \"\\n\",\n    \"    #let's define a function called train to train the network\\n\",\n    \"    def train(self,s,r,action,s_,gamma):\\n\",\n    \"        \\n\",\n    \"        #increment the time step\\n\",\n    \"        self.time_step += 1\\n\",\n    \"    \\n\",\n    \"        #get the target Q values\\n\",\n    \"        list_q_ = [self.sess.run(self.target_Q,feed_dict={self.state_ph:[s_],self.action_ph:[[a]]}) for a in range(self.action_shape)]\\n\",\n    \"        \\n\",\n    \"        #select the next state action a dash as the one which has the maximum Q value\\n\",\n    \"        a_ = tf.argmax(list_q_).eval()\\n\",\n    \"        \\n\",\n    \"        #initialize an array m with shape as the number of support with zero values. 
This denotes\\n\",\n    \"        #the distributed probability of the target distribution after the projection step\\n\",\n    \"\\n\",\n    \"        m = np.zeros(self.atoms)\\n\",\n    \"        \\n\",\n    \"        #get the probability for each atom using the target categorical DQN\\n\",\n    \"        p = self.sess.run(self.target_p,feed_dict = {self.state_ph:[s_],self.action_ph:[[a_]]})[0]\\n\",\n    \"        \\n\",\n    \"        #perform the projection step\\n\",\n    \"        for j in range(self.atoms):\\n\",\n    \"            Tz = min(self.v_max,max(self.v_min,r+gamma * self.z[j]))\\n\",\n    \"            bj = (Tz - self.v_min) / self.delta_z \\n\",\n    \"            l,u = math.floor(bj),math.ceil(bj) \\n\",\n    \"\\n\",\n    \"            pj = p[j]\\n\",\n    \"\\n\",\n    \"            m[int(l)] += pj * (u - bj)\\n\",\n    \"            m[int(u)] += pj * (bj - l)\\n\",\n    \"    \\n\",\n    \"        #train the network by minimizing the loss\\n\",\n    \"        self.sess.run(self.optimizer,feed_dict={self.state_ph:[s] , self.action_ph:[action], self.m_ph: m })\\n\",\n    \"        \\n\",\n    \"        #update the target network parameters by copying the main network parameters\\n\",\n    \"        if self.time_step % update_target_net == 0:\\n\",\n    \"            self.sess.run(self.update_target_net)\\n\",\n    \"    \\n\",\n    \"    #let's define a function called select_action for selecting the action. 
We generate a random number, and if it is less than epsilon, we select a random\\n\",\n    \"    #action; otherwise, we select the action that has the maximum Q value.\\n\",\n    \"    def select_action(self,s):\\n\",\n    \"        if random.random() <= self.epsilon:\\n\",\n    \"            return random.randint(0, self.action_shape - 1)\\n\",\n    \"        else: \\n\",\n    \"            return np.argmax([self.sess.run(self.main_Q,feed_dict={self.state_ph:[s],self.action_ph:[[a]]}) for a in range(self.action_shape)])\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, create the Atari game environment using\\n\",\n    \"Gym. Let's create a Tennis game environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"Tennis-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create an object of our Categorical_DQN class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"agent = Categorical_DQN(env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 800\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #set done to False\\n\",\n    \"    done = False\\n\",\n    \"    \\n\",\n    \"    #initialize the state 
by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #while the episode is not over\\n\",\n    \"    while not done:\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #select an action\\n\",\n    \"        action = agent.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return = Return + reward\\n\",\n    \"        \\n\",\n    \"        #store the transition information into the replay buffer\\n\",\n    \"        replay_buffer.append([state, reward, [action], next_state])\\n\",\n    \"        \\n\",\n    \"        #if the length of the replay buffer is greater than or equal to the batch size, then start training the\\n\",\n    \"        #network by sampling transitions from the replay buffer\\n\",\n    \"        if len(replay_buffer) >= batch_size:\\n\",\n    \"            trans = sample_transitions(batch_size)\\n\",\n    \"            for item in trans:\\n\",\n    \"                agent.train(item[0],item[1], item[2], item[3],gamma)\\n\",\n    \"                \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"    \\n\",\n    \"    #print the return obtained in the episode\\n\",\n    \"    print(\\\"Episode:{}, Return: {}\\\".format(i,Return))\\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have learned how the categorical DQN works and how to implement it, in the next\\n\",\n    \"section, we will learn about another interesting algorithm.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 
3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "14. Distributional Reinforcement Learning/14.03. Playing Atari games using Categorical DQN.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games using Categorical DQN\\n\",\n    \"\\n\",\n    \"Let's implement the categorical DQN algorithm for playing the Atari games. The code used\\n\",\n    \"in this section is adapted from open-source categorical DQN implementation - \\n\",\n    \"https://github.com/princewen/tensorflow_practice/tree/master/RL/Basic-DisRLDemo provided by Prince Wen. \\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"2.0.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import tensorflow as tf\\n\",\n    \"print(tf.__version__)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"import random\\n\",\n    \"from collections import deque\\n\",\n    \"import math\\n\",\n    \"\\n\",\n    \"import tensorflow.compat.v1 as tf\\n\",\n    \"tf.disable_v2_behavior()\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from tensorflow.python.framework import ops\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the convolutional layer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def conv(inputs, kernel_shape, bias_shape, strides, weights, bias=None, activation=tf.nn.relu):\\n\",\n    \"\\n\",\n    \"    weights = tf.get_variable('weights', shape=kernel_shape, 
initializer=weights)\\n\",\n    \"    conv = tf.nn.conv2d(inputs, weights, strides=strides, padding='SAME')\\n\",\n    \"    if bias_shape is not None:\\n\",\n    \"        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)\\n\",\n    \"        return activation(conv + biases) if activation is not None else conv + biases\\n\",\n    \"    \\n\",\n    \"    return activation(conv) if activation is not None else conv\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the dense layer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def dense(inputs, units, bias_shape, weights, bias=None, activation=tf.nn.relu):\\n\",\n    \"    \\n\",\n    \"    if not isinstance(inputs, ops.Tensor):\\n\",\n    \"        inputs = ops.convert_to_tensor(inputs, dtype='float')\\n\",\n    \"    if len(inputs.shape) > 2:\\n\",\n    \"        inputs = tf.layers.flatten(inputs)\\n\",\n    \"    flatten_shape = inputs.shape[1]\\n\",\n    \"    weights = tf.get_variable('weights', shape=[flatten_shape, units], initializer=weights)\\n\",\n    \"    dense = tf.matmul(inputs, weights)\\n\",\n    \"    if bias_shape is not None:\\n\",\n    \"        assert bias_shape[0] == units\\n\",\n    \"        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)\\n\",\n    \"        return activation(dense + biases) if activation is not None else dense + biases\\n\",\n    \"    \\n\",\n    \"    return activation(dense) if activation is not None else dense\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the variables\\n\",\n    \"\\n\",\n    \"Now, let's define some of the important variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the $V_{min}$ and $V_{max}$\"\n   ]\n  },\n  {\n   
\"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"v_min = 0\\n\",\n    \"v_max = 1000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the number of atoms (supports) $N$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"atoms = 51\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the discount factor, $\\\\gamma$:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gamma = 0.99 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the batch size:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"batch_size = 10\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the time step at which we want to update the target network\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"update_target_net = 50 \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the epsilon value which is used in the epsilon-greedy policy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"epsilon = 0.5\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the replay buffer\\n\",\n    \"\\n\",\n    \"First, let's define the buffer length:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   
\"outputs\": [],\n   \"source\": [\n    \"buffer_length = 20000\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the replay buffer as a deque structure:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"replay_buffer = deque(maxlen=buffer_length)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We define a function called sample_transitions which returns the randomly sampled\\n\",\n    \"minibatch of transitions from the replay buffer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def sample_transitions(batch_size):\\n\",\n    \"    batch = np.random.permutation(len(replay_buffer))[:batch_size]\\n\",\n    \"    trans = np.array(replay_buffer)[batch]\\n\",\n    \"    return trans\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining the Categorical DQN class\\n\",\n    \"\\n\",\n    \"Let's define the class called `Categorical_DQN` where we will implement the categorical\\n\",\n    \"DQN algorithm. 
For a clear understanding, you can also check the detailed explanation of the code in the book.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class Categorical_DQN():\\n\",\n    \"    \\n\",\n    \"    #first, let's define the init method\\n\",\n    \"    def __init__(self,env):\\n\",\n    \"        \\n\",\n    \"        #start the TensorFlow session\\n\",\n    \"        self.sess = tf.InteractiveSession()\\n\",\n    \"        \\n\",\n    \"        #initialize v_min and v_max\\n\",\n    \"        self.v_max = v_max\\n\",\n    \"        self.v_min = v_min\\n\",\n    \"        \\n\",\n    \"        #initialize the number of atoms\\n\",\n    \"        self.atoms = atoms \\n\",\n    \"        \\n\",\n    \"        #initialize the epsilon value\\n\",\n    \"        self.epsilon = epsilon\\n\",\n    \"        \\n\",\n    \"        #get the state shape of the environment\\n\",\n    \"        self.state_shape = env.observation_space.shape\\n\",\n    \"        \\n\",\n    \"        #get the action shape of the environment\\n\",\n    \"        self.action_shape = env.action_space.n\\n\",\n    \"\\n\",\n    \"        #initialize the time step\\n\",\n    \"        self.time_step = 0\\n\",\n    \"        \\n\",\n    \"        #initialize the target state shape\\n\",\n    \"        target_state_shape = [1]\\n\",\n    \"        target_state_shape.extend(self.state_shape)\\n\",\n    \"\\n\",\n    \"        #define the placeholder for the state\\n\",\n    \"        self.state_ph = tf.placeholder(tf.float32,target_state_shape)\\n\",\n    \"        \\n\",\n    \"        #define the placeholder for the action\\n\",\n    \"        self.action_ph = tf.placeholder(tf.int32,[1,1])\\n\",\n    \"                                       \\n\",\n    \"        #define the placeholder for the m value (distributed probability of the target distribution)\\n\",\n    \"        self.m_ph = 
tf.placeholder(tf.float32,[self.atoms])\\n\",\n    \"    \\n\",\n    \"        #compute delta z\\n\",\n    \"        self.delta_z = (self.v_max - self.v_min) / (self.atoms - 1)\\n\",\n    \"                                       \\n\",\n    \"        #compute the support values\\n\",\n    \"        self.z = [self.v_min + i * self.delta_z for i in range(self.atoms)]\\n\",\n    \"\\n\",\n    \"        self.build_categorical_DQN()\\n\",\n    \"                                       \\n\",\n    \"        #initialize all the TensorFlow variables\\n\",\n    \"        self.sess.run(tf.global_variables_initializer())\\n\",\n    \"\\n\",\n    \"        \\n\",\n    \"    #let's define a function called build_network for building a deep network. Since we are\\n\",\n    \"    #dealing with the Atari games, we use the convolutional neural network\\n\",\n    \"                                       \\n\",\n    \"    def build_network(self, state, action, name, units_1, units_2, weights, bias, reg=None):\\n\",\n    \"                                       \\n\",\n    \"        #define the first convolutional layer\\n\",\n    \"        with tf.variable_scope('conv1'):\\n\",\n    \"            conv1 = conv(state, [5, 5, 3, 6], [6], [1, 2, 2, 1], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the second convolutional layer\\n\",\n    \"        with tf.variable_scope('conv2'):\\n\",\n    \"            conv2 = conv(conv1, [3, 3, 6, 12], [12], [1, 2, 2, 1], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #flatten the feature maps obtained as a result of the second convolutional layer\\n\",\n    \"        with tf.variable_scope('flatten'):\\n\",\n    \"            flatten = tf.layers.flatten(conv2)\\n\",\n    \"    \\n\",\n    \"        #define the first dense layer\\n\",\n    \"        with tf.variable_scope('dense1'):\\n\",\n    \"            dense1 = dense(flatten, units_1, [units_1], weights, 
bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the second dense layer\\n\",\n    \"        with tf.variable_scope('dense2'):\\n\",\n    \"            dense2 = dense(dense1, units_2, [units_2], weights, bias)\\n\",\n    \"                                       \\n\",\n    \"        #concatenate the second dense layer with the action\\n\",\n    \"        with tf.variable_scope('concat'):\\n\",\n    \"            concatenated = tf.concat([dense2, tf.cast(action, tf.float32)], 1)\\n\",\n    \"                                       \\n\",\n    \"        #define the third dense layer and apply the softmax function to its result to\\n\",\n    \"        #obtain the probabilities for each of the atoms\\n\",\n    \"        with tf.variable_scope('dense3'):\\n\",\n    \"            dense3 = dense(concatenated, self.atoms, [self.atoms], weights, bias) \\n\",\n    \"        return tf.nn.softmax(dense3)\\n\",\n    \"\\n\",\n    \"    #now, let's define a function called build_categorical_DQN for building the main and\\n\",\n    \"    #target categorical deep Q networks\\n\",\n    \"                                       \\n\",\n    \"    def build_categorical_DQN(self):      \\n\",\n    \"                                       \\n\",\n    \"        #define the main categorical DQN and obtain the probabilities\\n\",\n    \"        with tf.variable_scope('main_net'):\\n\",\n    \"            name = ['main_net_params',tf.GraphKeys.GLOBAL_VARIABLES]\\n\",\n    \"            weights = tf.random_uniform_initializer(-0.1,0.1)\\n\",\n    \"            bias = tf.constant_initializer(0.1)\\n\",\n    \"\\n\",\n    \"            self.main_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)\\n\",\n    \"                                       \\n\",\n    \"        #define the target categorical DQN and obtain the probabilities\\n\",\n    \"        with tf.variable_scope('target_net'):\\n\",\n    \"            name 
= ['target_net_params',tf.GraphKeys.GLOBAL_VARIABLES]\\n\",\n    \"\\n\",\n    \"            weights = tf.random_uniform_initializer(-0.1,0.1)\\n\",\n    \"            bias = tf.constant_initializer(0.1)\\n\",\n    \"\\n\",\n    \"            self.target_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)\\n\",\n    \"\\n\",\n    \"        #compute the main Q value with probabilities obtained from the main categorical DQN\\n\",\n    \"        self.main_Q = tf.reduce_sum(self.main_p * self.z)\\n\",\n    \"                                    \\n\",\n    \"        #similarly, compute the target Q value with probabilities obtained from the target categorical DQN \\n\",\n    \"        self.target_Q = tf.reduce_sum(self.target_p * self.z)\\n\",\n    \"        \\n\",\n    \"        #define the cross entropy loss\\n\",\n    \"        self.cross_entropy_loss = -tf.reduce_sum(self.m_ph * tf.log(self.main_p))\\n\",\n    \"        \\n\",\n    \"        #define the optimizer and minimize the cross entropy loss using Adam optimizer\\n\",\n    \"        self.optimizer = tf.train.AdamOptimizer(0.01).minimize(self.cross_entropy_loss)\\n\",\n    \"    \\n\",\n    \"        #get the main network parameters\\n\",\n    \"        main_net_params = tf.get_collection(\\\"main_net_params\\\")\\n\",\n    \"        \\n\",\n    \"        #get the target network parameters\\n\",\n    \"        target_net_params = tf.get_collection('target_net_params')\\n\",\n    \"        \\n\",\n    \"        #define the update_target_net operation for updating the target network parameters by\\n\",\n    \"        #copying the parameters of the main network\\n\",\n    \"        self.update_target_net = [tf.assign(t, e) for t, e in zip(target_net_params, main_net_params)]\\n\",\n    \"\\n\",\n    \"    #let's define a function called train to train the network\\n\",\n    \"    def train(self,s,r,action,s_,gamma):\\n\",\n    \"        \\n\",\n    \"        #increment the time 
step\\n\",\n    \"        self.time_step += 1\\n\",\n    \"    \\n\",\n    \"        #get the target Q values\\n\",\n    \"        list_q_ = [self.sess.run(self.target_Q,feed_dict={self.state_ph:[s_],self.action_ph:[[a]]}) for a in range(self.action_shape)]\\n\",\n    \"        \\n\",\n    \"        #select the next state action a_ (a') as the one that has the maximum Q value\\n\",\n    \"        a_ = tf.argmax(list_q_).eval()\\n\",\n    \"        \\n\",\n    \"        #initialize an array m of zeros with shape equal to the number of atoms (supports). This denotes\\n\",\n    \"        #the distributed probability of the target distribution after the projection step\\n\",\n    \"\\n\",\n    \"        m = np.zeros(self.atoms)\\n\",\n    \"        \\n\",\n    \"        #get the probability for each atom using the target categorical DQN\\n\",\n    \"        p = self.sess.run(self.target_p,feed_dict = {self.state_ph:[s_],self.action_ph:[[a_]]})[0]\\n\",\n    \"        \\n\",\n    \"        #perform the projection step\\n\",\n    \"        for j in range(self.atoms):\\n\",\n    \"            Tz = min(self.v_max,max(self.v_min,r+gamma * self.z[j]))\\n\",\n    \"            bj = (Tz - self.v_min) / self.delta_z \\n\",\n    \"            l,u = math.floor(bj),math.ceil(bj) \\n\",\n    \"\\n\",\n    \"            pj = p[j]\\n\",\n    \"\\n\",\n    \"            #when bj is an integer, l equals u and both fractions would be zero,\\n\",\n    \"            #so assign the whole probability mass to that single atom\\n\",\n    \"            if l == u:\\n\",\n    \"                m[int(l)] += pj\\n\",\n    \"            else:\\n\",\n    \"                m[int(l)] += pj * (u - bj)\\n\",\n    \"                m[int(u)] += pj * (bj - l)\\n\",\n    \"    \\n\",\n    \"        #train the network by minimizing the loss\\n\",\n    \"        self.sess.run(self.optimizer,feed_dict={self.state_ph:[s] , self.action_ph:[action], self.m_ph: m })\\n\",\n    \"        \\n\",\n    \"        #update the target network parameters by copying the main network parameters\\n\",\n    \"        if self.time_step % update_target_net == 0:\\n\",\n    \"            self.sess.run(self.update_target_net)\\n\",\n    \"    \\n\",\n    \"    #let's define a function called select_action for selecting the 
action. We generate a random number and if the number is less than epsilon we select a random\\n\",\n    \"    #action, else we select the action that has the maximum Q value.\\n\",\n    \"    def select_action(self,s):\\n\",\n    \"        if random.random() <= self.epsilon:\\n\",\n    \"            return random.randint(0, self.action_shape - 1)\\n\",\n    \"        else: \\n\",\n    \"            return np.argmax([self.sess.run(self.main_Q,feed_dict={self.state_ph:[s],self.action_ph:[[a]]}) for a in range(self.action_shape)])\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training the network\\n\",\n    \"\\n\",\n    \"Now, let's start training the network. First, create the Atari game environment using\\n\",\n    \"Gym. Let's create a Tennis game environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make(\\\"Tennis-v0\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create an object of our `Categorical_DQN` class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"agent = Categorical_DQN(env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define the number of episodes:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"num_episodes = 800\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for each episode\\n\",\n    \"for i in range(num_episodes):\\n\",\n    \"    \\n\",\n    \"    #set done to False\\n\",\n    \"    done = False\\n\",\n    \"    \\n\",\n    \"    #initialize 
the state by resetting the environment\\n\",\n    \"    state = env.reset()\\n\",\n    \"    \\n\",\n    \"    #initialize the return\\n\",\n    \"    Return = 0\\n\",\n    \"    \\n\",\n    \"    #while the episode is not over\\n\",\n    \"    while not done:\\n\",\n    \"        \\n\",\n    \"        #render the environment\\n\",\n    \"        env.render()\\n\",\n    \"        \\n\",\n    \"        #select an action\\n\",\n    \"        action = agent.select_action(state)\\n\",\n    \"        \\n\",\n    \"        #perform the selected action\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        \\n\",\n    \"        #update the return\\n\",\n    \"        Return = Return + reward\\n\",\n    \"        \\n\",\n    \"        #store the transition information into the replay buffer\\n\",\n    \"        replay_buffer.append([state, reward, [action], next_state])\\n\",\n    \"        \\n\",\n    \"        #if the length of the replay buffer is greater than or equal to the batch size then start training the\\n\",\n    \"        #network by sampling transitions from the replay buffer\\n\",\n    \"        if len(replay_buffer) >= batch_size:\\n\",\n    \"            trans = sample_transitions(2)\\n\",\n    \"            for item in trans:\\n\",\n    \"                agent.train(item[0],item[1], item[2], item[3],gamma)\\n\",\n    \"                \\n\",\n    \"        #update the state to the next state\\n\",\n    \"        state = next_state\\n\",\n    \"    \\n\",\n    \"    #print the return obtained in the episode\\n\",\n    \"    print(\\\"Episode:{}, Return: {}\\\".format(i,Return))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have learned how the categorical DQN works and how to implement it, in the next\\n\",\n    \"section, we will learn about another interesting algorithm.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   
\"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "14. Distributional Reinforcement Learning/README.md",
"content": "# 14. Distributional Reinforcement Learning\n* 14.1. Why Distributional Reinforcement Learning?\n* 14.2. Categorical DQN\n   * 14.2.1. Predicting Value Distribution\n   * 14.2.2. Selecting Action Based on the Value Distribution\n   * 14.2.3. Training the Categorical DQN\n   * 14.2.4. Projection Step\n   * 14.2.5. Putting it all Together\n   * 14.2.6. Algorithm - Categorical DQN\n* 14.3. Playing Atari games using Categorical DQN\n* 14.4. Quantile Regression DQN\n* 14.5. Math Essentials\n   * 14.5.1. Quantile\n   * 14.5.2. Inverse CDF (Quantile function)\n* 14.6. Understanding QR-DQN\n   * 14.6.1. Action Selection\n   * 14.6.2. Loss Function\n* 14.7. Distributed Distributional DDPG\n   * 14.7.1. Critic Network\n   * 14.7.2. Actor Network\n   * 14.7.3. Algorithm - D4PG"
  },
  {
    "path": "15. Imitation Learning and Inverse RL/.ipynb_checkpoints/13.01. Supervised Imitation Learning -checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Supervised Imitation Learning \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"In the imitation learning setting, our goal is to mimic the expert. Say we want to train our agent to drive a car; instead of training the agent from scratch by interacting with the environment, we can train it with expert demonstrations. Okay, what are expert demonstrations? Expert demonstrations are a set of trajectories consisting of state-action pairs where each action is performed by the expert.\\n\",\n    \"\\n\",\n    \"We can train the agent to mimic the actions performed by the expert in the respective states. Thus, we can view expert demonstrations as training data to train our agent. The fundamental idea of imitation learning is to imitate (learn) the behavior of an expert.\\n\",\n    \"\\n\",\n    \"One of the simplest and most naive ways to perform imitation learning is to treat the imitation learning task as a supervised learning task. First, we collect a set of expert demonstrations and then train a classifier to perform the same action performed by the expert in a particular state. 
We can view this as a big multiclass classification problem and train our agent to perform the action performed by the expert in the respective state.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Our goal is to minimize the loss $L(a^*, \\\\pi_{\\\\theta}(s)) $ where $a^*$ is the expert action and $\\\\pi_{\\\\theta}(s) $ denotes the action performed by our agent.\\n\",\n    \"\\n\",\n    \"Thus, in supervised imitation learning, we perform the following steps: \\n\",\n    \"\\n\",\n    \"* Collect the set of expert demonstrations\\n\",\n    \"* Initialize a policy $\\\\pi_{\\\\theta}(s) $\\n\",\n    \"* Learn the policy by minimizing the loss function $L(a^*, \\\\pi_{\\\\theta}(s)) $\\n\",\n    \"\\n\",\n    \"However, there are several challenges and drawbacks to this method. The knowledge of the agent is limited to the expert demonstrations (training data), so if the agent comes across a new state that is not present in the expert demonstrations, it will not know what action to perform in that state. \\n\",\n    \"\\n\",\n    \"Say we train the agent to drive a car using supervised imitation learning and let the agent perform in the real world. If the training data has no state where the agent encounters a traffic signal, then our agent will have no clue about the traffic signal. Also, the accuracy of the agent is highly dependent on the knowledge of the expert. If the expert demonstrations are poor or not optimal, then the agent cannot learn the correct actions or an optimal policy. \\n\",\n    \"\\n\",\n    \"To overcome the challenges in supervised imitation learning, we introduce a new algorithm called DAgger. In the next section, we will learn how DAgger works and how it overcomes the limitations of supervised imitation learning. 
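\\n\",\n    \"\\n\",\n    \"Before moving on, the three steps listed above can be sketched in Python-style pseudocode (a minimal sketch; the helper names `collect_expert_demonstrations`, `initialize_policy`, and `policy.update` are hypothetical, not from this chapter):\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"D = collect_expert_demonstrations()   #set of (state, expert action) pairs\\n\",\n    \"policy = initialize_policy()          #policy pi_theta\\n\",\n    \"for state, expert_action in D:\\n\",\n    \"    #minimize the loss L(a*, pi_theta(s)) between the expert action\\n\",\n    \"    #and the action predicted by our policy in that state\\n\",\n    \"    policy.update(state, expert_action)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"This is plain supervised learning on the demonstration data, which is exactly why the method inherits the limitations discussed above.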
\\n\",\n    \"\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "15. Imitation Learning and Inverse RL/.ipynb_checkpoints/13.02. DAgger-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# DAgger\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"__DAgger (Dataset Aggregation)__ is one of the most widely used imitation learning algorithms. Let's understand how DAgger works with an example. Suppose we want to train our agent to drive a car. First, we initialize an empty dataset $\\\\mathcal{D}$. \\n\",\n    \"\\n\",\n    \"__In the first iteration__, we will start off with some policy $\\\\pi_1$ and drive the car. Thus, we generate a trajectory $\\\\tau$ using the policy $\\\\pi_1$. We know that the trajectory consists of a sequence of states and actions. That is, states visited by our policy $\\\\pi_1$ and actions taken in those states using our policy $\\\\pi_1$. Now, we create a new dataset $\\\\mathcal{D}_1 $ by taking only the states visited by our policy $\\\\pi_1$ and we use an expert to provide the actions for those states. That is, we take all the states from the trajectory and ask the expert to provide actions for those states. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, we combine the new dataset $\\\\mathcal{D}_1 $ with our initialized empty dataset $\\\\mathcal{D} $ and update $\\\\mathcal{D}$ as: \\n\",\n    \"\\n\",\n    \"$$ \\\\mathcal{D} = \\\\mathcal{D} \\\\cup  \\\\mathcal{D}_1 $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Next, we train a classifier on this updated dataset $\\\\mathcal{D}$ and learn a new policy $\\\\pi_2$. 
\\n\",\n    \"\\n\",\n    \"__In the second iteration__, we use the new policy $\\\\pi_2$ to generate trajectories and create a new dataset $\\\\mathcal{D}_2 $ by taking only the states visited by the new policy  $\\\\pi_2$ and ask the expert to provide the actions for those states.\\n\",\n    \"\\n\",\n    \"Now, we combine the dataset $\\\\mathcal{D}_2 $ with  $\\\\mathcal{D} $  and update $\\\\mathcal{D} $  as:\\n\",\n    \"\\n\",\n    \"$$\\\\mathcal{D} = \\\\mathcal{D} \\\\cup  \\\\mathcal{D}_2 $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Next, we train a classifier on this updated dataset $\\\\mathcal{D}$ and learn a new policy $\\\\pi_3$. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"__In the third iteration__, we use the new policy $\\\\pi_3$ to generate trajectories and create a new dataset $\\\\mathcal{D}_3 $  by taking only the states visited by the new policy $\\\\pi_3$ and then we ask the expert to provide the actions for those states.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, we combine the dataset $\\\\mathcal{D}_3 $ with  $\\\\mathcal{D} $  and update $\\\\mathcal{D} $  as:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"$$\\\\mathcal{D} = \\\\mathcal{D} \\\\cup  \\\\mathcal{D}_3 $$\\n\",\n    \"\\n\",\n    \"Next, we train a classifier on this updated dataset $\\\\mathcal{D} $  and learn a new policy $\\\\pi_4$. In this way, DAgger works in a series of iterations until it finds the optimal policy. 
\\n\",\n    \"\\n\",\n    \"__Now that we have a basic understanding of DAgger, let's get into more details and learn how DAgger finds the optimal policy.__\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "15. Imitation Learning and Inverse RL/15.02. DAgger.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# DAgger\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"__DAgger (Dataset Aggregation)__ is one of the most widely used imitation learning algorithms. Let's understand how DAgger works with an example. Suppose we want to train our agent to drive a car. First, we initialize an empty dataset $\\\\mathcal{D}$. \\n\",\n    \"\\n\",\n    \"__In the first iteration__, we will start off with some policy $\\\\pi_1$ and drive the car. Thus, we generate a trajectory $\\\\tau$ using the policy $\\\\pi_1$. We know that the trajectory consists of a sequence of states and actions. That is, states visited by our policy $\\\\pi_1$ and actions taken in those states using our policy $\\\\pi_1$. Now, we create a new dataset $\\\\mathcal{D}_1 $ by taking only the states visited by our policy $\\\\pi_1$ and we use an expert to provide the actions for those states. That is, we take all the states from the trajectory and ask the expert to provide actions for those states. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, we combine the new dataset $\\\\mathcal{D}_1 $ with our initialized empty dataset $\\\\mathcal{D} $ and update $\\\\mathcal{D}$ as: \\n\",\n    \"\\n\",\n    \"$$ \\\\mathcal{D} = \\\\mathcal{D} \\\\cup  \\\\mathcal{D}_1 $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Next, we train a classifier on this updated dataset $\\\\mathcal{D}$ and learn a new policy $\\\\pi_2$. 
\\n\",\n    \"\\n\",\n    \"__In the second iteration__, we use the new policy $\\\\pi_2$ to generate trajectories and create a new dataset $\\\\mathcal{D}_2 $ by taking only the states visited by the new policy  $\\\\pi_2$ and ask the expert to provide the actions for those states.\\n\",\n    \"\\n\",\n    \"Now, we combine the dataset $\\\\mathcal{D}_2 $ with  $\\\\mathcal{D} $  and update $\\\\mathcal{D} $  as:\\n\",\n    \"\\n\",\n    \"$$\\\\mathcal{D} = \\\\mathcal{D} \\\\cup  \\\\mathcal{D}_2 $$\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Next, we train a classifier on this updated dataset $\\\\mathcal{D}$ and learn a new policy $\\\\pi_3$. \\n\",\n    \"\\n\",\n    \"\\n\",\n    \"__In the third iteration__, we use the new policy $\\\\pi_3$ to generate trajectories and create a new dataset $\\\\mathcal{D}_3 $  by taking only the states visited by the new policy $\\\\pi_3$ and then we ask the expert to provide the actions for those states.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, we combine the dataset $\\\\mathcal{D}_3 $ with  $\\\\mathcal{D} $  and update $\\\\mathcal{D} $  as:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"$$\\\\mathcal{D} = \\\\mathcal{D} \\\\cup  \\\\mathcal{D}_3 $$\\n\",\n    \"\\n\",\n    \"Next, we train a classifier on this updated dataset $\\\\mathcal{D} $  and learn a new policy $\\\\pi_4$. In this way, DAgger works in a series of iterations until it finds the optimal policy. 
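\\n\",\n    \"\\n\",\n    \"The iterations described above can be summarized in Python-style pseudocode (a minimal sketch; the helper names `initialize_policy`, `generate_trajectory`, `expert_action`, and `train_classifier` are hypothetical, not from this chapter):\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"D = []                                    #initialize an empty dataset\\n\",\n    \"policy = initialize_policy()              #start off with some policy pi_1\\n\",\n    \"for i in range(num_iterations):\\n\",\n    \"    states = generate_trajectory(policy)  #states visited by the current policy\\n\",\n    \"    #ask the expert to provide actions for those states and aggregate them into D\\n\",\n    \"    D += [(s, expert_action(s)) for s in states]\\n\",\n    \"    policy = train_classifier(D)          #learn a new policy on the updated dataset\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Note that the dataset only ever grows, so each new policy is trained on the states visited by all of the previous policies.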
\\n\",\n    \"\\n\",\n    \"__Now that we have a basic understanding of DAgger, let's get into more details and learn how DAgger finds the optimal policy.__\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "15. Imitation Learning and Inverse RL/README.md",
"content": "# 15. Imitation Learning and Inverse RL\n* 15.1. Supervised Imitation Learning\n* 15.2. DAgger\n   * 15.2.1. Understanding DAgger\n   * 15.2.2. Algorithm - DAgger\n* 15.3. Deep Q learning from Demonstrations\n   * 15.3.1. Phases of DQfD\n   * 15.3.2. Loss Function of DQfD\n   * 15.3.3. Algorithm - DQfD\n* 15.4. Inverse Reinforcement Learning\n* 15.5. Maximum Entropy IRL\n   * 15.5.1. Key Terms\n   * 15.5.2. Back to Max entropy IRL\n   * 15.5.3. Computing the Gradient\n   * 15.5.4. Algorithm - Maximum Entropy IRL\n* 15.6. Generative Adversarial Imitation Learning\n   * 15.6.1. Formulation of GAIL"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.01. Creating our First Agent with Baseline-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating our first agent with baseline\\n\",\n    \"\\n\",\n    \"Now, let's create our first deep reinforcement learning algorithm using baseline. Let's create\\n\",\n    \"a simple agent using deep Q network for the mountain car climbing task. We know that\\n\",\n    \"in the mountain car climbing task, a car is placed between the two mountains and the goal\\n\",\n    \"of the agent is to drive up the mountain on the right.\\n\",\n    \"\\n\",\n    \"First, let's import the gym and DQN from the stable baselines:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"                        \\n\",\n    \"import gym\\n\",\n    \"from stable_baselines import DQN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a mountain car environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('MountainCar-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's instantiate our agent. As we can observe in the below code, we are passing the\\n\",\n    \"MlpPolicy, it implies that our network is a multilayer perceptron. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN('MlpPolicy', env, learning_rate=1e-3)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's train the agent by specifying the number of time steps we want to train: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.deepq.dqn.DQN at 0x7f4190078240>\"\n      ]\n     },\n     \"execution_count\": 4,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"That's it. Building a DQN agent and training them is that simple.\\n\",\n    \"\\n\",\n    \"## Evaluating the trained agent\\n\",\n    \"\\n\",\n    \"We can also evaluate the trained agent by looking at the mean rewards using\\n\",\n    \"evaluate_policy\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.evaluation import evaluate_policy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the below code, agent is the trained agent, agent.get_env() gets the environment we\\n\",\n    \"trained our agent with, n_eval_episodes implies the number of episodes we need to\\n\",\n    \"evaluate our agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(), n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    
\"## Storing and loading the trained agent\\n\",\n    \"\\n\",\n    \"With stable baselines, we can also save our trained agent and load them.\\n\",\n    \"We can save the agent as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.save(\\\"DQN_mountain_car_agent\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After saving, we can load the agent as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN.load(\\\"DQN_mountain_car_agent\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Viewing the trained agent\\n\",\n    \"\\n\",\n    \"After training, we can also have a look at how our trained agent performs in the\\n\",\n    \"environment.\\n\",\n    \"\\n\",\n    \"Initialize the state:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#for some 5000 steps:\\n\",\n    \"for t in range(5000):\\n\",\n    \"    \\n\",\n    \"    #predict the action to perform in the given state using our trained agent:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    \\n\",\n    \"    #perform the predicted action\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    \\n\",\n    \"    #render the environment\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": 
\"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.04. Playing Atari games with DQN and its variants-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, let's learn how to create a deep Q network to play Atari games with stable baselines.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary modules:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines import DQN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since we are dealing with Atari games we can use a convolutional neural network instead\\n\",\n    \"of a vanilla neural network. So, we use CnnPolicy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.deepq.policies import CnnPolicy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We learned that we preprocess the game screen before feeding it to the agent. With\\n\",\n    \"baselines, we don't have to preprocess manually, instead, we can make use of make_atari\\n\",\n    \"module which takes care of preprocessing the game screen:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.atari_wrappers import make_atari\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create an Atari game environment. 
Let's create the Ice Hockey game\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = make_atari('IceHockeyNoFrameskip-v4')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN(CnnPolicy, env, verbose=1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.deepq.dqn.DQN at 0x7f63b8095198>\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training the agent, we can have a look at how our trained agent performs in the\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": 
\"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.05. Implementing DQN variants-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n    \"\\n\",\n    \"We just learned how to implement DQN using stable baselines. Now, let's see how to\\n\",\n    \"implement the variants of DQN such as double DQN, DQN with prioritized experience\\n\",\n    \"replay and dueling DQN. Implementing DQN variants is very simple with the baselines\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines import DQN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since we are dealing with Atari games we can use a convolutional neural network instead\\n\",\n    \"of a vanilla neural network. So, we use CnnPolicy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.deepq.policies import CnnPolicy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We learned that we preprocess the game screen before feeding it to the agent. With\\n\",\n    \"baselines, we don't have to preprocess manually, instead, we can make use of make_atari\\n\",\n    \"module which takes care of preprocessing the game screen:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.atari_wrappers import make_atari\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create an Atari game environment. 
Let's create the Ice Hockey game\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = make_atari('IceHockeyNoFrameskip-v4')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, we define our keyword arguments as shown below: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"kwargs = {\\\"double_q\\\": True, \\\"prioritized_replay\\\": True, \\\"policy_kwargs\\\": dict(dueling=True)}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, while instantiating our agent, we just need to pass the keyword arguments: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN(CnnPolicy, env, verbose=1, **kwargs)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.deepq.dqn.DQN at 0x7efe48132080>\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"That's it! Now we have the dueling double DQN with prioritized experience replay. 
\\n\",\n    \"\\n\",\n    \"After training the agent, we can have a look at how our trained agent performs in the\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.06. Lunar Lander using A2C-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\n\",\n    \"Let's learn how to implement A2C with stable baselines for the lunar landing task. In the\\n\",\n    \"lunar lander environment, our agent drives the space vehicle and the goal of the agent is to\\n\",\n    \"land correctly on the landing pad. If our agent (lander) lands away from the landing pad,\\n\",\n    \"then it loses the reward and the episode will get terminated if the agent crashes or comes to\\n\",\n    \"rest. The action space of the environment includes four discrete actions which are do\\n\",\n    \"nothing, a fire left orientation engine, fire main engine, and fire right orientation engine.\\n\",\n    \"Now, Let's see how to train the agent using A2C to correctly land on the landing pad.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines import A2C\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the lunar lander environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('LunarLander-v2')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's use the dummy vectorized environment, we learned that in the dummy vectorized\\n\",\n    \"environment, we run 
each environment in the same process:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: env])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = A2C(MlpPolicy, env, ent_coef=0.1, verbose=0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.a2c.a2c.A2C at 0x7fbb000ca518>\"\n      ]\n     },\n     \"execution_count\": 5,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = 
next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.07. Creating a custom network-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n    \"In the previous section, we learned how to create A2C using stable baselines. Instead of\\n\",\n    \"using the default network, can we customize the network architecture? Yes! With a stable\\n\",\n    \"baseline, we can also use our own custom architecture. Let's see how to do that. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines.common.policies import FeedForwardPolicy\\n\",\n    \"from stable_baselines import A2C\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the lunar lander environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('LunarLander-v2')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's use the dummy vectorized environment, we learned that in the dummy vectorized\\n\",\n    \"environment, we run each environment in the same process:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: env])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can define our custom policy (custom network) as shown below. 
As we can\\n\",\n    \"observe in the below code, we are passing net_arch=[dict(pi=[128, 128, 128],\\n\",\n    \"vf=[128, 128, 128])], which implies our network architecture. pi implies the\\n\",\n    \"architecture of the policy network and vf implies the architecture of value network: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class CustomPolicy(FeedForwardPolicy):\\n\",\n    \"    def __init__(self, *args, **kargs):\\n\",\n    \"        super(CustomPolicy, self).__init__(*args, **kargs,\\n\",\n    \"                                           net_arch=[dict(pi=[128, 128, 128], vf=[128, 128, 128])],\\n\",\n    \"                                           feature_extraction=\\\"mlp\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = A2C(CustomPolicy, env, ent_coef=0.1, verbose=0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent as usual:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.a2c.a2c.A2C at 0x7fb2cc0e0710>\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    
\"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/14.08. Swinging up a pendulum using DDPG-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDPG\\n\",\n    \"Let's learn how to implement the DDPG for the swinging up pendulum task using stable\\n\",\n    \"baselines. First, let's import the necessary libraries\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"from stable_baselines.ddpg.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec\\n\",\n    \"from stable_baselines import DDPG\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the pendlum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Pendulum-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the number of actions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"n_actions = env.action_space.shape[-1]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We know that in DDPG, instead of selecting the action directly, we add some noise using the Ornstein-Uhlenbeck process to ensure exploration. 
So, we create the action noise as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DDPG(MlpPolicy, env, verbose=1, param_noise=None, action_noise=action_noise)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent as usual:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.ddpg.ddpg.DDPG at 0x7f4f1801f7f0>\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"   
 action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training the agent, we can also look at how our trained agent swings up the\\n\",\n    \"pendulum by rendering the environment. Can we also look at the computational graph of\\n\",\n    \"DDPG? Yes! In the next section, we will learn how to do that.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Viewing the computational graph in TensorBoard\\n\",\n    \"\\n\",\n    \"With stables baselines, it is easier to view the computational graph of our model in\\n\",\n    \"TensorBoard. In order to that, we just need to pass the directory where we need to store our\\n\",\n    \"log files while instantiating the agent as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DDPG(MlpPolicy, env, verbose=1, param_noise=None,action_noise=action_noise, tensorboard_log=\\\"logs\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Then, we can train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/stable_baselines/common/base_class.py:1082: The name tf.summary.FileWriter is deprecated. 
Please use tf.compat.v1.summary.FileWriter instead.\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.ddpg.ddpg.DDPG at 0x7f4f10154898>\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, open the terminal and type the following command to run the TensorBoard:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"tensorboard --logdir logs\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.01. Creating our First Agent with Stable Baseline-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Creating our first agent with Stable Baseline\\n\",\n    \"\\n\",\n    \"Note that currently, Stable Baselines works only with TensorFlow version 1.x. So,\\n\",\n    \"make sure you are running the Stable Baselines experiment with TensorFlow 1.x.\\n\",\n    \"\\n\",\n    \"Now, let's create our first deep reinforcement learning algorithm using baseline. Let's create\\n\",\n    \"a simple agent using deep Q network for the mountain car climbing task. We know that\\n\",\n    \"in the mountain car climbing task, a car is placed between the two mountains and the goal\\n\",\n    \"of the agent is to drive up the mountain on the right.\\n\",\n    \"\\n\",\n    \"First, let's import the gym and DQN from the stable baselines:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"                        \\n\",\n    \"import gym\\n\",\n    \"from stable_baselines import DQN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a mountain car environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('MountainCar-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's instantiate our agent. As we can observe in the below code, we are passing the\\n\",\n    \"`MlpPolicy`, it implies that our network is a multilayer perceptron. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN('MlpPolicy', env, learning_rate=1e-3)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's train the agent by specifying the number of time steps we want to train for: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.deepq.dqn.DQN at 0x7f4190078240>\"\n      ]\n     },\n     \"execution_count\": 4,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"That's it. Building a DQN agent and training it is that simple.\\n\",\n    \"\\n\",\n    \"## Evaluating the trained agent\\n\",\n    \"\\n\",\n    \"We can also evaluate the trained agent by looking at the mean rewards using\\n\",\n    \"`evaluate_policy`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.evaluation import evaluate_policy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the code below, `agent` is the trained agent, `agent.get_env()` gets the environment we\\n\",\n    \"trained our agent with, and `n_eval_episodes` specifies the number of episodes over which we\\n\",\n    \"evaluate our agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(), n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n  
  \"## Storing and loading the trained agent\\n\",\n    \"\\n\",\n    \"With Stable Baselines, we can also save our trained agent and load it later.\\n\",\n    \"We can save the agent as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.save(\\\"DQN_mountain_car_agent\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After saving, we can load the agent as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN.load(\\\"DQN_mountain_car_agent\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Viewing the trained agent\\n\",\n    \"\\n\",\n    \"After training, we can also have a look at how our trained agent performs in the\\n\",\n    \"environment.\\n\",\n    \"\\n\",\n    \"Initialize the state:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#run for 5000 steps:\\n\",\n    \"for t in range(5000):\\n\",\n    \"    \\n\",\n    \"    #predict the action to perform in the given state using our trained agent:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    \\n\",\n    \"    #perform the predicted action\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    \\n\",\n    \"    #update the current state to the next state\\n\",\n    \"    state = next_state\\n\",\n    \"    \\n\",\n    \"    #render the environment\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n 
  \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.04. Playing Atari games with DQN and its variants-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, let's learn how to create a deep Q network to play Atari games with Stable Baselines.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary modules:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines import DQN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since we are dealing with Atari games, we can use a convolutional neural network instead\\n\",\n    \"of a vanilla neural network. So, we use `CnnPolicy`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.deepq.policies import CnnPolicy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We learned that we preprocess the game screen before feeding it to the agent. With\\n\",\n    \"Stable Baselines, we don't have to preprocess manually; instead, we can make use of the\\n\",\n    \"`make_atari` helper, which takes care of preprocessing the game screen:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.atari_wrappers import make_atari\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create an Atari game environment. 
Let's create the Ice Hockey game\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = make_atari('IceHockeyNoFrameskip-v4')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN(CnnPolicy, env, verbose=1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.deepq.dqn.DQN at 0x7f63b8095198>\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training the agent, we can have a look at how our trained agent performs in the\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": 
\"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.05. Implementing DQN variants-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n    \"\\n\",\n    \"We just learned how to implement DQN using Stable Baselines. Now, let's see how to\\n\",\n    \"implement the variants of DQN such as double DQN, DQN with prioritized experience\\n\",\n    \"replay, and dueling DQN. Implementing DQN variants is very simple with Stable Baselines.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines import DQN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since we are dealing with Atari games, we can use a convolutional neural network instead\\n\",\n    \"of a vanilla neural network. So, we use `CnnPolicy`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.deepq.policies import CnnPolicy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We learned that we preprocess the game screen before feeding it to the agent. With\\n\",\n    \"Stable Baselines, we don't have to preprocess manually; instead, we can make use of the\\n\",\n    \"`make_atari` helper, which takes care of preprocessing the game screen:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.atari_wrappers import make_atari\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create an Atari game environment. 
Let's create the Ice Hockey game\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = make_atari('IceHockeyNoFrameskip-v4')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, we define our keyword arguments as shown below: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"kwargs = {\\\"double_q\\\": True, \\\"prioritized_replay\\\": True, \\\"policy_kwargs\\\": dict(dueling=True)}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, while instantiating our agent, we just need to pass the keyword arguments: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN(CnnPolicy, env, verbose=1, **kwargs)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.deepq.dqn.DQN at 0x7efe48132080>\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"That's it! Now we have the dueling double DQN with prioritized experience replay. 
\\n\",\n    \"\\n\",\n    \"After training the agent, we can have a look at how our trained agent performs in the\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.06. Lunar Lander using A2C-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\n\",\n    \"Let's learn how to implement A2C with Stable Baselines for the lunar landing task. In the\\n\",\n    \"lunar lander environment, our agent drives the space vehicle, and the goal of the agent is to\\n\",\n    \"land correctly on the landing pad. If our agent (lander) lands away from the landing pad,\\n\",\n    \"then it loses reward, and the episode terminates if the agent crashes or comes to\\n\",\n    \"rest. The action space of the environment includes four discrete actions: do\\n\",\n    \"nothing, fire left orientation engine, fire main engine, and fire right orientation engine.\\n\",\n    \"Now, let's see how to train the agent using A2C to correctly land on the landing pad.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines import A2C\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the lunar lander environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('LunarLander-v2')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's use the dummy vectorized environment; we learned that in the dummy vectorized\\n\",\n    \"environment, we run 
each environment in the same process:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: env])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = A2C(MlpPolicy, env, ent_coef=0.1, verbose=0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.a2c.a2c.A2C at 0x7fbb000ca518>\"\n      ]\n     },\n     \"execution_count\": 5,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = 
next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.07. Creating a custom network-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n    \"In the previous section, we learned how to create A2C using Stable Baselines. Instead of\\n\",\n    \"using the default network, can we customize the network architecture? Yes! With Stable\\n\",\n    \"Baselines, we can also use our own custom architecture. Let's see how to do that. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines.common.policies import FeedForwardPolicy\\n\",\n    \"from stable_baselines import A2C\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the lunar lander environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('LunarLander-v2')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's use the dummy vectorized environment; we learned that in the dummy vectorized\\n\",\n    \"environment, we run each environment in the same process:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: env])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can define our custom policy (custom network) as shown below. 
As we can\\n\",\n    \"observe in the code below, we are passing `net_arch=[dict(pi=[128, 128, 128],`\\n\",\n    \"`vf=[128, 128, 128])]`, which specifies our network architecture. `pi` denotes the\\n\",\n    \"architecture of the policy network and `vf` denotes the architecture of the value network: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class CustomPolicy(FeedForwardPolicy):\\n\",\n    \"    def __init__(self, *args, **kwargs):\\n\",\n    \"        super(CustomPolicy, self).__init__(*args, **kwargs,\\n\",\n    \"                                           net_arch=[dict(pi=[128, 128, 128], vf=[128, 128, 128])],\\n\",\n    \"                                           feature_extraction=\\\"mlp\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = A2C(CustomPolicy, env, ent_coef=0.1, verbose=0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent as usual:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.a2c.a2c.A2C at 0x7fb2cc0e0710>\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    
\"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.08. Swinging up a pendulum using DDPG-checkpoint.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDPG\\n\",\n    \"Let's learn how to implement DDPG for the pendulum swing-up task using Stable\\n\",\n    \"Baselines. First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"from stable_baselines.ddpg.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec\\n\",\n    \"from stable_baselines import DDPG\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the pendulum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Pendulum-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the number of actions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"n_actions = env.action_space.shape[-1]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We know that in DDPG, instead of selecting the action directly, we add some noise using the Ornstein-Uhlenbeck process to ensure exploration. 
So, we create the action noise as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DDPG(MlpPolicy, env, verbose=1, param_noise=None, action_noise=action_noise)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent as usual:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.ddpg.ddpg.DDPG at 0x7f4f1801f7f0>\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    
action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training the agent, we can also look at how our trained agent swings up the\\n\",\n    \"pendulum by rendering the environment. Can we also look at the computational graph of\\n\",\n    \"DDPG? Yes! In the next section, we will learn how to do that.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Viewing the computational graph in TensorBoard\\n\",\n    \"\\n\",\n    \"With Stable Baselines, it is easy to view the computational graph of our model in\\n\",\n    \"TensorBoard. In order to do that, we just need to pass the directory where we need to store our\\n\",\n    \"log files while instantiating the agent, as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DDPG(MlpPolicy, env, verbose=1, param_noise=None, action_noise=action_noise, tensorboard_log=\\\"logs\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Then, we can train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/stable_baselines/common/base_class.py:1082: The name tf.summary.FileWriter is deprecated. 
Please use tf.compat.v1.summary.FileWriter instead.\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.ddpg.ddpg.DDPG at 0x7f4f10154898>\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, open the terminal and type the following command to run TensorBoard:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"`tensorboard --logdir logs`\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.09. Training an agent to walk using TRPO-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training an agent to walk using TRPO\\n\",\n    \"In this section, let's learn how to train the agent to walk using Trust Region Policy\\n\",\n    \"Optimization (TRPO). \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize\\n\",\n    \"from stable_baselines import TRPO\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a vectorized Humanoid environment using `DummyVecEnv`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: gym.make(\\\"Humanoid-v2\\\")])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Normalize the states (observations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = VecNormalize(env, norm_obs=True, norm_reward=False,\\n\",\n    \"                   clip_obs=10.)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = TRPO(MlpPolicy, env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    
\"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Save the whole code used in this section in a python file called trpo.py and then open\\n\",\n    \"terminal and run the file:\\n\",\n    \"\\n\",\n    \"`python trpo.py`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Recording the video\\n\",\n    \"\\n\",\n    \"In the previous section, we trained our agent to learn to walk using TRPO. Can we also\\n\",\n    \"record the video of our trained agent? Yes! With stable baselines, we can easily record a\\n\",\n    \"video of our agent using the VecVideoRecorder module.\\n\",\n    \"Note that to record the video, we need the ffmpeg package installed in our machine. 
If it is\\n\",\n    \"not installed then install that using the following set of commands:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"`sudo add-apt-repository ppa:mc3man/trusty-media\\n\",\n    \" sudo apt-get update\\n\",\n    \" sudo apt-get dist-upgrade\\n\",\n    \" sudo apt-get install ffmpeg`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's import the `VecVideoRecorder` module:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.vec_env import VecVideoRecorder\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define a function called `record_video` for recording the video:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def record_video(env_name, agent, video_length=500, prefix='', video_folder='videos/'):\\n\",\n    \"    \\n\",\n    \"    #create the environment\\n\",\n    \"    env = DummyVecEnv([lambda: gym.make(env_name)])\\n\",\n    \"    \\n\",\n    \"    #instantiate the video recorder\\n\",\n    \"    env = VecVideoRecorder(env, video_folder=video_folder,\\n\",\n    \"        record_video_trigger=lambda step: step == 0, video_length=video_length, name_prefix=prefix)\\n\",\n    \"\\n\",\n    \"    #select actions in the environment using our trained agent where the number of time steps is\\n\",\n    \"    #set to video length:\\n\",\n    \"    state = env.reset()\\n\",\n    \"    for _ in range(video_length):\\n\",\n    \"        action, _ = agent.predict(state)\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        state = next_state\\n\",\n    \"\\n\",\n    \"    env.close()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"That's it! 
Now, let's call our `record_video` function. Note that we are passing the\\n\",\n    \"environment name, our trained agent, the length of the video, and the name prefix of our video file: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"record_video('Humanoid-v2', agent, video_length=500, prefix='Humanoid_walk_TRPO')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we will have a new file called `Humanoid_walk_TRPO-step-0-to-step-500.mp4` in\\n\",\n    \"the `videos` folder.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/16.10. Training cheetah bot to run using PPO-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training cheetah bot to run using PPO\\n\",\n    \"In this section, we will learn how to train the 2D cheetah bot to run using PPO. First, import the\\n\",\n    \"necessary libraries: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize\\n\",\n    \"from stable_baselines import PPO2\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a vectorized environment using `DummyVecEnv`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: gym.make(\\\"HalfCheetah-v2\\\")])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Normalize the states (observations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = VecNormalize(env, norm_obs=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = PPO2(MlpPolicy, env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Save the whole code used in this section in a python file called ppo.py and then open\\n\",\n    \"terminal and run the file:\\n\",\n    \"\\n\",\n    \"`python ppo.py`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Making a GIF of a trained agent\\n\",\n    \"\\n\",\n    \"In the previous section, we learned how to train the cheetah bot to run using PPO. Can we\\n\",\n    \"also create a GIF file of our trained agent? Yes! 
Let's see how to do that.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"First, import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import imageio\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the list for storing images:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"images = []\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the state by resetting the environment; here, the agent is the one we trained in\\n\",\n    \"the previous section:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = agent.env.reset()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Render the environment and get the image:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"img = agent.env.render(mode='rgb_array')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For every step in the environment, save the image:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(500):\\n\",\n    \"    images.append(img)\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = agent.env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    img = agent.env.render(mode='rgb_array')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create 
the GIF file as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"imageio.mimsave('HalfCheetah.gif', [np.array(img) for i, img in enumerate(images) if i%2 == 0], fps=29)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we will have a new file called `HalfCheetah.gif`.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Creating a custom network-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n    \"In the previous section, we learned how to implement A2C using stable baselines. Instead of\\n\",\n    \"using the default network, can we customize the network architecture? Yes! With stable\\n\",\n    \"baselines, we can also use our own custom architecture. Let's see how to do that. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from stable_baselines.common.policies import FeedForwardPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines import A2C\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the lunar lander environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('LunarLander-v2')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's use the dummy vectorized environment. We learned that in the dummy vectorized\\n\",\n    \"environment, we run each environment in the same process:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: env])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can define our custom policy (custom network) as shown below. 
As we can\\n\",\n    \"observe in the below code, we are passing net_arch=[dict(pi=[128, 128, 128],\\n\",\n    \"vf=[128, 128, 128])], which implies our network architecture. pi implies the\\n\",\n    \"architecture of the policy network and vf implies the architecture of value network: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class CustomPolicy(FeedForwardPolicy):\\n\",\n    \"    def __init__(self, *args, **kargs):\\n\",\n    \"        super(CustomPolicy, self).__init__(*args, **kargs,\\n\",\n    \"                                           net_arch=[dict(pi=[128, 128, 128], vf=[128, 128, 128])],\\n\",\n    \"                                           feature_extraction=\\\"mlp\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = A2C(CustomPolicy, env, ent_coef=0.1, verbose=0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent as usual:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained 
agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Implementing DQN variants-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n    \"\\n\",\n    \"We just learned how to implement DQN using stable baselines. Now, let's see how to\\n\",\n    \"implement the variants of DQN such as double DQN, DQN with prioritized experience\\n\",\n    \"replay, and dueling DQN. Implementing DQN variants is very simple with stable baselines:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines import DQN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since we are dealing with Atari games, we can use a convolutional neural network instead\\n\",\n    \"of a vanilla neural network. So, we use CnnPolicy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.deepq.policies import CnnPolicy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We learned that we preprocess the game screen before feeding it to the agent. With\\n\",\n    \"baselines, we don't have to preprocess manually; instead, we can make use of the make_atari\\n\",\n    \"module, which takes care of preprocessing the game screen:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.atari_wrappers import make_atari\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create an Atari game environment. 
Let's create the Ice Hockey game\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = make_atari('IceHockeyNoFrameskip-v4')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, we define our keyword arguments as shown below: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"kwargs = {\\\"double_q\\\": True, \\\"prioritized_replay\\\": True, \\\"policy_kwargs\\\": dict(dueling=True)}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, while instantiating our agent, we just need to pass the keyword arguments: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN(CnnPolicy, env, verbose=1, **kwargs)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"That's it! Now we have the dueling double DQN with prioritized experience replay. 
\\n\",\n    \"\\n\",\n    \"After training the agent, we can have a look at how our trained agent performs in the\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Lunar Lander using A2C-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\n\",\n    \"Let's learn how to implement A2C with stable baselines for the lunar landing task. In the\\n\",\n    \"lunar lander environment, our agent drives the space vehicle and the goal of the agent is to\\n\",\n    \"land correctly on the landing pad. If our agent (lander) lands away from the landing pad,\\n\",\n    \"then it loses reward, and the episode terminates if the agent crashes or comes to\\n\",\n    \"rest. The action space of the environment includes four discrete actions: do\\n\",\n    \"nothing, fire left orientation engine, fire main engine, and fire right orientation engine.\\n\",\n    \"Now, let's see how to train the agent using A2C to land correctly on the landing pad.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines import A2C\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the lunar lander environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('LunarLander-v2')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's use the dummy vectorized environment. We learned that in the dummy vectorized\\n\",\n    \"environment, we 
run each environment in the same process:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: env])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = A2C(MlpPolicy, env, ent_coef=0.1, verbose=0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   
\"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Playing Atari games with DQN and its variants-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN and its variants\\n\",\n    \"\\n\",\n    \"Now, let's learn how to create a deep Q network to play Atari games with stable baselines.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary modules:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines import DQN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since we are dealing with Atari games, we can use a convolutional neural network instead\\n\",\n    \"of a vanilla neural network. So, we use CnnPolicy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.deepq.policies import CnnPolicy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We learned that we preprocess the game screen before feeding it to the agent. With\\n\",\n    \"baselines, we don't have to preprocess manually; instead, we can make use of the make_atari\\n\",\n    \"module, which takes care of preprocessing the game screen:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.atari_wrappers import make_atari\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create an Atari game environment. 
Let's create the Ice Hockey game\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = make_atari('IceHockeyNoFrameskip-v4')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN(CnnPolicy, env, verbose=1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training the agent, we can have a look at how our trained agent performs in the\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  
\"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Swinging up a pendulum using DDPG-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDPG\\n\",\n    \"Let's learn how to implement DDPG for the pendulum swing-up task using stable\\n\",\n    \"baselines. First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"from stable_baselines.ddpg.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines import DDPG\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the pendulum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Pendulum-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the number of actions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"n_actions = env.action_space.shape[-1]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We know that in DDPG, instead of selecting the action directly, we add some noise using the Ornstein-Uhlenbeck process to ensure exploration. 
So, we create the action noise as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DDPG(MlpPolicy, env, verbose=1, param_noise=None, action_noise=action_noise)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent as usual:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   
\"metadata\": {},\n   \"source\": [\n    \"After training the agent, we can also look at how our trained agent swings up the\\n\",\n    \"pendulum by rendering the environment. Can we also look at the computational graph of\\n\",\n    \"DDPG? Yes! In the next section, we will learn how to do that.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Viewing the computational graph in TensorBoard\\n\",\n    \"\\n\",\n    \"With stables baselines, it is easier to view the computational graph of our model in\\n\",\n    \"TensorBoard. In order to that, we just need to pass the directory where we need to store our\\n\",\n    \"log files while instantiating the agent as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DDPG(MlpPolicy, env, verbose=1, param_noise=None,action_noise=action_noise, tensorboard_log=\\\"logs\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Then, we can train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, open the terminal and type the following command to run the TensorBoard:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"tensorboard --logdir logs\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   
\"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Training an agent to walk using TRPO-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training an agent to walk using TRPO\\n\",\n    \"In this section, let's learn how to train the agent to walk using Trust Region Policy\\n\",\n    \"Optimization (TRPO). \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize\\n\",\n    \"from stable_baselines import TRPO\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a vectorized Humanoid environment using DummyVecEnv:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: gym.make(\\\"Humanoid-v2\\\")])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Normalize the states (observations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = VecNormalize(env, norm_obs=True, norm_reward=False,\\n\",\n    \"                   clip_obs=10.)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = TRPO(MlpPolicy, env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    
\"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"for i in range(50000):\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    state, rewards, dones, info = env.step(action)\\n\",\n    \"    env.render()\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Save the whole code used in this section in a python file called trpo.py and then open\\n\",\n    \"terminal and run the file:\\n\",\n    \"\\n\",\n    \"python trpo.py\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Recoding the video\\n\",\n    \"\\n\",\n    \"In the previous section, we trained our agent to learn to walk using TRPO. Can we also\\n\",\n    \"record the video of our trained agent? Yes! With stable baselines, we can easily record a\\n\",\n    \"video of our agent using the VecVideoRecorder module.\\n\",\n    \"Note that to record the video, we need the ffmpeg package installed in our machine. 
If it is\\n\",\n    \"not installed then install that using the following set of commands:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"sudo add-apt-repository ppa:mc3man/trusty-media\\n\",\n    \"\\n\",\n    \"sudo apt-get update\\n\",\n    \"\\n\",\n    \"sudo apt-get dist-upgrade\\n\",\n    \"\\n\",\n    \"sudo apt-get install ffmpeg\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's import the VecVideoRecorder module:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.vec_env import VecVideoRecorder\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define a function called record_video for recoding the video:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def record_video(env_name, agent, video_length=500, prefix='', video_folder='videos/'):\\n\",\n    \"    \\n\",\n    \"    #create the environment\\n\",\n    \"    env = DummyVecEnv([lambda: gym.make(env_name)])\\n\",\n    \"    \\n\",\n    \"    #instantiate the video recorder\\n\",\n    \"    env = VecVideoRecorder(env, video_folder=video_folder,\\n\",\n    \"        record_video_trigger=lambda step: step == 0, video_length=video_length, name_prefix=prefix)\\n\",\n    \"\\n\",\n    \"    #select actions in the environment using our trained agent where the number of time steps is\\n\",\n    \"    #set to video length:\\n\",\n    \"    state = env.reset()\\n\",\n    \"    for _ in range(video_length):\\n\",\n    \"        action, _ = agent.predict(state)\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        state = next_state\\n\",\n    \"\\n\",\n    \"    env.close()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   
\"source\": [\n    \"That's it! Now, let's call our record_video function. Note that we are passing the\\n\",\n    \"environment name, our trained agent, length of the video and the name of our video file: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"record_video('Humanoid-v2', agent, video_length=500, prefix='Humanoid_walk_TRPO')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we will have a new file called Humanoid_walk_TRPO-step-0-to-step-500.mp4 in\\n\",\n    \"the folder videos.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Training cheetah bot to run using PPO-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training cheetah bot to run using PPO\\n\",\n    \"In this section, learn how to train the 2D cheetah bot to run using PPO. First, import the\\n\",\n    \"necessary libraries: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize\\n\",\n    \"from stable_baselines import PPO2\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a vectorized environment using DummyVecEnv:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: gym.make(\\\"HalfCheetah-v2\\\")])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Normalize the states (observations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = VecNormalize(env,norm_obs=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = PPO2(MlpPolicy, env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"for i in range(50000):\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    state, rewards, dones, info = env.step(action)\\n\",\n    \"    env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Save the whole code used in this section in a python file called ppo.py and then open\\n\",\n    \"terminal and run the file:\\n\",\n    \"\\n\",\n    \"python ppo.py\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Making a GIF of a trained agent\\n\",\n    \"\\n\",\n    \"In the previous section, we learned how to train the cheetah bot to run using PPO. Can we\\n\",\n    \"also create a GIF file of our trained agent? Yes! 
Let's see how to do that.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"First, import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import imageio\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the list for storing images:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"images = []\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the state by resetting the environment. Note that agent here refers to the agent we trained in\\n\",\n    \"the previous section:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = agent.env.reset()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Render the environment and get the image:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"img = agent.env.render(mode='rgb_array')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For every step in the environment, save the image:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(350):\\n\",\n    \"    images.append(img)\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = agent.env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    img = agent.env.render(mode='rgb_array')\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    
\"Create the GIF file as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"\\n\",\n    \"imageio.mimsave('HalfCheetah.gif', [np.array(img) for i, img in enumerate(images) if i%2 == 0], fps=29)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we will have a new file called HalfCheetah.gif.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/.ipynb_checkpoints/Untitled-checkpoint.ipynb",
    "content": "{\n \"cells\": [],\n \"metadata\": {},\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.04. Playing Atari games with DQN and its variants.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Playing Atari games with DQN\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now, let's learn how to create a deep Q network to play Atari games with stable baselines.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary modules:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines import DQN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since we are dealing with Atari games we can use a convolutional neural network instead\\n\",\n    \"of a vanilla neural network. So, we use `CnnPolicy`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.deepq.policies import CnnPolicy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We learned that we preprocess the game screen before feeding it to the agent. With\\n\",\n    \"baselines, we don't have to preprocess manually, instead, we can make use of `make_atari`\\n\",\n    \"module which takes care of preprocessing the game screen:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.atari_wrappers import make_atari\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create an Atari game environment. 
Let's create the Ice Hockey game\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = make_atari('IceHockeyNoFrameskip-v4')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN(CnnPolicy, env, verbose=1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.deepq.dqn.DQN at 0x7f63b8095198>\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training the agent, we can have a look at how our trained agent performs in the\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": 
\"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.05. Implementing DQN variants.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Implementing DQN variants\\n\",\n    \"\\n\",\n    \"We just learned how to implement DQN using stable baselines. Now, let's see how to\\n\",\n    \"implement the variants of DQN such as double DQN, DQN with prioritized experience\\n\",\n    \"replay and dueling DQN. Implementing DQN variants is very simple with the baselines\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines import DQN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since we are dealing with Atari games we can use a convolutional neural network instead\\n\",\n    \"of a vanilla neural network. So, we use `CnnPolicy`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.deepq.policies import CnnPolicy\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We learned that we preprocess the game screen before feeding it to the agent. With\\n\",\n    \"baselines, we don't have to preprocess manually, instead, we can make use of `make_atari`\\n\",\n    \"module which takes care of preprocessing the game screen:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.atari_wrappers import make_atari\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's create an Atari game environment. 
Let's create the Ice Hockey game\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = make_atari('IceHockeyNoFrameskip-v4')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, we define our keyword arguments as shown below: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"kwargs = {\\\"double_q\\\": True, \\\"prioritized_replay\\\": True, \\\"policy_kwargs\\\": dict(dueling=True)}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, while instantiating our agent, we just need to pass the keyword arguments: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DQN(CnnPolicy, env, verbose=1, **kwargs)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.deepq.dqn.DQN at 0x7efe48132080>\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"That's it! Now we have the dueling double DQN with prioritized experience replay. 
\\n\",\n    \"\\n\",\n    \"After training the agent, we can have a look at how our trained agent performs in the\\n\",\n    \"environment:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.06. Lunar Lander using A2C.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Lunar Lander using A2C\\n\",\n    \"\\n\",\n    \"Let's learn how to implement A2C with stable baselines for the lunar landing task. In the\\n\",\n    \"lunar lander environment, our agent drives the space vehicle and the goal of the agent is to\\n\",\n    \"land correctly on the landing pad. If our agent (lander) lands away from the landing pad,\\n\",\n    \"then it loses the reward and the episode will get terminated if the agent crashes or comes to\\n\",\n    \"rest. The action space of the environment includes four discrete actions which are do\\n\",\n    \"nothing, a fire left orientation engine, fire main engine, and fire right orientation engine.\\n\",\n    \"Now, Let's see how to train the agent using A2C to correctly land on the landing pad.\\n\",\n    \"\\n\",\n    \"First, let's import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines import A2C\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the lunar lander environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('LunarLander-v2')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's use the dummy vectorized environment, we learned that in the dummy vectorized\\n\",\n    \"environment, we run 
each environment in the same process:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: env])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = A2C(MlpPolicy, env, ent_coef=0.1, verbose=0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.a2c.a2c.A2C at 0x7fbb000ca518>\"\n      ]\n     },\n     \"execution_count\": 5,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = 
next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.07. Creating a custom network.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a custom network\\n\",\n    \"In the previous section, we learned how to create A2C using stable baselines. Instead of\\n\",\n    \"using the default network, can we customize the network architecture? Yes! With a stable\\n\",\n    \"baseline, we can also use our own custom architecture. Let's see how to do that. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"warnings.filterwarnings('ignore')\\n\",\n    \"\\n\",\n    \"import gym\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines.common.policies import FeedForwardPolicy\\n\",\n    \"from stable_baselines import A2C\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the lunar lander environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('LunarLander-v2')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's use the dummy vectorized environment, we learned that in the dummy vectorized\\n\",\n    \"environment, we run each environment in the same process:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: env])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can define our custom policy (custom network) as shown below. 
As we can\\n\",\n    \"observe in the below code, we are passing `net_arch=[dict(pi=[128, 128, 128]`,\\n\",\n    \"`vf=[128, 128, 128])]`, which implies our network architecture. pi implies the\\n\",\n    \"architecture of the policy network and vf implies the architecture of value network: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class CustomPolicy(FeedForwardPolicy):\\n\",\n    \"    def __init__(self, *args, **kargs):\\n\",\n    \"        super(CustomPolicy, self).__init__(*args, **kargs,\\n\",\n    \"                                           net_arch=[dict(pi=[128, 128, 128], vf=[128, 128, 128])],\\n\",\n    \"                                           feature_extraction=\\\"mlp\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = A2C(CustomPolicy, env, ent_coef=0.1, verbose=0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent as usual:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.a2c.a2c.A2C at 0x7fb2cc0e0710>\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    
\"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.08. Swinging up a pendulum using DDPG.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Swinging up a pendulum using DDPG\\n\",\n    \"Let's learn how to implement the DDPG for the swinging up pendulum task using stable\\n\",\n    \"baselines. First, let's import the necessary libraries\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"from stable_baselines.ddpg.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.evaluation import evaluate_policy\\n\",\n    \"from stable_baselines.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec\\n\",\n    \"from stable_baselines import DDPG\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the pendlum environment using gym:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = gym.make('Pendulum-v0')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Get the number of actions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"n_actions = env.action_space.shape[-1]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We know that in DDPG, instead of selecting the action directly, we add some noise using the Ornstein-Uhlenbeck process to ensure exploration. 
So, we create the action noise as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DDPG(MlpPolicy, env, verbose=1, param_noise=None, action_noise=action_noise)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent as usual:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.ddpg.ddpg.DDPG at 0x7f4f1801f7f0>\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, we can evaluate our agent by looking at the mean rewards:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"mean_reward, n_steps = evaluate_policy(agent, agent.get_env(),\\n\",\n    \"n_eval_episodes=10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    
action, _states = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training the agent, we can also look at how our trained agent swings up the\\n\",\n    \"pendulum by rendering the environment. Can we also look at the computational graph of\\n\",\n    \"DDPG? Yes! In the next section, we will learn how to do that.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Viewing the computational graph in TensorBoard\\n\",\n    \"\\n\",\n    \"With stables baselines, it is easier to view the computational graph of our model in\\n\",\n    \"TensorBoard. In order to that, we just need to pass the directory where we need to store our\\n\",\n    \"log files while instantiating the agent as shown below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = DDPG(MlpPolicy, env, verbose=1, param_noise=None,action_noise=action_noise, tensorboard_log=\\\"logs\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Then, we can train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING:tensorflow:From /home/sudharsan/anaconda3/envs/universe/lib/python3.6/site-packages/stable_baselines/common/base_class.py:1082: The name tf.summary.FileWriter is deprecated. 
Please use tf.compat.v1.summary.FileWriter instead.\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<stable_baselines.ddpg.ddpg.DDPG at 0x7f4f10154898>\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After training, open the terminal and type the following command to run the TensorBoard:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"`tensorboard --logdir logs`\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.09. Training an agent to walk using TRPO.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training an agent to walk using TRPO\\n\",\n    \"In this section, let's learn how to train the agent to walk using Trust Region Policy\\n\",\n    \"Optimization (TRPO). \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize\\n\",\n    \"from stable_baselines import TRPO\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a vectorized Humanoid environment using `DummyVecEnv`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: gym.make(\\\"Humanoid-v2\\\")])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Normalize the states (observations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = VecNormalize(env, norm_obs=True, norm_reward=False,\\n\",\n    \"                   clip_obs=10.)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = TRPO(MlpPolicy, env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    
\"agent.learn(total_timesteps=25000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Save the whole code used in this section in a python file called trpo.py and then open\\n\",\n    \"terminal and run the file:\\n\",\n    \"\\n\",\n    \"`python trpo.py`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Recording the video\\n\",\n    \"\\n\",\n    \"In the previous section, we trained our agent to learn to walk using TRPO. Can we also\\n\",\n    \"record the video of our trained agent? Yes! With stable baselines, we can easily record a\\n\",\n    \"video of our agent using the VecVideoRecorder module.\\n\",\n    \"Note that to record the video, we need the ffmpeg package installed in our machine. 
If it is\\n\",\n    \"not installed then install that using the following set of commands:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"`sudo add-apt-repository ppa:mc3man/trusty-media\\n\",\n    \" sudo apt-get update\\n\",\n    \" sudo apt-get dist-upgrade\\n\",\n    \" sudo apt-get install ffmpeg`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's import the `VecVideoRecorder` module:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from stable_baselines.common.vec_env import VecVideoRecorder\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Define a function called `record_video` for recording the video:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def record_video(env_name, agent, video_length=500, prefix='', video_folder='videos/'):\\n\",\n    \"    \\n\",\n    \"    #create the environment\\n\",\n    \"    env = DummyVecEnv([lambda: gym.make(env_name)])\\n\",\n    \"    \\n\",\n    \"    #instantiate the video recorder\\n\",\n    \"    env = VecVideoRecorder(env, video_folder=video_folder,\\n\",\n    \"        record_video_trigger=lambda step: step == 0, video_length=video_length, name_prefix=prefix)\\n\",\n    \"\\n\",\n    \"    #select actions in the environment using our trained agent where the number of time steps is\\n\",\n    \"    #set to video length:\\n\",\n    \"    state = env.reset()\\n\",\n    \"    for _ in range(video_length):\\n\",\n    \"        action, _ = agent.predict(state)\\n\",\n    \"        next_state, reward, done, info = env.step(action)\\n\",\n    \"        state = next_state\\n\",\n    \"\\n\",\n    \"    env.close()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"That's it! 
Now, let's call our `record_video` function. Note that we are passing the\\n\",\n    \"environment name, our trained agent, the length of the video, and the name prefix of our video file: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"record_video('Humanoid-v2', agent, video_length=500, prefix='Humanoid_walk_TRPO')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we will have a new file called `Humanoid_walk_TRPO-step-0-to-step-500.mp4` in\\n\",\n    \"the `videos` folder.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/16.10. Training cheetah bot to run using PPO.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Training cheetah bot to run using PPO\\n\",\n    \"In this section, learn how to train the 2D cheetah bot to run using PPO. First, import the\\n\",\n    \"necessary libraries: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gym\\n\",\n    \"\\n\",\n    \"from stable_baselines.common.policies import MlpPolicy\\n\",\n    \"from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize\\n\",\n    \"from stable_baselines import PPO2\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a vectorized environment using `DummyVecEnv`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = DummyVecEnv([lambda: gym.make(\\\"HalfCheetah-v2\\\")])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Normalize the states (observations):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"env = VecNormalize(env,norm_obs=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instantiate the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent = PPO2(MlpPolicy, env)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can train the agent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"agent.learn(total_timesteps=250000)\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also have a look at how our trained agent performs in the environment:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = env.reset()\\n\",\n    \"while True:\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    env.render()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Save the whole code used in this section in a python file called ppo.py and then open\\n\",\n    \"terminal and run the file:\\n\",\n    \"\\n\",\n    \"`python ppo.py`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Making a GIF of a trained agent\\n\",\n    \"\\n\",\n    \"In the previous section, we learned how to train the cheetah bot to run using PPO. Can we\\n\",\n    \"also create a GIF file of our trained agent? Yes! 
Let's see how to do that.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"First, import the necessary libraries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import imageio\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the list for storing images:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"images = []\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Initialize the state by resetting the environment; here, the agent is the one we trained in\\n\",\n    \"the previous section:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"state = agent.env.reset()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Render the environment and get the image:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"img = agent.env.render(mode='rgb_array')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For every step in the environment, save the image:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(500):\\n\",\n    \"    images.append(img)\\n\",\n    \"    action, _ = agent.predict(state)\\n\",\n    \"    next_state, reward, done, info = agent.env.step(action)\\n\",\n    \"    state = next_state\\n\",\n    \"    img = agent.env.render(mode='rgb_array')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create 
the GIF file as:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"imageio.mimsave('HalfCheetah.gif', [np.array(img) for i, img in enumerate(images) if i%2 == 0], fps=29)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we will have a new file called `HalfCheetah.gif`.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "16. Deep Reinforcement Learning with Stable Baselines/README.md",
    "content": "\n# 16. Deep Reinforcement Learning with Stable Baselines\n\n\n* 16.1. Creating our First Agent with Baseline\n   * 16.1.1. Evaluating the Trained Agent\n   * 16.1.2. Storing and Loading the Trained Agent\n   * 16.1.3. Viewing the Trained Agent\n   * 16.1.4. Putting it all Together\n* 16.2. Multiprocessing with Vectorized Environments\n   * 16.2.1. SubprocVecEnv\n   * 16.2.2. DummyVecEnv\n* 16.3. Integrating the Custom Environments\n* 16.4. Playing Atari Games with DQN and its Variants\n   * 16.4.1. Implementing DQN Variants\n* 16.5. Lunar Lander using A2C\n   * 16.5.1. Creating a Custom Network\n* 16.6. Swinging up a Pendulum using DDPG\n   * 16.6.1. Viewing the Computational Graph in TensorBoard\n* 16.7. Training an Agent to Walk using TRPO\n   * 16.7.1. Installing MuJoCo Environment\n   * 16.7.2. Implementing TRPO\n   * 16.7.3. Recoding the video\n* 16.8. Training Cheetah Bot to Run using PPO\n   * 16.8.1. Making a GIF of a Trained Agent\n   * 16.8.2. Implementing GAIL"
  },
  {
    "path": "17. Reinforcement Learning Frontiers/.ipynb_checkpoints/15.01. Meta Reinforcement Learning-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Meta Reinforcement Learning \\n\",\n    \"\\n\",\n    \"In order to understand how meta reinforcement learning works, first let's understand meta learning. \\n\",\n    \"\\n\",\n    \"Meta learning is one of the most promising and trending research areas in the field of artificial intelligence. It is believed to be a stepping stone for attaining Artificial General Intelligence (AGI). Why do we need meta learning? But first, what is meta learning? To answer these questions, let us revisit how deep learning works.\\n\",\n    \"\\n\",\n    \"We know that in deep learning, we train the deep neural network to perform any task. But the problem with the deep neural networks is that we need to have a large training set to train our network and it will fail to learn when we have very few data points. \\n\",\n    \"\\n\",\n    \"Let's say we trained a deep learning model to perform task A. Suppose, we have a new task B, that is closely related to task A. Although the task B is closely related task A, we can't use the model which we trained for task A to perform task B. We need to train a new model from scratch for task B. So, for each task, we need to train a new model from scratch although they might be related. But is this really a true AI? Not really. How do we humans learn? We generalize our learning to multiple concepts and learn from there. But current learning algorithms master only one task. So, here is where meta learning comes in.\\n\",\n    \"\\n\",\n    \"Meta learning produces a versatile AI model that can learn to perform various tasks without having to train them from scratch. We train our meta learning model on various related tasks with few data points, so for a new related task, it can make use of the learning obtained from the previous tasks and we don't have to train them from scratch. 
Many researchers and scientists believe that meta learning can get us closer to achieving AGI. Learning to learn is the key focus of meta learning.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"__We will understand how exactly meta learning works by learning a popular meta learning algorithm called model agnostic meta learning in the next section.__\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.9\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "17. Reinforcement Learning Frontiers/README.md",
    "content": "# 17. Reinforcement Learning Frontiers\n* 17.1. Meta Reinforcement Learning\n* 17.2. Model Agnostic Meta Learning\n* 17.3. Understanding MAML\n* 17.4. MAML in the Supervised Learning Setting\n* 17.5. Algorithm - MAML in Supervised Learning\n* 17.6. MAML in the Reinforcement Learning Setting\n* 17.7. Algorithm - MAML in Reinforcement Learning\n* 17.8. Hierarchical Reinforcement Learning\n* 17.9. MAXQ Value Function Decomposition\n* 17.10. Imagination Augmented Agents\n"
  },
  {
    "path": "README.md",
    "content": "# [Deep Reinforcement Learning With Python](https://www.amazon.com/dp/1839210680/ref=cm_sw_r_tw_dp_x_avRDFb99EVTQ)\n\n###  Master classic RL, deep RL, distributional RL, inverse RL, and more using OpenAI Gym and TensorFlow with extensive Math \n\n## About the book\n<a target=\"_blank\" href=\"https://www.amazon.com/dp/1839210680/ref=cm_sw_r_tw_dp_x_avRDFb99EVTQ\">\n  <img src=\"./images/2.jpg\" alt=\"Book Cover\" width=\"300\" align=\"left\"/>\n \n</a>With significant enhancement in the quality and quantity of algorithms in recent\nyears, this second edition of Hands-On Reinforcement Learning with Python has been completely \nrevamped into an example-rich guide to learning state-of-the-art reinforcement\nlearning (RL) and deep RL algorithms with TensorFlow and the OpenAI Gym\ntoolkit.\n\nIn addition to exploring RL basics and foundational concepts such as the Bellman\nequation, Markov decision processes, and dynamic programming, this second\nedition dives deep into the full spectrum of value-based, policy-based, and actor-\ncritic RL methods with detailed math. It explores state-of-the-art algorithms such as DQN, TRPO, PPO\nand ACKTR, DDPG, TD3, and SAC in depth, demystifying the underlying math and\ndemonstrating implementations through simple code examples.\n\nThe book has several new chapters dedicated to new RL techniques including\ndistributional RL, imitation learning, inverse RL, and meta RL. You will learn\nto leverage Stable Baselines, an improvement of OpenAI's baseline library, to\nimplement popular RL algorithms effortlessly. 
The book concludes with an overview\nof promising approaches such as meta-learning and imagination augmented agents\nin research.\n## Get the book \n<div>\n<a target=\"_blank\" href=\"https://www.oreilly.com/library/view/deep-reinforcement-learning/9781839210686/\">\n  <img src=\"./images/Oreilly_safari_logo.png\" alt=\"Oreilly Safari\" hieght=150, width=150>\n</a>\n  \n<a target=\"_blank\" href=\"https://www.amazon.com/gp/product/B08HSHV72N/ref=dbs_a_def_rwt_bibl_vppi_i2\">\n  <img src=\"./images/amazon_logo.jpg\" alt=\"Amazon\" >\n</a>\n\n<a target=\"_blank\" href=\"https://www.packtpub.com/product/deep-reinforcement-learning-with-python-second-edition/9781839210686\">\n  <img src=\"./images/packt_logo.jpeg\" alt=\"Packt\" hieght=150, width=150 >\n</a>\n\n<a target=\"_blank\" href=\"https://www.google.co.in/books/edition/Deep_Reinforcement_Learning_with_Python/dFkAEAAAQBAJ?hl=en&gbpv=0&kptab=overview\">\n  <img src=\"./images/googlebooks_logo.png\" alt=\"Google Books\" \n</a>\n\n<a target=\"_blank\" href=\"https://play.google.com/store/books/details/Sudharsan_Ravichandiran_Deep_Reinforcement_Learnin?id=dFkAEAAAQBAJ\">\n  <img src=\"./images/googleplay_logo.png\" alt=\"Google Play\" >\n</a>\n<br>\n</div>\n\n<br>\n\n# Table of Contents\nDownload the detailed and complete table of contents from [here.](https://github.com/sudharsan13296/Deep-Reinforcement-Learning-With-Python/raw/master/pdf/table%20of%20contents.pdf)\n\n## [1. Fundamentals of Reinforcement Learning](01.%20Fundamentals%20of%20Reinforcement%20Learning)\n\n* [1.1. Key Elements of Reinforcement Learning](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.01.%20Key%20Elements%20of%20Reinforcement%20Learning%20.ipynb)\n* [1.2. Basic Idea of Reinforcement Learning](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.02.%20Basic%20Idea%20of%20Reinforcement%20Learning.ipynb)\n* [1.3. 
Reinforcement Learning Algorithm](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.03.%20Reinforcement%20Learning%20Algorithm.ipynb)\n* [1.4. RL Agent in the Grid World](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.04.%20RL%20agent%20in%20the%20Grid%20World%20.ipynb)\n* [1.5. How RL differs from other ML paradigms?](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.05.%20How%20RL%20differs%20from%20other%20ML%20paradigms%3F.ipynb)\n* [1.6. Markov Decision Processes](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.06.%20Markov%20Decision%20Processes.ipynb)\n* [1.7. Action Space, Policy, Episode and Horizon](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.07.%20Action%20space%2C%20Policy%2C%20Episode%20and%20Horizon.ipynb)\n* [1.8. Return, Discount Factor and Math Essentials](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.08.%20%20Return%2C%20Discount%20Factor%20and%20Math%20Essentials.ipynb)\n* [1.9. Value Function and Q Function](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.09%20Value%20function%20and%20Q%20function.ipynb)\n* [1.10. Model-Based and Model-Free Learning](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.10.%20Model-Based%20and%20Model-Free%20Learning%20.ipynb)\n* [1.11. Different Types of Environments](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.11.%20Different%20Types%20of%20Environments.ipynb)\n* [1.12. Applications of Reinforcement Learning](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.12.%20Applications%20of%20Reinforcement%20Learning.ipynb)\n* [1.13. Reinforcement Learning Glossary](01.%20Fundamentals%20of%20Reinforcement%20Learning/1.13.%20Reinforcement%20Learning%20Glossary.ipynb)\n\n\n### [2. A Guide to the Gym Toolkit](02.%20A%20Guide%20to%20the%20Gym%20Toolkit)\n* 2.1. Setting Up our Machine\n* [2.2. Creating our First Gym Environment](02.%20A%20Guide%20to%20the%20Gym%20Toolkit/2.02.%20%20Creating%20our%20First%20Gym%20Environment.ipynb)\n* 2.3. Generating an episode\n* 2.4. 
Classic Control Environments\n* [2.5. Cart Pole Balancing with Random Policy](02.%20A%20Guide%20to%20the%20Gym%20Toolkit/2.05.%20Cart%20Pole%20Balancing%20with%20Random%20Policy.ipynb)\n* 2.6. Atari Game Environments\n* 2.7. Agent Playing the Tennis Game\n* 2.8. Recording the Game\n* 2.9. Other environments\n* 2.10. Environment Synopsis\n\n### [3. Bellman Equation and Dynamic Programming](03.%20Bellman%20Equation%20and%20Dynamic%20Programming)\n* 3.1. The Bellman Equation\n* 3.2. Bellman Optimality Equation\n* 3.3. Relation Between Value and Q Function\n* 3.4. Dynamic Programming\n* 3.5. Value Iteration\n* [3.6. Solving the Frozen Lake Problem with Value Iteration](03.%20Bellman%20Equation%20and%20Dynamic%20Programming/3.06.%20Solving%20the%20Frozen%20Lake%20Problem%20with%20Value%20Iteration.ipynb)\n* 3.7. Policy iteration\n* [3.8. Solving the Frozen Lake Problem with Policy Iteration](03.%20Bellman%20Equation%20and%20Dynamic%20Programming/3.08.%20Solving%20the%20Frozen%20Lake%20Problem%20with%20Policy%20Iteration.ipynb)\n* 3.9. Is DP Applicable to all Environments?\n\n### [4. Monte Carlo Methods](04.%20Monte%20Carlo%20Methods)\n* 4.1. Understanding the Monte Carlo Method\n* 4.2. Prediction and Control Tasks\n* 4.3. Monte Carlo Prediction\n* 4.4. Understanding the BlackJack Game\n* 4.5. Every-visit MC Prediction with Blackjack Game\n* 4.6. First-visit MC Prediction with Blackjack Game\n* 4.7. Incremental Mean Updates\n* 4.8. MC Prediction (Q Function)\n* 4.9. Monte Carlo Control\n* 4.10. On-Policy Monte Carlo Control\n* 4.11. Monte Carlo Exploring Starts\n* 4.12. Monte Carlo with Epsilon-Greedy Policy\n* [4.13. Implementing On-Policy MC Control](04.%20Monte%20Carlo%20Methods/4.13.%20Implementing%20On-Policy%20MC%20Control.ipynb)\n* 4.14. Off-Policy Monte Carlo Control\n* 4.15. Is MC Method Applicable to all Tasks?\n\n\n### [5. Understanding Temporal Difference Learning](05.%20Understanding%20Temporal%20Difference%20Learning)\n* 5.1. TD Learning\n* 5.2. 
TD Prediction\n* [5.3. Predicting the Value of States in a Frozen Lake Environment](05.%20Understanding%20Temporal%20Difference%20Learning/5.03.%20Predicting%20the%20Value%20of%20States%20in%20a%20Frozen%20Lake%20Environment.ipynb)\n* 5.4. TD Control\n* 5.5. On-Policy TD Control - SARSA\n* [5.6. Computing Optimal Policy using SARSA](05.%20Understanding%20Temporal%20Difference%20Learning/5.06.%20Computing%20Optimal%20Policy%20using%20SARSA.ipynb)\n* 5.7. Off-Policy TD Control - Q Learning\n* [5.8. Computing the Optimal Policy using Q Learning](05.%20Understanding%20Temporal%20Difference%20Learning/5.08.%20Computing%20the%20Optimal%20Policy%20using%20Q%20Learning.ipynb)\n* 5.9. The Difference Between Q Learning and SARSA\n* 5.10. Comparing DP, MC, and TD Methods\n\n\n### [6. Case Study: The MAB Problem](06.%20Case%20Study:%20The%20MAB%20Problem)\n* 6.1. The MAB Problem\n* 6.2. Creating a Bandit in the Gym\n* [6.3. Epsilon-Greedy](06.%20Case%20Study:%20The%20MAB%20Problem/6.03.%20Epsilon-Greedy.ipynb)\n* [6.4. Implementing Epsilon-Greedy](06.%20Case%20Study:%20The%20MAB%20Problem/6.04.%20Implementing%20epsilon-greedy%20.ipynb)\n* 6.5. Softmax Exploration\n* [6.6. Implementing Softmax Exploration](06.%20Case%20Study:%20The%20MAB%20Problem/6.06.%20Implementing%20Softmax%20Exploration.ipynb)\n* 6.7. Upper Confidence Bound\n* [6.8. Implementing UCB](06.%20Case%20Study:%20The%20MAB%20Problem/6.08.%20Implementing%20UCB.ipynb)\n* 6.9. Thompson Sampling\n* [6.10. Implementing Thompson Sampling](06.%20Case%20Study:%20The%20MAB%20Problem/6.10.%20Implementing%20Thompson%20Sampling.ipynb)\n* 6.11. Applications of MAB\n* [6.12. Finding the Best Advertisement Banner using Bandits](06.%20Case%20Study:%20The%20MAB%20Problem/6.12.%20Finding%20the%20Best%20Advertisement%20Banner%20using%20Bandits.ipynb)\n* 6.13. Contextual Bandits\n\n### [7. Deep Learning Foundations](07.%20Deep%20learning%20foundations)\n\n* 7.1. Biological and artificial neurons\n* 7.2. ANN and its layers \n* 7.3. 
Exploring activation functions \n* 7.4. Forward and backward propagation in ANN\n* [7.5. Building neural network from scratch](07.%20Deep%20learning%20foundations/7.05%20Building%20Neural%20Network%20from%20scratch.ipynb)\n* 7.6. Recurrent neural networks \n* 7.7. LSTM-RNN\n* 7.8. Convolutional neural networks\n* 7.9. Generative adversarial networks \n\n### [8. Getting to Know TensorFlow](08.%20A%20primer%20on%20TensorFlow)\n\n* 8.1. What is TensorFlow?\n* 8.2. Understanding Computational Graphs and Sessions\n* 8.3. Variables, Constants, and Placeholders\n* 8.4. Introducing TensorBoard\n* [8.5. Handwritten digits classification using TensorFlow](08.%20A%20primer%20on%20TensorFlow/8.05%20Handwritten%20digits%20classification%20using%20TensorFlow.ipynb)\n* 8.6. Visualizing the Computational Graph in TensorBoard\n* 8.7. Introducing Eager Execution\n* [8.8. Math operations in TensorFlow](08.%20A%20primer%20on%20TensorFlow/8.08%20Math%20operations%20in%20TensorFlow.ipynb)\n* 8.9. TensorFlow 2.0 and Keras\n* [8.10. MNIST digits classification in TensorFlow 2.0](08.%20A%20primer%20on%20TensorFlow/8.10%20MNIST%20digits%20classification%20in%20TensorFlow%202.0.ipynb)\n\n\n### [9. Deep Q Network and its Variants](09.%20%20Deep%20Q%20Network%20and%20its%20Variants)\n\n* 9.1. What is Deep Q Network?\n* 9.2. Understanding DQN\n* [9.3. Playing Atari Games using DQN](09.%20%20Deep%20Q%20Network%20and%20its%20Variants/9.03.%20Playing%20Atari%20Games%20using%20DQN.ipynb)\n* 9.4. Double DQN\n* 9.5. DQN with Prioritized Experience Replay\n* 9.6. Dueling DQN\n* 9.7. Deep Recurrent Q Network\n\n### [10. Policy Gradient Method](10.%20Policy%20Gradient%20Method)\n* 10.1. Why Policy Based Methods?\n* 10.2. Policy Gradient Intuition\n* 10.3. Understanding the Policy Gradient\n* 10.4. Deriving the Policy Gradient\n* 10.5. Variance Reduction Methods\n* 10.6. Policy Gradient with Reward-to-go\n* [10.7. 
Cart Pole Balancing with Policy Gradient](10.%20Policy%20Gradient%20Method/10.07.%20Cart%20Pole%20Balancing%20with%20Policy%20Gradient.ipynb)\n* 10.8. Policy Gradient with Baseline\n\n\n### [11. Actor Critic Methods - A2C and A3C](11.%20Actor%20Critic%20Methods%20-%20A2C%20and%20A3C)\n* 11.1. Overview of the Actor Critic Method\n* 11.2. Understanding the Actor Critic Method\n* 11.3. Advantage Actor Critic\n* 11.4. Asynchronous Advantage Actor Critic\n* [11.5. Mountain Car Climbing using A3C](11.%20Actor%20Critic%20Methods%20-%20A2C%20and%20A3C/11.05.%20Mountain%20Car%20Climbing%20using%20A3C.ipynb)\n* 11.6. A2C Revisited\n\n### [12. Learning DDPG, TD3 and SAC](12.%20Learning%20DDPG%2C%20TD3%20and%20SAC)\n* 12.1. Deep Deterministic Policy Gradient\n* 12.2. Components of DDPG\n* 12.3. Putting it all together\n* 12.4. Algorithm - DDPG\n* [12.5. Swinging Up the Pendulum using DDPG](12.%20Learning%20DDPG%2C%20TD3%20and%20SAC/12.02.%20Swinging%20Up%20the%20Pendulum%20using%20DDPG%20.ipynb)\n* 12.6. Twin Delayed DDPG\n* 12.7. Components of TD3\n* 12.8. Putting it all together\n* 12.9. Algorithm - TD3\n* 12.10. Soft Actor Critic\n* 12.11. Components of SAC\n* 12.12. Putting it all together\n* 12.13. Algorithm - SAC\n\n### [13. TRPO, PPO and ACKTR Methods](13.%20TRPO%2C%20PPO%20and%20ACKTR%20Methods)\n* 13.1. Trust Region Policy Optimization\n* 13.2. Math Essentials\n* 13.3. Designing the TRPO Objective Function\n* 13.4. Solving the TRPO Objective Function\n* 13.5. Algorithm - TRPO\n* 13.6. Proximal Policy Optimization\n* 13.7. PPO with Clipped Objective\n* [13.9. Implementing PPO-Clipped Method](13.%20TRPO%2C%20PPO%20and%20ACKTR%20Methods/13.09.%20Implementing%20PPO-Clipped%20Method.ipynb)\n* 13.10. PPO with Penalized Objective\n* 13.11. Actor Critic using Kronecker Factored Trust Region\n* 13.12. Math Essentials\n* 13.13. Kronecker-Factored Approximate Curvature (K-FAC)\n* 13.14. K-FAC in Actor Critic\n\n### [14. 
Distributional Reinforcement Learning](14.%20Distributional%20Reinforcement%20Learning)\n* 14.1. Why Distributional Reinforcement Learning?\n* 14.2. Categorical DQN\n* [14.3. Playing Atari games using Categorical DQN](14.%20Distributional%20Reinforcement%20Learning/14.03.%20Playing%20Atari%20games%20using%20Categorical%20DQN.ipynb)\n* 14.4. Quantile Regression DQN\n* 14.5. Math Essentials\n* 14.6. Understanding QR-DQN\n* 14.7. Distributed Distributional DDPG\n\n### [15. Imitation Learning and Inverse RL](15.%20Imitation%20Learning%20and%20Inverse%20RL)\n* 15.1. Supervised Imitation Learning\n* 15.2. DAgger\n* 15.3. Deep Q Learning from Demonstrations\n* 15.4. Inverse Reinforcement Learning\n* 15.5. Maximum Entropy IRL\n* 15.6. Generative Adversarial Imitation Learning\n\n\n### [16. Deep Reinforcement Learning with Stable Baselines](16.%20Deep%20Reinforcement%20Learning%20with%20Stable%20Baselines)\n\n* [16.1. Creating our First Agent with Stable Baselines](16.%20Deep%20Reinforcement%20Learning%20with%20Stable%20Baselines/16.01.%20Creating%20our%20First%20Agent%20with%20Stable%20Baseline.ipynb)\n* 16.2. Multiprocessing with Vectorized Environments\n* 16.3. Integrating the Custom Environments\n* [16.4. Playing Atari Games with DQN](16.%20Deep%20Reinforcement%20Learning%20with%20Stable%20Baselines/16.04.%20Playing%20Atari%20games%20with%20DQN%20and%20its%20variants.ipynb) \n* [16.5. Implementing DQN variants](16.%20Deep%20Reinforcement%20Learning%20with%20Stable%20Baselines/16.05.%20Implementing%20DQN%20variants.ipynb)\n* [16.6. Lunar Lander using A2C](16.%20Deep%20Reinforcement%20Learning%20with%20Stable%20Baselines/16.06.%20Lunar%20Lander%20using%20A2C.ipynb)\n* [16.7. Creating a custom network](16.%20Deep%20Reinforcement%20Learning%20with%20Stable%20Baselines/16.07.%20Creating%20a%20custom%20network.ipynb) \n* [16.8. 
Swinging up a Pendulum using DDPG](16.%20Deep%20Reinforcement%20Learning%20with%20Stable%20Baselines/16.08.%20Swinging%20up%20a%20pendulum%20using%20DDPG.ipynb)\n* [16.9. Training an Agent to Walk using TRPO](16.%20Deep%20Reinforcement%20Learning%20with%20Stable%20Baselines/16.09.%20Training%20an%20agent%20to%20walk%20using%20TRPO.ipynb)\n* [16.10. Training Cheetah Bot to Run using PPO](16.%20Deep%20Reinforcement%20Learning%20with%20Stable%20Baselines/16.10.%20Training%20cheetah%20bot%20to%20run%20using%20PPO.ipynb)\n\n\n### [17. Reinforcement Learning Frontiers](17.%20Reinforcement%20Learning%20Frontiers)\n* 17.1. Meta Reinforcement Learning\n* 17.2. Model Agnostic Meta Learning\n* 17.3. Understanding MAML\n* 17.4. MAML in the Supervised Learning Setting\n* 17.5. Algorithm - MAML in Supervised Learning\n* 17.6. MAML in the Reinforcement Learning Setting\n* 17.7. Algorithm - MAML in Reinforcement Learning\n* 17.8. Hierarchical Reinforcement Learning\n* 17.9. MAXQ Value Function Decomposition\n* 17.10. Imagination Augmented Agents\n"
  }
]